AI Training | 2026-03-12 | 20 min read

Choosing the Right GPU for LLM Training in 2026: A Practitioner's Guide

After overseeing dozens of LLM training deployments, here is what actually determines training speed, cost, and reliability — and which GPU fits each model scale.

Selecting GPUs for LLM training is one of those decisions that looks straightforward from a distance — just pick the fastest GPU you can afford, right? — but becomes surprisingly complex once you start accounting for memory constraints, interconnect topology, software compatibility, and the cascading cost implications of each choice.

I have been involved in GPU procurement for LLM training since the GPT-3 era, when the H100 did not yet exist and teams were stitching together clusters of A100s with varying degrees of success. The landscape has changed enormously since then, but many of the fundamental principles remain the same. This guide distills what I have learned into actionable advice for each model scale.

The Three Bottlenecks That Define LLM Training Performance

Every LLM training run is constrained by one of three bottlenecks, and the dominant bottleneck shifts depending on model size, batch size, and cluster configuration. Understanding which bottleneck you are hitting is more important than understanding the GPU spec sheet.

Bottleneck 1: Compute Throughput

This is the bottleneck most people think of first — raw TFLOPS. During the forward and backward passes, matrix multiplications dominate compute time, and these operations scale directly with FP16/BF16 throughput. A GPU with 2x the TFLOPS will process tokens roughly 2x faster, assuming the other bottlenecks are not dominant.

In practice, compute is the bottleneck for small to medium models (7B-34B) on high-bandwidth GPUs (H200, B200, B300). At this scale, the model fits comfortably in memory, communication overhead is manageable, and the GPU spends most of its time doing useful work.

Bottleneck 2: Memory Bandwidth

The attention mechanism in transformers reads large key and value tensors from HBM — during autoregressive generation, the full key-value cache is read on every decoding step. This operation is memory-bandwidth bound: the compute units sit idle waiting for data to arrive from memory. For inference, and for the attention computation during training at long sequence lengths, memory bandwidth often matters more than peak TFLOPS.

This is why the jump from H100 (3,350 GB/s) to H200 (4,800 GB/s) felt disproportionately impactful for many workloads — a 43% bandwidth increase translated to 30-40% faster attention computation, even though the TFLOPS number was identical. The B300 Ultra at 8,000 GB/s takes this further, effectively eliminating bandwidth as a bottleneck for all but the longest sequence lengths.
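
To make this concrete: for a purely bandwidth-bound operation, time is just bytes moved divided by bandwidth, so the H100-to-H200 speedup falls straight out of the spec numbers. A minimal sketch (the 40 GB tensor size is a hypothetical working set, not a measured figure):

```python
def stream_time_ms(bytes_moved: float, bandwidth_gb_s: float) -> float:
    """Time to move a tensor at a given HBM bandwidth, assuming a pure bandwidth bound."""
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3

# Hypothetical 40 GB of KV-cache traffic per step:
h100 = stream_time_ms(40e9, 3350)   # ~11.9 ms on H100
h200 = stream_time_ms(40e9, 4800)   # ~8.3 ms on H200
print(round(h100 / h200, 2))        # prints 1.43 -- the bandwidth ratio, independent of tensor size
```

The ratio is exactly the 43% bandwidth gap, which is why bandwidth-bound kernels track the spec sheet so closely.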

Bottleneck 3: Communication Overhead

Once you distribute a model across multiple GPUs — which is necessary for any model larger than about 30B parameters — every training step requires synchronizing data between GPUs. Gradient all-reduce, tensor parallel communication, pipeline stage handoffs — these operations add latency that scales with the number of GPUs and the volume of data being communicated.

At 64+ GPUs, communication overhead can consume 30-50% of total training time on poorly configured clusters. This is why interconnect bandwidth (NVLink within a node, InfiniBand between nodes) is not a nice-to-have — it is a primary determinant of training efficiency at scale.
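
To see where that overhead comes from, consider the standard ring all-reduce used for gradient synchronization: each GPU sends and receives roughly 2(N-1)/N times the gradient size on every step. A quick sketch (the 7B/64-GPU scenario is illustrative):

```python
def ring_allreduce_gb_per_gpu(grad_gb: float, n_gpus: int) -> float:
    """Per-GPU network traffic for a ring all-reduce: 2 * (N-1)/N * gradient size."""
    return 2 * (n_gpus - 1) / n_gpus * grad_gb

# BF16 gradients for a 7B model (~14 GB), synchronized across 64 GPUs:
print(round(ring_allreduce_gb_per_gpu(14, 64), 1))  # prints 27.6 -- GB per GPU, every single step
```

Nearly 28 GB of traffic per GPU per step is why a 200Gbps vs 400Gbps interconnect shows up directly in step time.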

GPU Recommendations by Model Scale

7B - 13B Parameters: Maximum Flexibility

At this scale, almost any modern GPU works. A 7B model in BF16 requires roughly 14GB for weights; add gradients, optimizer states, and activations, and a single training step needs 40-60GB per GPU depending on batch size, sequence length, and whether optimizer state is sharded across the node or stored in 8-bit. An A100 80GB or any larger GPU handles this comfortably.

The question is not "can this GPU train a 7B model?" — they all can — but "what is the most cost-effective way to do it?"

For one-off training runs or experimentation, cloud is the clear winner. Lambda Labs at $2.49/GPU/hr on H100s means a full Chinchilla-optimal 7B training run (140 billion tokens) costs approximately $200-$300 on 8 GPUs. Even on AWS at $12.29/GPU/hr, you are looking at under $1,500. At these price points, owning hardware for 7B training only makes sense if you are running multiple training jobs per week continuously.

For continuous fine-tuning pipelines — where you retrain or adapt a 7B model daily on new data — the economics shift. A refurbished 8x A100 80GB system costs $80,000-$100,000 and draws roughly 4kW with cooling. At $0.10/kWh, annual power cost is about $3,500. That means the system pays for itself versus Lambda Labs pricing in roughly 8-10 months of continuous utilization. Versus AWS, payback is under 3 months.
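
A naive breakeven model, using the figures above, looks like this. Note it compares raw dollars only and ignores the A100-vs-H100 throughput gap, which is one way to reconcile the simple math with the longer 8-10 month payback quoted above:

```python
def payback_months(hw_cost: float, cloud_rate_hr: float, n_gpus: int,
                   power_kw: float, power_rate_kwh: float = 0.10) -> float:
    """Months of 24/7 use before owned hardware beats cloud rental.
    Naive dollar-for-dollar model: ignores performance differences
    between the owned and rented GPUs."""
    cloud_monthly = cloud_rate_hr * n_gpus * 730        # ~730 hours per month
    power_monthly = power_kw * 730 * power_rate_kwh
    return hw_cost / (cloud_monthly - power_monthly)

# $90k refurbished 8x A100 system vs Lambda H100s at $2.49/GPU/hr, 4 kW draw:
print(round(payback_months(90_000, 2.49, 8, 4.0), 1))  # prints 6.3 -- months, raw dollars only
```

Plug in your own hardware quote and the cloud rate you would actually pay; the conclusion flips quickly as utilization drops below 24/7.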

Our pick for 7B-13B: Cloud H100s for experimentation, on-premise A100 80GB for production fine-tuning pipelines. The A100's lower power draw (400W vs 700W) and mature software support make it the workhorse GPU for this scale.

34B - 70B Parameters: Where GPU Selection Actually Matters

This is the sweet spot where GPU choice has the most impact on cost and feasibility. A 70B model pushes the boundaries of what fits on a single node, and the parallelism strategy you choose cascades into GPU requirements.

The memory math for 70B mixed-precision training with AdamW:

| Component | Memory (BF16 training) |
|---|---|
| Model weights (BF16) | ~140 GB |
| Gradients (BF16) | ~140 GB |
| Optimizer states (FP32) | ~560 GB |
| Activation memory (varies) | ~50-200 GB |
| Total | ~890 - 1,040 GB |

This total exceeds any single GPU's memory, so you must distribute. The question is how.
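
The table's accounting collapses to a one-liner you can reuse for other model sizes — 2 bytes/param for BF16 weights, 2 for BF16 gradients, 8 for the two FP32 Adam moments, with activation memory passed in since it varies with batch size and sequence length:

```python
def train_mem_gb(params_b: float, act_gb: float) -> float:
    """Mixed-precision AdamW footprint: BF16 weights (2 B/param) + BF16
    gradients (2 B/param) + FP32 optimizer moments (8 B/param) + activations."""
    return params_b * (2 + 2 + 8) + act_gb

print(train_mem_gb(70, 125))  # prints 965.0 -- GB, inside the ~890-1,040 GB range above
```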

Option A: ZeRO-3 (data parallelism with full sharding) — distributes weights, gradients, and optimizer states across all GPUs. Each GPU holds 1/Nth of everything and communicates heavily during forward and backward passes. Works well when you have high-bandwidth interconnects (NVLink within nodes, InfiniBand between nodes). Typical configuration: 32-64 GPUs.

Option B: Tensor parallelism + ZeRO-1 — splits individual layers across GPUs within a node (requires NVLink), with data parallelism across nodes. This approach achieves higher MFU than ZeRO-3 because intra-layer communication stays on NVLink (fast) while only gradient synchronization goes over InfiniBand (slower). Typical configuration: 4-8 way TP within nodes, 4-16 way DP across nodes.

Option C: Pipeline parallelism + tensor parallelism — splits the model into stages (groups of layers) assigned to different nodes, with tensor parallelism within each stage. This minimizes inter-node communication volume but introduces pipeline bubbles (idle time). Best for very large models (175B+) or when inter-node bandwidth is limited.
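
For a first-order sanity check of Option A, the sharded state per GPU is simply the 12 bytes/param total divided across the group — activations and communication buffers come on top:

```python
def zero3_state_per_gpu_gb(params_b: float, n_gpus: int) -> float:
    """Per-GPU share of weights + gradients + optimizer state (12 B/param)
    under full ZeRO-3 sharding. Activations and buffers are extra."""
    return params_b * 12 / n_gpus

# 70B model fully sharded across 32 GPUs:
print(round(zero3_state_per_gpu_gb(70, 32), 1))  # prints 26.2 -- GB of sharded state per GPU
```

The sharded state fits easily; it is the activation memory and the all-gather traffic that determine whether ZeRO-3 actually runs well at this scale.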

For 70B specifically, Option B is usually optimal. And for Option B to work well, you want GPUs with enough per-GPU memory to hold at least one full attention layer plus activations — which means 80GB is the practical minimum. 141GB (H200) or 192GB (MI300X) is much more comfortable and allows larger per-GPU batch sizes, which improves compute efficiency.

Our pick for 70B: H200 SXM (32 GPUs, 4 nodes) or MI300X (32 GPUs, 4 nodes). The H200 offers the best bandwidth-to-cost ratio. The MI300X offers more memory at a lower price, which is advantageous if your sequence lengths are long (8K+ tokens) and activation memory is a constraint.

175B+ Parameters: No Room for Error

At this scale, you are committing to infrastructure decisions that cost millions of dollars and will run for months. Training a 175B model with Chinchilla-optimal tokens (3.5 trillion) on 128x H200 GPUs takes approximately 25-30 days. On 128x B300 Ultra GPUs, roughly 14-16 days.

The time difference matters more than it might seem. A training run that takes 30 days has a much higher probability of experiencing a hardware failure (GPU ECC error, InfiniBand link flap, storage node crash) than one that takes 15 days. Each failure requires either checkpoint recovery (losing 30-60 minutes of training) or manual intervention. Over a 30-day run, you might lose 2-3 days to failures and recovery. Over a 15-day run, perhaps 1 day. The B300 Ultra's faster throughput effectively reduces the surface area for things to go wrong.
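
A toy reliability model makes the point concrete. The per-GPU-day interruption rate below is an assumed figure for illustration, not a measured one — substitute your own cluster's history:

```python
def expected_interruptions(n_gpus: int, days: float,
                           rate_per_gpu_day: float = 5e-4) -> float:
    """Expected hardware interruptions over a run, assuming independent
    failures at a constant (assumed) per-GPU-day rate."""
    return n_gpus * days * rate_per_gpu_day

print(round(expected_interruptions(128, 30), 1))  # prints 1.9 -- expected interruptions, 30-day run
print(round(expected_interruptions(128, 15), 1))  # prints 1.0 -- expected interruptions, 15-day run
```

Expected failures scale linearly with run length, so halving wall-clock time roughly halves the recovery overhead on top of the direct speedup.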

Our pick for 175B+: B300 Ultra or B200 in clusters of 128-512 GPUs. At this investment level, the NVIDIA ecosystem's reliability and tooling (NCCL, cuDNN, Nsight profiling, NVIDIA-qualified InfiniBand configurations) justify the premium. A failed training run on 256 GPUs wastes hundreds of thousands of dollars — reliability matters more than per-GPU cost.

Cost Modeling Across Providers

| Model Size | GPU Config | Est. Time | Lambda Labs | CoreWeave | AWS |
|---|---|---|---|---|---|
| 7B | 8x H100 | ~10 hrs | $200 | $380 | $980 |
| 13B | 8x H100 | ~36 hrs | $720 | $1,370 | $3,540 |
| 34B | 16x H200 | ~4 days | $3,800 | $7,300 | $18,900 |
| 70B | 32x H200 | ~6 days | $11,500 | $22,000 | $56,700 |
| 175B | 128x B200 | ~25 days | N/A | ~$365,000 | ~$950,000 |
| 405B | 256x B300 | ~30 days | N/A | N/A | ~$2,800,000 |

These estimates assume 35% MFU and Chinchilla-optimal token counts (20x parameters). Your actual MFU will depend on your framework, parallelism strategy, and optimization effort. Well-optimized codebases on B300 Ultra can achieve 45-50% MFU, which would reduce both time and cost proportionally. Poorly optimized codebases on older GPUs might hit 25% MFU, increasing costs by 40%.
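
MFU itself is straightforward to compute from observed throughput using the standard ~6 FLOPs-per-parameter-per-token approximation; peak TFLOPS here should be the dense BF16 figure for your GPU (the example throughput below is a placeholder):

```python
def mfu(params_b: float, tokens_per_sec: float, n_gpus: int,
        peak_tflops: float) -> float:
    """Model FLOPs utilization: achieved training FLOP/s (~6 FLOPs per
    parameter per token) divided by the cluster's peak FLOP/s."""
    achieved = 6 * params_b * 1e9 * tokens_per_sec
    peak = n_gpus * peak_tflops * 1e12
    return achieved / peak

# e.g. a 70B model sustaining 26,000 tokens/s on 32 GPUs at 989 dense BF16 TFLOPS:
print(round(mfu(70, 26_000, 32, 989), 2))  # prints 0.35
```

Logging this number every few hundred steps is the cheapest way to catch a silent throughput regression mid-run.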

Model your own scenario with our AI Compute Cost Calculator.

Hard-Won Lessons from Production Training

These are things I wish someone had told me before my first large-scale training run:

Checkpoint frequently and verify checkpoints work. Save checkpoints every 500-1000 steps, and periodically test that you can actually resume from a checkpoint. We once lost 3 days of training on a 128-GPU cluster because our checkpoint saving code had a subtle bug that corrupted the optimizer state. We only discovered it when the training run crashed and the "checkpoint" failed to load.
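
A cheap guard against exactly this failure mode: write the checkpoint to a temporary file, reload and compare it, and only then atomically replace the last known-good copy. An illustrative stdlib sketch — a real training pipeline would do the same dance with torch.save/torch.load instead of JSON:

```python
import json
import os

def save_and_verify(state: dict, path: str) -> None:
    """Write a checkpoint, prove it round-trips, then atomically swap it in,
    so a corrupt write can never clobber the last good checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    with open(tmp) as f:          # verify BEFORE touching the last good checkpoint
        restored = json.load(f)
    if restored != state:
        os.remove(tmp)
        raise RuntimeError("checkpoint failed round-trip verification")
    os.replace(tmp, path)         # atomic rename on POSIX filesystems
```

The same principle extends to resuming: periodically launch a throwaway job that loads the latest checkpoint and runs a few steps, so a loader bug surfaces in minutes rather than after a crash.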

Monitor GPU memory utilization, not just GPU compute utilization. nvidia-smi shows compute utilization, but a GPU can show 95% compute utilization while actually being bottlenecked by memory bandwidth. Use Nsight Systems or rocprof to identify whether your kernels are compute-bound or memory-bound.

Budget 15-20% overhead for failures and maintenance. Over a 30-day training run on 128 GPUs, you will almost certainly experience at least one GPU ECC error, one InfiniBand timeout, and one unexplained hang. If your project plan assumes zero downtime, you will miss your deadline.

Run a 24-48 hour pilot before committing to a multi-week run. Rent a small cluster (8-16 GPUs) for two days and run your actual training code. Measure throughput, verify checkpoint saving and loading, and identify any software issues. The cost of a pilot ($500-$2,000) is trivial compared to the cost of discovering a problem in week three of a $200,000 training run.
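
The pilot's other payoff is that measured throughput extrapolates. With an assumed scaling efficiency for the larger cluster, projected full-run duration falls out directly — every input below is a placeholder, not a benchmark:

```python
def projected_days(pilot_tokens_per_sec: float, pilot_gpus: int,
                   full_gpus: int, total_tokens: float,
                   scaling_eff: float = 0.9) -> float:
    """Extrapolate full-run duration from measured pilot throughput,
    discounted by an assumed multi-node scaling efficiency."""
    full_tps = pilot_tokens_per_sec * (full_gpus / pilot_gpus) * scaling_eff
    return total_tokens / full_tps / 86_400  # seconds -> days

# Hypothetical: pilot sustains 3,000 tokens/s on 8 GPUs; plan a 50B-token
# run on 128 GPUs, assuming 90% scaling efficiency across nodes:
print(round(projected_days(3_000, 8, 128, 50e9), 1))  # prints 13.4 -- days
```

If the projection disagrees badly with your plan, you found out for $2,000 instead of $200,000.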

Interconnect matters more than you think. We benchmarked the same 70B training job on two clusters with identical GPU counts but different networking: one with InfiniBand HDR (200Gbps) and one with InfiniBand NDR (400Gbps). The NDR cluster was 18% faster despite having identical GPUs. At scale, networking is not a secondary concern — budget for the best interconnect you can afford.

Tags: LLM, AI training, GPU selection, distributed training, model parallelism
