
Best GPU for Fine-Tuning LLMs in 2026

Fine-tuning requirements vary dramatically: QLoRA on a 7B model needs ~10GB VRAM; full fine-tuning of a 70B model needs 500GB+. Matching GPU to your specific technique — LoRA, QLoRA, PEFT, or full — is critical for cost efficiency.

TL;DR

For most fine-tuning: H100 with QLoRA/LoRA handles up to 70B efficiently. MI300X for full FP16 fine-tuning of large models. A100 as a budget option. L40S for small-model fine-tuning on a budget.

TOP 4 GPUS RANKED

#1

NVIDIA H100 SXM5

NVIDIA · TOP PICK

Best all-around fine-tuning GPU

Memory

80GB HBM3

FP8 TFLOPS

3,958 TFLOPS (with sparsity; ~1,979 dense)

TDP

700W

Cloud Cost

~$2.50–3.50/hr

Pros

  • +Hugging Face PEFT/TRL fully optimized for H100
  • +FlashAttention-2 + gradient checkpointing fits 70B LoRA in 80GB
  • +NVLink for multi-GPU full fine-tuning
  • +Best FlashAttention-2 throughput for long-context fine-tuning

Cons

  • 80GB requires LoRA/QLoRA for 70B models
  • More expensive than A100 for small-model fine-tuning
#2

AMD Instinct MI300X

AMD

Best for full FP16 fine-tuning of large models

Memory

192GB HBM3

FP8 TFLOPS

2,614 TFLOPS

TDP

750W

Cloud Cost

~$3.20/hr

Pros

  • +Full FP16 fine-tuning of ~30B models on one GPU; 70B fits across 3× MI300X (192GB each)
  • +No LoRA needed — avoids quality loss from PEFT
  • +30% cheaper than H100
  • +HuggingFace Trainer + TRL support ROCm well

Cons

  • Some PEFT/custom kernels need ROCm porting
  • Less community fine-tuning content vs CUDA
#3

NVIDIA A100 SXM4

NVIDIA

Budget workhorse for fine-tuning

Memory

80GB HBM2e

FP16 TFLOPS (no FP8 support)

312 TFLOPS

TDP

400W

Cloud Cost

~$1.80/hr

Pros

  • +~40% cheaper than H100 for same VRAM
  • +Full ecosystem support: Axolotl, Unsloth, TRL all work
  • +80GB handles 70B QLoRA comfortably
  • +Widely available on Lambda, CoreWeave, vast.ai

Cons

  • No FP8 support; roughly 3× slower than H100 for low-precision training
  • Slower FP16/TF32 Tensor Cores (H100 is ~2–3× faster for matrix multiply)
#4

NVIDIA L40S

NVIDIA

Cheapest per hour for small-model fine-tuning

Memory

48GB GDDR6

FP8 TFLOPS

733 TFLOPS

TDP

350W

Cloud Cost

~$1.40/hr

Pros

  • +Best $/hr for 7B–13B full fine-tuning
  • +733 FP8 TFLOPS — faster than A100 for training
  • +48GB handles 7B full FP16 + optimizer states comfortably
  • +Low TDP, cheapest cloud option

Cons

  • 48GB too small for 70B even with QLoRA (need multi-GPU)
  • GDDR6 bandwidth lower than HBM — slower for memory-bound ops

KEY FACTORS TO CONSIDER

QLoRA vs LoRA vs Full Fine-Tune: VRAM requirements

QLoRA (4-bit quantized base + FP16 adapters): 7B=~6GB, 13B=~10GB, 70B=~48GB. LoRA (16-bit base + adapters): 7B=~15GB, 70B=~160GB. Full FP16: 7B=~60GB, 70B=~560GB+. For a 70B full fine-tune, you need 8× H100 (640GB) or 3× MI300X (576GB).
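The figures above can be sanity-checked with a back-of-the-envelope bytes-per-parameter model. The multipliers below are rough assumptions (4-bit weights plus adapter and quantization overhead for QLoRA; FP16 weights plus adapters for LoRA; FP16 weights, gradients, and optimizer states for full fine-tuning), not measured values:

```python
def estimate_vram_gb(params_b: float, method: str) -> float:
    """Very rough fine-tuning VRAM estimate, in GB, for a model with
    `params_b` billion parameters. Ignores activations and framework
    overhead, so treat results as lower bounds."""
    bytes_per_param = {
        "qlora": 0.65,     # 4-bit base weights + adapters + quant overhead
        "lora": 2.2,       # FP16 base weights + adapters + adapter gradients
        "full_fp16": 8.0,  # FP16 weights + gradients + FP16 optimizer states
    }
    return params_b * bytes_per_param[method] + 1.5  # + small fixed overhead

print(estimate_vram_gb(7, "qlora"))       # ~6 GB
print(estimate_vram_gb(70, "qlora"))      # ~47 GB
print(estimate_vram_gb(70, "full_fp16"))  # ~560 GB
```

With these multipliers the estimates land close to the table: 13B QLoRA comes out near 10GB, and 70B full FP16 near 560GB.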

Context length multiplies VRAM needs

Long-context fine-tuning (4K → 128K tokens) explodes activation memory. FlashAttention-2 mitigates this but doesn't eliminate it: 128K-context LoRA fine-tuning of a 7B model needs ~30GB vs ~6GB at 2K context. More VRAM enables longer context.
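The difference FlashAttention makes can be illustrated with a toy calculation: naive attention materializes a seq × seq score matrix per head, while FlashAttention streams it in tiles so attention activations stay linear in sequence length. The dimensions below are Llama-7B-like assumptions, and the function deliberately ignores MLP activations and gradient checkpointing:

```python
def attn_activation_gb(seq_len: int, n_heads: int = 32, head_dim: int = 128,
                       n_layers: int = 32, flash: bool = True) -> float:
    """Per-sequence attention activation memory in GB (FP16 = 2 bytes)."""
    if flash:
        per_layer = n_heads * seq_len * head_dim * 2  # O(seq) tile buffers
    else:
        per_layer = n_heads * seq_len * seq_len * 2   # O(seq^2) score matrix
    return per_layer * n_layers / 1e9

print(attn_activation_gb(2_048, flash=False))    # ~8.6 GB
print(attn_activation_gb(131_072, flash=False))  # ~35,000 GB: infeasible
print(attn_activation_gb(131_072, flash=True))   # ~34 GB: fits a large GPU
```

Even in this simplified model, 128K context without FlashAttention is hopeless, while with it the activation footprint lands in the tens of gigabytes, consistent with the ~30GB figure above.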

Unsloth and Axolotl for efficiency

Libraries like Unsloth (2× faster LoRA), Axolotl (multi-GPU LoRA), and TRL are tested primarily on CUDA. They work on ROCm but may need minor patches. These efficiency gains are significant — Unsloth cuts training time 30–70%.
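Part of why these libraries are so fast is that LoRA trains only a tiny fraction of the model. A quick count of adapter parameters, assuming hypothetical Llama-7B-like dimensions with rank-16 adapters on the four attention projections (q/k/v/o):

```python
def lora_trainable_params(hidden: int, n_layers: int, rank: int,
                          modules_per_layer: int = 4) -> int:
    """Each adapted linear layer adds two low-rank matrices:
    A (hidden x rank) and B (rank x hidden)."""
    return 2 * hidden * rank * modules_per_layer * n_layers

total = lora_trainable_params(hidden=4096, n_layers=32, rank=16)
print(total)                 # 16,777,216 trainable parameters
print(f"{total / 7e9:.2%}")  # ~0.24% of a 7B model
```

Training well under 1% of the weights is what lets gradients and optimizer states fit in the modest VRAM budgets listed above.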

FREQUENTLY ASKED QUESTIONS

Can I fine-tune Llama 3 70B on a single GPU?

Yes, with QLoRA (4-bit quantization) on a single H100 (80GB) or A100 (80GB). Full FP16 fine-tuning requires 8× H100 (640GB) or 3× MI300X (576GB). For most tasks, QLoRA 70B quality is 95–98% of full fine-tune quality.

How long does it take to fine-tune a 7B model?

On a single H100: ~2–4 hours for 10K examples at 2K context with LoRA. On A100: ~4–8 hours. On L40S: ~3–6 hours. Exact time depends on learning rate, batch size, epochs, and sequence length.
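These durations follow from simple token-throughput arithmetic. A hedged sketch (the tokens/sec figure below is an assumed H100 LoRA throughput for illustration, not a benchmark):

```python
def training_hours(n_examples: int, seq_len: int, tokens_per_sec: float,
                   epochs: int = 1) -> float:
    """Wall-clock estimate: total tokens processed / sustained throughput."""
    return n_examples * seq_len * epochs / tokens_per_sec / 3600

# 10K examples x 2K context = ~20M tokens; assume ~2,000 tok/s for H100 LoRA
print(training_hours(10_000, 2_048, 2_000))  # ~2.8 hours
```

Halving throughput (roughly an A100) doubles the estimate to ~5.7 hours, matching the ranges quoted above.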

What is the cheapest GPU to fine-tune a 7B model?

L40S at ~$1.40/hr handles 7B full FP16 fine-tuning; a 10K-example run takes ~3–6 hours, costing $4–8 total. For QLoRA on 7B, a single spot-priced H100 at ~$1.80/hr finishes in 2–3 hours. Unsloth can roughly halve either figure.

Is AMD MI300X good for fine-tuning?

Yes, especially for teams wanting to avoid LoRA/quantization on large models. The 192GB VRAM lets you full fine-tune 30B models directly. ROCm + PyTorch + TRL work well. The tradeoff is fewer fine-tuning-specific optimizations like Unsloth compared to CUDA.
