Cost to Serve 1 Million LLM Tokens
GPU-level cost per million output tokens for popular open-weight LLMs across cloud providers. Calculated from public on-demand pricing and vLLM throughput benchmarks at batch size 16.
Methodology
Cost per 1M tokens = ($/hr ÷ 3,600) ÷ output_tok/s × 1,000,000. Throughput figures are vLLM community benchmarks at batch size 16, decode phase only (output tokens). Pricing is public on-demand rates as of May 2026 — spot pricing is 40–60% lower.
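The formula is simple enough to script. A minimal sketch (the function name is mine, not from any benchmark harness):

```python
def cost_per_million_tokens(price_per_hour: float, output_tokens_per_sec: float) -> float:
    """Hourly GPU rate and decode throughput -> $ per 1M output tokens."""
    price_per_second = price_per_hour / 3_600
    price_per_token = price_per_second / output_tokens_per_sec
    return price_per_token * 1_000_000

# Sanity check against the first table row: 1x H100 at $2.49/hr, 5,500 tok/s.
print(f"${cost_per_million_tokens(2.49, 5_500):.2f}/M")  # -> $0.13/M
```

Every $/1M figure in the table below is this calculation rounded to the cent.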
Full Cost Table
| Model | Quant | GPU Config | Provider | $/hr | Tok/s | $/1M tokens | Notes |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP16 | 1× H100 80GB | Lambda Labs | $2.49 | 5,500 | $0.13 | Single GPU, comfortable fit |
| Llama 3.1 8B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 7,000 | $0.14 | 192GB overkill; higher tok/s but worse $/token |
| Llama 3.1 8B | FP16 | 1× L40S 48GB | Lambda Labs | $1.40 | 4,200 | $0.09 | Best $/token for 8B models |
| Llama 3.1 70B | FP16 | 2× H100 80GB | Lambda Labs | $4.98 | 2,200 | $0.63 | Tensor-parallel across 2 GPUs |
| Llama 3.1 70B | FP8 | 1× H100 80GB | Lambda Labs | $2.49 | 1,800 | $0.38 | Fits in 80GB with FP8 quant |
| Llama 3.1 70B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 3,000 | $0.32 | Single GPU, no parallelism |
| Llama 3.1 70B | FP16 | 1× MI300X 192GB | CoreWeave | $4.69 | 3,000 | $0.43 | CoreWeave MI300X rate |
| Llama 3.1 405B | FP8 | 8× H100 80GB | Lambda Labs | $19.92 | 1,000 | $5.53 | 8-GPU tensor parallel |
| Llama 3.1 405B | FP8 | 4× MI300X 192GB | Lambda Labs | $13.96 | 1,250 | $3.10 | 4-GPU, 768GB total VRAM |
| Llama 3.1 405B | FP8 | 8× H100 80GB | AWS | $98.32 | 1,000 | $27.31 | p5.48xlarge on-demand |
| DeepSeek R1 671B | FP8 | 4× MI300X 192GB | Lambda Labs | $13.96 | 1,050 | $3.69 | 4× MI300X = 768GB, runs FP8 |
| DeepSeek R1 671B | INT4 | 8× H100 80GB | Lambda Labs | $19.92 | 700 | $7.90 | Needs INT4 AWQ to fit 640GB |
| DeepSeek R1 671B | FP8 | 4× MI300X 192GB | CoreWeave | $18.76 | 1,050 | $4.96 | CoreWeave rate |
| Mixtral 8×7B | FP16 | 2× H100 80GB | Lambda Labs | $4.98 | 3,200 | $0.43 | 94GB model, needs 2× H100 |
| Mixtral 8×7B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 4,800 | $0.20 | Single MI300X, large batch |
| DeepSeek R1 70B | FP8 | 1× H100 80GB | Lambda Labs | $2.49 | 1,800 | $0.38 | Distilled 70B variant |
| DeepSeek R1 70B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 2,800 | $0.35 | Full FP16 on single GPU |
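The Quant and GPU Config pairings in the table come down to a capacity check: weights take roughly params × bytes-per-param, plus headroom for KV cache and activations. A rough sketch of that check (the 1.1 headroom factor is my own illustrative assumption, not a measured value):

```python
# Approximate VRAM-fit check behind the Quant / GPU Config pairings above.
# The 1.1 headroom factor for KV cache and activations is an assumption for
# illustration; real headroom depends on batch size and context length.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def fits(params_b: float, quant: str, vram_gb: float, num_gpus: int,
         headroom: float = 1.1) -> bool:
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    return weights_gb * headroom <= vram_gb * num_gpus

print(fits(70, "FP16", 80, 1))   # False: ~140GB of weights, one H100 can't hold it
print(fits(70, "FP8", 80, 1))    # True: ~70GB fits in 80GB, as in the table
print(fits(671, "FP8", 192, 4))  # True: ~671GB in 768GB across 4x MI300X
```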
Key Takeaways
MI300X wins for 70B+ models
AMD MI300X's 192GB VRAM means 70B runs on a single GPU — no tensor parallelism, no NVLink overhead. You get 3,000 tok/s at $3.49/hr vs 2,200 tok/s on 2× H100 at $4.98/hr. That's 2× better $/token for Llama 3.1 70B.
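Plugging the table's numbers into the cost_per_million_tokens helper from the Methodology sketch makes the gap concrete; for multi-GPU configs the hourly rate is simply per-GPU price times GPU count:

```python
# Reuses cost_per_million_tokens() from the Methodology sketch above.
mi300x = cost_per_million_tokens(3.49, 3_000)        # 1x MI300X, Llama 3.1 70B
h100_tp2 = cost_per_million_tokens(2.49 * 2, 2_200)  # 2x H100, tensor parallel
print(f"MI300X ${mi300x:.2f}/M vs 2x H100 ${h100_tp2:.2f}/M")
# -> MI300X $0.32/M vs 2x H100 $0.63/M
```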
L40S is cheapest for 7B–8B models
NVIDIA L40S at ~$1.40/hr handles Llama 3.1 8B FP16 with room to spare (48GB vs 16GB needed). At 4,200 tok/s, cost is $0.093/M tokens — 26% cheaper than H100 for the same model.
DeepSeek R1 671B costs $3.69–7.90/M tokens
4× MI300X ($13.96/hr, ~1,050 tok/s) delivers $3.69/M tokens for DeepSeek R1. Using 8× H100 with INT4 quantization costs $7.90/M tokens — 2× more expensive for the same model.
Hyperscalers charge up to 5× more than specialist clouds
AWS p5.48xlarge (8× H100) costs $98.32/hr vs $19.92/hr for 8× H100 on Lambda Labs — 5× more expensive. For 405B Llama at 1,000 tok/s, that's $27.31/M tokens on AWS vs $5.53/M on Lambda.
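Since spot rates run roughly 40–60% below on-demand (see Methodology), any on-demand $/M figure brackets a spot-price range. A quick sketch under that assumption:

```python
# Spot pricing runs roughly 40-60% below on-demand (per the Methodology note).
# Illustrative only; actual discounts vary by provider, region, and time.
def spot_range(on_demand_cost_per_m: float,
               low_discount: float = 0.40, high_discount: float = 0.60):
    return (on_demand_cost_per_m * (1 - high_discount),
            on_demand_cost_per_m * (1 - low_discount))

lo, hi = spot_range(5.53)  # Llama 3.1 405B FP8 on 8x H100 (Lambda Labs)
print(f"~${lo:.2f}-${hi:.2f} per 1M output tokens on spot")  # ~$2.21-$3.32
```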
Common Questions
What is the cheapest cost per million tokens for Llama 3.1 70B?
The cheapest option is a single AMD MI300X on Lambda Labs at ~$3.49/hr running Llama 3.1 70B FP16 at ~3,000 tokens/sec, giving approximately $0.32/million output tokens. This is cheaper than running 2× H100 for the same model because MI300X's 192GB VRAM eliminates the tensor parallelism overhead.
How much does it cost to serve DeepSeek R1 671B?
DeepSeek R1 671B requires 4× MI300X (768GB total) in FP8 or 8× H100 in INT4. On Lambda Labs: 4× MI300X at $13.96/hr with ~1,050 tokens/sec costs ~$3.69/million tokens. With 8× H100 at $19.92/hr and ~700 tokens/sec, cost is ~$7.90/million tokens. MI300X is roughly 2× cheaper for DeepSeek R1 serving.
Why is MI300X cheaper per token than H100 for large models?
AMD's MI300X has 192GB of VRAM vs the H100's 80GB. For 70B-class and larger models, a single MI300X holds the full model, so no tensor parallelism is needed. The same model requires two H100s, doubling GPU cost and adding NVLink synchronization overhead that cuts throughput. The MI300X's extra VRAM also allows larger batches, giving higher throughput and lower effective $/token despite its higher per-GPU hourly rate.
What GPU has the lowest cost per token for small models (7B-8B)?
NVIDIA L40S at ~$1.40/hr running Llama 3.1 8B at ~4,200 tokens/sec delivers approximately $0.093/million tokens, the lowest cost per token for 7B–8B class models. The L40S has 48GB of GDDR6, more than enough for 8B FP16, and its lower hourly rate outweighs the MI300X's throughput advantage at this model size.