Cost to Serve 1 Million LLM Tokens
GPU-level cost per million output tokens for popular open-weight LLMs across cloud providers. Calculated from public on-demand pricing and vLLM throughput benchmarks at batch size 16.
Methodology
Cost per 1M tokens = ($/hr ÷ 3,600) ÷ output_tok/s × 1,000,000. Throughput figures are vLLM community benchmarks at batch size 16, decode phase only (output tokens). Pricing is public on-demand rates as of May 2026 — spot pricing is 40–60% lower.
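The formula is simple enough to script. A minimal sketch (the function name is mine, not from any benchmark harness):

```python
def cost_per_million_tokens(price_per_hour: float, output_tokens_per_sec: float) -> float:
    """Hourly GPU rate and decode throughput -> $ per 1M output tokens."""
    price_per_second = price_per_hour / 3_600
    price_per_token = price_per_second / output_tokens_per_sec
    return price_per_token * 1_000_000

# Sanity check against the first table row: 1x H100 at $2.49/hr, 5,500 tok/s.
print(f"${cost_per_million_tokens(2.49, 5_500):.2f}/M")  # -> $0.13/M
```

Every $/1M figure in the table below is this calculation rounded to the cent.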
Full Cost Table
| Model | Quant | GPU Config | Provider | $/hr | Tok/s | $/1M tokens | Notes |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP16 | 1× H100 80GB | Lambda Labs | $2.49 | 5,500 | $0.13 | Single GPU, comfortable fit |
| Llama 3.1 8B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 7,000 | $0.14 | 192GB overkill; higher tok/s but worse $/token |
| Llama 3.1 8B | FP16 | 1× L40S 48GB | Lambda Labs | $1.40 | 4,200 | $0.09 | Best $/token for 8B models |
| Llama 3.1 70B | FP16 | 2× H100 80GB | Lambda Labs | $4.98 | 2,200 | $0.63 | Tensor-parallel across 2 GPUs |
| Llama 3.1 70B | FP8 | 1× H100 80GB | Lambda Labs | $2.49 | 1,800 | $0.38 | Fits in 80GB with FP8 quant |
| Llama 3.1 70B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 3,000 | $0.32 | Single GPU, no parallelism |
| Llama 3.1 70B | FP16 | 1× MI300X 192GB | CoreWeave | $4.69 | 3,000 | $0.43 | CoreWeave MI300X rate |
| Llama 3.1 405B | FP8 | 8× H100 80GB | Lambda Labs | $19.92 | 1,000 | $5.53 | 8-GPU tensor parallel |
| Llama 3.1 405B | FP8 | 4× MI300X 192GB | Lambda Labs | $13.96 | 1,250 | $3.10 | 4-GPU, 768GB total VRAM |
| Llama 3.1 405B | FP8 | 8× H100 80GB | AWS | $98.32 | 1,000 | $27.31 | p5.48xlarge on-demand |
| DeepSeek R1 671B | FP8 | 4× MI300X 192GB | Lambda Labs | $13.96 | 1,050 | $3.69 | 4× MI300X = 768GB, runs FP8 |
| DeepSeek R1 671B | INT4 | 8× H100 80GB | Lambda Labs | $19.92 | 700 | $7.90 | Needs INT4 AWQ to fit 640GB |
| DeepSeek R1 671B | FP8 | 4× MI300X 192GB | CoreWeave | $18.76 | 1,050 | $4.96 | CoreWeave rate |
| Mixtral 8×7B | FP16 | 2× H100 80GB | Lambda Labs | $4.98 | 3,200 | $0.43 | 94GB model, needs 2× H100 |
| Mixtral 8×7B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 4,800 | $0.20 | Single MI300X, large batch |
| DeepSeek R1 70B | FP8 | 1× H100 80GB | Lambda Labs | $2.49 | 1,800 | $0.38 | Distilled 70B variant |
| DeepSeek R1 70B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 2,800 | $0.35 | Full FP16 on single GPU |
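The Quant and GPU Config pairings in the table come down to a capacity check: weights take roughly params × bytes-per-param, plus headroom for KV cache and activations. A rough sketch of that check (the 1.1 headroom factor is my own illustrative assumption, not a measured value):

```python
# Approximate VRAM-fit check behind the Quant / GPU Config pairings above.
# The 1.1 headroom factor for KV cache and activations is an assumption for
# illustration; real headroom depends on batch size and context length.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def fits(params_b: float, quant: str, vram_gb: float, num_gpus: int,
         headroom: float = 1.1) -> bool:
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    return weights_gb * headroom <= vram_gb * num_gpus

print(fits(70, "FP16", 80, 1))   # False: ~140GB of weights, one H100 can't hold it
print(fits(70, "FP8", 80, 1))    # True: ~70GB fits in 80GB, as in the table
print(fits(671, "FP8", 192, 4))  # True: ~671GB in 768GB across 4x MI300X
```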
Key Takeaways
MI300X wins for 70B+ models
AMD MI300X's 192GB VRAM means 70B runs on a single GPU — no tensor parallelism, no NVLink overhead. You get 3,000 tok/s at $3.49/hr vs 2,200 tok/s on 2× H100 at $4.98/hr. That's 2× better $/token for Llama 3.1 70B.
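Plugging the table's numbers into the cost_per_million_tokens helper from the Methodology sketch makes the gap concrete; for multi-GPU configs the hourly rate is simply per-GPU price times GPU count:

```python
# Reuses cost_per_million_tokens() from the Methodology sketch above.
mi300x = cost_per_million_tokens(3.49, 3_000)        # 1x MI300X, Llama 3.1 70B
h100_tp2 = cost_per_million_tokens(2.49 * 2, 2_200)  # 2x H100, tensor parallel
print(f"MI300X ${mi300x:.2f}/M vs 2x H100 ${h100_tp2:.2f}/M")
# -> MI300X $0.32/M vs 2x H100 $0.63/M
```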
L40S is cheapest for 7B–8B models
NVIDIA L40S at ~$1.40/hr handles Llama 3.1 8B FP16 with room to spare (48GB vs 16GB needed). At 4,200 tok/s, cost is $0.093/M tokens — 26% cheaper than H100 for the same model.
DeepSeek R1 671B costs $3.69–7.90/M tokens
4× MI300X ($13.96/hr, ~1,050 tok/s) delivers $3.69/M tokens for DeepSeek R1. Using 8× H100 with INT4 quantization costs $7.90/M tokens — 2× more expensive for the same model.
Hyperscalers charge up to 5× more than specialist clouds
AWS p5.48xlarge (8× H100) costs $98.32/hr vs $19.92/hr for 8× H100 on Lambda Labs — 5× more expensive. For 405B Llama at 1,000 tok/s, that's $27.31/M tokens on AWS vs $5.53/M on Lambda.
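Since spot rates run roughly 40–60% below on-demand (see Methodology), any on-demand $/M figure brackets a spot-price range. A quick sketch under that assumption:

```python
# Spot pricing runs roughly 40-60% below on-demand (per the Methodology note).
# Illustrative only; actual discounts vary by provider, region, and time.
def spot_range(on_demand_cost_per_m: float,
               low_discount: float = 0.40, high_discount: float = 0.60):
    return (on_demand_cost_per_m * (1 - high_discount),
            on_demand_cost_per_m * (1 - low_discount))

lo, hi = spot_range(5.53)  # Llama 3.1 405B FP8 on 8x H100 (Lambda Labs)
print(f"~${lo:.2f}-${hi:.2f} per 1M output tokens on spot")  # ~$2.21-$3.32
```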
Common Questions
What is the cheapest cost per million tokens for Llama 3.1 70B?
The cheapest option is a single AMD MI300X on Lambda Labs at ~$3.49/hr running Llama 3.1 70B FP16 at ~3,000 tokens/sec, giving approximately $0.32/million output tokens. This is cheaper than running 2× H100 for the same model because MI300X's 192GB VRAM eliminates the tensor parallelism overhead.
How much does it cost to serve DeepSeek R1 671B?
DeepSeek R1 671B requires 4× MI300X (768GB total) in FP8 or 8× H100 in INT4. On Lambda Labs: 4× MI300X at $13.96/hr with ~1,050 tokens/sec costs ~$3.69/million tokens. With 8× H100 at $19.92/hr and ~700 tokens/sec, cost is ~$7.90/million tokens. MI300X is roughly 2× cheaper for DeepSeek R1 serving.
Why is MI300X cheaper per token than H100 for large models?
AMD's MI300X has 192GB of VRAM vs the H100's 80GB. For 70B-class and larger models, a single MI300X holds the full model, so no tensor parallelism is needed. The same model requires two H100s, doubling GPU cost and adding NVLink synchronization overhead that cuts throughput. The MI300X's extra VRAM also allows larger batches, giving higher throughput and lower effective $/token despite its higher per-GPU hourly rate.
What GPU has the lowest cost per token for small models (7B-8B)?
NVIDIA L40S at ~$1.40/hr running Llama 3.1 8B at ~4,200 tokens/sec delivers approximately $0.093/million tokens, the lowest cost per token for 7B–8B class models. The L40S has 48GB of GDDR6, more than enough for 8B FP16, and its lower hourly rate outweighs the MI300X's throughput advantage at this model size.