Updated May 2026

Cost to Serve 1 Million LLM Tokens

GPU-level cost per million output tokens for the most common LLMs across cloud providers. Calculated from public on-demand pricing and vLLM throughput benchmarks at batch size 16.

Methodology

Cost per 1M tokens = ($/hr ÷ 3,600) ÷ output_tok/s × 1,000,000. Throughput figures are vLLM community benchmarks at batch size 16, decode phase only (output tokens). Pricing is public on-demand rates as of May 2026 — spot pricing is 40–60% lower.
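The formula above can be written as a small helper for sanity-checking any row in the table (a sketch; the function name is ours, and the example values are the Llama 3.1 8B / 1× H100 row):

```python
def cost_per_million_tokens(rate_per_hour: float, output_tok_per_s: float) -> float:
    """$/1M output tokens for a GPU billed hourly at a given decode throughput."""
    cost_per_second = rate_per_hour / 3_600
    return cost_per_second / output_tok_per_s * 1_000_000

# Llama 3.1 8B on 1x H100 at Lambda Labs ($2.49/hr, 5,500 tok/s)
print(round(cost_per_million_tokens(2.49, 5_500), 2))  # → 0.13
```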

Full Cost Table

| Model | Quant | GPU Config | Provider | $/hr | Tok/s | $/1M tokens | Notes |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP16 | 1× H100 80GB | Lambda Labs | $2.49 | 5,500 | $0.13 | Single GPU, comfortable fit |
| Llama 3.1 8B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 7,000 | $0.14 | 192GB overkill for an 8B model |
| Llama 3.1 8B | FP16 | 1× L40S 48GB | Lambda Labs | $1.40 | 4,200 | $0.09 | Best $/token for 8B models |
| Llama 3.1 70B | FP16 | 2× H100 80GB | Lambda Labs | $4.98 | 2,200 | $0.63 | Tensor-parallel across 2 GPUs |
| Llama 3.1 70B | FP8 | 1× H100 80GB | Lambda Labs | $2.49 | 1,800 | $0.38 | Fits in 80GB with FP8 quant |
| Llama 3.1 70B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 3,000 | $0.32 | Single GPU, no parallelism |
| Llama 3.1 70B | FP16 | 1× MI300X 192GB | CoreWeave | $4.69 | 3,000 | $0.43 | CoreWeave MI300X rate |
| Llama 3.1 405B | FP8 | 8× H100 80GB | Lambda Labs | $19.92 | 1,000 | $5.53 | 8-GPU tensor parallel |
| Llama 3.1 405B | FP8 | 4× MI300X 192GB | Lambda Labs | $13.96 | 1,250 | $3.10 | 4-GPU, 768GB total VRAM |
| Llama 3.1 405B | FP8 | 8× H100 80GB | AWS | $98.32 | 1,000 | $27.31 | p5.48xlarge on-demand |
| DeepSeek R1 671B | FP8 | 4× MI300X 192GB | Lambda Labs | $13.96 | 1,050 | $3.69 | 4× MI300X = 768GB, runs FP8 |
| DeepSeek R1 671B | INT4 | 8× H100 80GB | Lambda Labs | $19.92 | 700 | $7.90 | Needs INT4 AWQ to fit in 640GB |
| DeepSeek R1 671B | FP8 | 4× MI300X 192GB | CoreWeave | $18.76 | 1,050 | $4.96 | CoreWeave rate |
| Mixtral 8×7B | FP16 | 2× H100 80GB | Lambda Labs | $4.98 | 3,200 | $0.43 | ~94GB model, needs 2× H100 |
| Mixtral 8×7B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 4,800 | $0.20 | Single MI300X, large batch |
| DeepSeek-R1 70B | FP8 | 1× H100 80GB | Lambda Labs | $2.49 | 1,800 | $0.38 | Distilled 70B variant |
| DeepSeek-R1 70B | FP16 | 1× MI300X 192GB | Lambda Labs | $3.49 | 2,800 | $0.35 | Full FP16 on single GPU |

Key Takeaways

MI300X wins for 70B+ models

AMD MI300X's 192GB VRAM means 70B runs on a single GPU — no tensor parallelism, no NVLink overhead. You get 3,000 tok/s at $3.49/hr vs 2,200 tok/s on 2× H100 at $4.98/hr. That's roughly 2× better $/token for Llama 3.1 70B ($0.32 vs $0.63/M).
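The comparison above falls straight out of the methodology formula (a quick sketch; the inputs are the table's two Llama 3.1 70B rows):

```python
def cost_per_million(rate_per_hour: float, tok_per_s: float) -> float:
    return rate_per_hour / 3_600 / tok_per_s * 1_000_000

mi300x_single = cost_per_million(3.49, 3_000)  # 1x MI300X, no parallelism
h100_pair     = cost_per_million(4.98, 2_200)  # 2x H100, tensor-parallel

print(f"MI300X: ${mi300x_single:.2f}/M, 2x H100: ${h100_pair:.2f}/M")
print(f"advantage: {h100_pair / mi300x_single:.2f}x")  # just under 2x
```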

L40S is cheapest for 7B–8B models

NVIDIA L40S at ~$1.40/hr handles Llama 3.1 8B FP16 with room to spare (48GB vs 16GB needed). At 4,200 tok/s, cost is $0.093/M tokens — 26% cheaper than H100 for the same model.

DeepSeek R1 671B costs $3.69–7.90/M tokens

4× MI300X ($13.96/hr, ~1,050 tok/s) delivers $3.69/M tokens for DeepSeek R1. Using 8× H100 with INT4 quantization costs $7.90/M tokens — 2× more expensive for the same model.

Hyperscalers charge up to 5× more than specialist clouds

AWS p5.48xlarge (8× H100) costs $98.32/hr vs $19.92/hr for 8× H100 on Lambda Labs — 5× more expensive. For 405B Llama at 1,000 tok/s, that's $27.31/M tokens on AWS vs $5.53/M on Lambda.
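Since the methodology notes that spot pricing runs 40–60% below on-demand, a rough spot-adjusted range can be derived from any on-demand $/1M figure (a sketch; the 0.40–0.60 multipliers are just that stated discount band, not a quoted rate):

```python
def spot_cost_range(on_demand_per_million: float) -> tuple[float, float]:
    """Spot-adjusted $/1M band, assuming spot rates run 40-60% below on-demand."""
    return (round(on_demand_per_million * 0.40, 2),
            round(on_demand_per_million * 0.60, 2))

# Llama 3.1 405B on 8x H100 at Lambda Labs ($5.53/M on-demand)
print(spot_cost_range(5.53))  # → (2.21, 3.32)
```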

Cost legend

< $0.30/M: Excellent
$0.30–0.80/M: Good
$0.80–2.00/M: Average
$2.00–5.00/M: Expensive
> $5.00/M: Very expensive
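The legend's bands map to a simple bucketing function (a sketch; the tier names and thresholds come straight from the legend above, with the boundary handling our assumption):

```python
def cost_tier(usd_per_million: float) -> str:
    """Classify a $/1M-output-tokens figure using the legend's bands."""
    if usd_per_million < 0.30:
        return "Excellent"
    if usd_per_million < 0.80:
        return "Good"
    if usd_per_million < 2.00:
        return "Average"
    if usd_per_million <= 5.00:
        return "Expensive"
    return "Very expensive"

print(cost_tier(0.13), cost_tier(0.63), cost_tier(5.53))
```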

Common Questions

What is the cheapest cost per million tokens for Llama 3.1 70B?

The cheapest option is a single AMD MI300X on Lambda Labs at ~$3.49/hr running Llama 3.1 70B FP16 at ~3,000 tokens/sec, giving approximately $0.32/million output tokens. This is cheaper than running 2× H100 for the same model because MI300X's 192GB VRAM eliminates the tensor parallelism overhead.

How much does it cost to serve DeepSeek R1 671B?

DeepSeek R1 671B requires 4× MI300X (768GB total) in FP8 or 8× H100 in INT4. On Lambda Labs: 4× MI300X at $13.96/hr with ~1,050 tokens/sec costs ~$3.69/million tokens. With 8× H100 at $19.92/hr and ~700 tokens/sec, cost is ~$7.90/million tokens. MI300X is roughly 2× cheaper for DeepSeek R1 serving.

Why is MI300X cheaper per token than H100 for large models?

AMD MI300X has 192GB of VRAM vs H100's 80GB. For models 70B+, MI300X can run the model on a single GPU (no tensor parallelism). Two H100s are needed for the same model — doubling GPU cost and adding NVLink synchronization overhead that reduces throughput. The MI300X's higher VRAM allows better batching, higher throughput, and lower effective $/token despite the higher per-GPU hourly rate.

What GPU has the lowest cost per token for small models (7B–8B)?

NVIDIA L40S at ~$1.40/hr running Llama 3.1 8B at ~4,200 tokens/sec delivers approximately $0.093/million tokens — the lowest cost per token in the 7B–8B class. The L40S's 48GB of GDDR6 is more than enough for 8B FP16, and its lower hourly rate outweighs the MI300X's raw throughput advantage at this model size.