Updated May 2026

vLLM Deployment Calculator

Select your model, GPU, and target throughput to get the estimated GPU count, monthly cloud cost, and cost per million tokens across providers. Based on vLLM benchmarks at batch size 16.

Deployment Plan

  • GPUs Needed: 1 (VRAM-constrained)
  • Model VRAM: 70GB (FP8 quantization)
  • Throughput: 1,800 tokens/sec
  • Cost per Hour: $2.19 (1× GPU)
  • Monthly Cost: $639 (at 40% utilization)
  • Cost per 1M Tokens: $0.34 (output tokens)

RECOMMENDED CONFIG

1× H100 SXM5 80GB on Lambda Labs running Llama 3.1 70B (FP8) via vLLM. Delivers 1,800 tok/s at $639/mo.

Provider Cost Comparison (1× H100 SXM5 80GB)

How the Calculator Works

1. Minimum GPUs for VRAM

The model must fit in GPU memory. GPU count = ⌈model_vram / gpu_vram⌉. For Llama 3.1 70B FP8 (70GB) on H100 (80GB): 1 GPU minimum.
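The VRAM floor can be sketched as follows (function and parameter names are illustrative, not the calculator's internals):

```python
import math

def min_gpus_for_vram(model_vram_gb: float, gpu_vram_gb: float) -> int:
    """Minimum GPUs needed just to hold the model weights."""
    return math.ceil(model_vram_gb / gpu_vram_gb)

# Llama 3.1 70B at FP8 (~70GB) on an 80GB H100:
print(min_gpus_for_vram(70, 80))  # → 1
```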

2. GPUs for Target Throughput

Throughput scales linearly with GPU count (with some overhead). GPUs for throughput = ⌈target_tok_s / single_gpu_tok_s⌉. Final count = max(VRAM constraint, throughput constraint).
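Combining both constraints, a minimal sketch (names are illustrative):

```python
import math

def gpus_needed(model_vram_gb, gpu_vram_gb, target_tok_s, single_gpu_tok_s):
    """Final GPU count: the max of the VRAM and throughput constraints."""
    vram_gpus = math.ceil(model_vram_gb / gpu_vram_gb)
    throughput_gpus = math.ceil(target_tok_s / single_gpu_tok_s)
    return max(vram_gpus, throughput_gpus)

# 70GB model on 80GB GPUs, targeting 1,800 tok/s at 1,800 tok/s per GPU:
print(gpus_needed(70, 80, 1800, 1800))  # → 1
```

Note that a throughput target above one GPU's capacity can drive the count higher than the VRAM minimum; the max() picks the binding constraint.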

3. Monthly Cost

Monthly cost = GPU_count × hourly_rate × 730 hours × utilization_factor. Utilization accounts for idle time during off-peak hours.
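The monthly formula as a sketch, using the page's example figures (defaults are illustrative):

```python
def monthly_cost(gpu_count, hourly_rate, utilization=0.40, hours_per_month=730):
    """Monthly cloud bill, scaled by the fraction of hours the GPUs are billed."""
    return gpu_count * hourly_rate * hours_per_month * utilization

# 1× H100 at $2.19/hr, 40% utilization:
print(round(monthly_cost(1, 2.19)))  # → 639
```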

4. Cost per Million Tokens

$/1M tokens = (hourly_cost / (tokens_per_sec × 3600)) × 1,000,000. This is the output-token cost at sustained throughput — actual cost depends on prompt/completion ratio.
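The same formula in code, reproducing the headline number (names are illustrative):

```python
def cost_per_million_tokens(hourly_cost, tokens_per_sec):
    """Output-token cost at sustained throughput."""
    return hourly_cost / (tokens_per_sec * 3600) * 1_000_000

# $2.19/hr at a sustained 1,800 tok/s:
print(round(cost_per_million_tokens(2.19, 1800), 2))  # → 0.34
```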

Assumptions & Limitations

  • Throughput figures are vLLM community benchmarks at batch size 16, decode phase only (output tokens). Actual throughput depends on prompt length and batch configuration.
  • KV cache memory not included in the VRAM estimate — for small models at low concurrency, this is fine. For 100+ concurrent users or long context (32K+), add KV cache budget separately.
  • Multi-GPU tensor parallelism efficiency assumed at ~85% (15% overhead). Real overhead varies by model architecture and NVLink bandwidth.
  • Throughput figures are for FP8 and FP16 with standard vLLM defaults. TensorRT-LLM can deliver 20–50% more throughput on NVIDIA GPUs at the cost of longer compilation times.
  • Prices are public on-demand rates as of May 2026. Spot/reserved pricing is 30–60% lower.
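The ~85% tensor-parallel efficiency assumption can be folded into the throughput estimate like this (a sketch under the stated assumption; real overhead varies):

```python
def effective_throughput(single_gpu_tok_s, gpu_count, tp_efficiency=0.85):
    """Multi-GPU throughput with tensor-parallel overhead applied.
    A single GPU incurs no parallelism overhead."""
    if gpu_count == 1:
        return float(single_gpu_tok_s)
    return single_gpu_tok_s * gpu_count * tp_efficiency

# Two GPUs at 1,800 tok/s each, 85% scaling efficiency:
print(effective_throughput(1800, 2))  # → 3060.0
```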