How Much GPU VRAM Do You Need for AI in 2026? Complete Guide
A practical guide to GPU VRAM requirements for LLM training, fine-tuning, inference, and image generation in 2026. Includes memory calculators, quantization tradeoffs, and GPU recommendations by model size.
The most common question we get from teams standing up their first AI infrastructure: "How much GPU memory do we actually need?" It sounds simple. It is not. VRAM requirements depend on model size, precision, training vs inference, batch size, sequence length, optimizer state, and whether you are using gradient checkpointing. This guide gives you the formulas, the rough rules of thumb, and the GPU recommendations for every major use case.
The Memory Budget: What Actually Consumes VRAM
GPU memory is consumed by four distinct categories, and the balance between them changes completely depending on whether you are training or running inference:
For Inference
- Model weights — The dominant cost. Fixed at load time.
- KV cache — Attention key/value states for generated tokens. Scales with batch size × sequence length × layers.
- Activations — Intermediate values during the forward pass. Usually small for inference.
- Framework overhead — PyTorch/CUDA runtime state. Usually 1-2GB.
For Training
- Model weights — Same as inference.
- Optimizer state — Adam stores two additional FP32 tensors the size of the weights (momentum and variance) — 8 bytes per parameter, or 4× your BF16 weight memory.
- Gradients — One copy of weights again.
- Activations — Much larger than inference; every intermediate value must be stored for the backward pass (unless using gradient checkpointing).
Model Weight Memory: The Rule of Thumb
For the model weights themselves, the calculation is straightforward:
- FP32: 4 bytes × number of parameters
- FP16 / BF16: 2 bytes × number of parameters
- FP8: 1 byte × number of parameters
- INT8: 1 byte × number of parameters
- INT4 (GPTQ/AWQ): 0.5 bytes × number of parameters
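The rule of thumb above is simple enough to write as a one-line calculator (using decimal gigabytes, matching the rounded figures in the table below):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Model weight memory in GB: parameter count x bytes per parameter.

    bytes_per_param: 4.0 (FP32), 2.0 (FP16/BF16), 1.0 (FP8/INT8), 0.5 (INT4).
    """
    return params_billions * 1e9 * bytes_per_param / 1e9

# 7B model at FP16 vs INT4:
print(weight_memory_gb(7, 2.0))   # 14.0
print(weight_memory_gb(7, 0.5))   # 3.5
```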
Quick reference for popular model sizes:
| Model Size | FP16/BF16 | INT8 | INT4 | Minimum GPU |
|---|---|---|---|---|
| 7B parameters | ~14GB | ~7GB | ~3.5GB | RTX 4090 (24GB) at FP16 |
| 13B parameters | ~26GB | ~13GB | ~6.5GB | L40S (48GB) at FP16 comfortably |
| 30B parameters | ~60GB | ~30GB | ~15GB | 1× H100 (80GB) at FP16; 1× L40S (48GB) at INT8 |
| 70B parameters | ~140GB | ~70GB | ~35GB | 2× H100 at FP16; 1× H100 at INT8; 1× MI300X at FP16 |
| 180B parameters | ~360GB | ~180GB | ~90GB | 5× H100 at FP16; 3× MI300X at FP16 |
| 405B parameters | ~810GB | ~405GB | ~203GB | 11× H100; 5× MI300X; 5× B200 |
Note: Add 10-20% for KV cache and framework overhead in inference scenarios.
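That 10-20% overhead is dominated by the KV cache, which you can estimate directly from the scaling described earlier (2 tensors per layer, K and V, per token). The example below assumes a Llama-2-7B-style configuration — 32 layers, 32 KV heads, head dimension 128 — purely for illustration; models using grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: two tensors (K and V) per layer, per token."""
    total_bytes = 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / 1e9

# Llama-2-7B-style config, batch 1, 4096-token context, FP16:
print(round(kv_cache_gb(1, 4096, 32, 32, 128), 2))  # 2.15
```

Note how the cache grows linearly with both batch size and sequence length — serving 16 concurrent 4K-token requests on this config would consume ~34GB for the cache alone.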
Training Memory: The Multiplier Effect
Training requires far more memory than inference for the same model. Here is the breakdown for a 7B parameter model in BF16:
- Model weights (BF16): 14GB
- Gradients (BF16): 14GB
- Adam optimizer state (FP32): 56GB (2 moments × 4 bytes × 7B params)
- Activations (varies): 8-30GB depending on batch size and sequence length
- Total for full fine-tune: 92-114GB — requires 2× H100 or 1× MI300X
This is why "a GPU with 24GB can run a 7B model for inference" does not mean "a GPU with 24GB can fine-tune a 7B model." For training, you are looking at 6-8× the memory of the weights alone.
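The 7B breakdown above can be generalized into a rough estimator. It assumes BF16 weights and gradients with FP32 Adam state (8 bytes/param), matching the figures in the list; activations are passed as a (low, high) range since they depend on batch size and sequence length.

```python
def full_finetune_gb(params_billions: float,
                     activation_gb: tuple = (8, 30)) -> tuple:
    """Rough full-fine-tune memory range in GB for BF16 + FP32-Adam training.

    Per parameter: 2 bytes (weights) + 2 bytes (gradients) + 8 bytes (Adam
    momentum and variance in FP32) = 12 bytes, i.e. 12 GB per billion params.
    """
    fixed = params_billions * (2 + 2 + 8)
    return fixed + activation_gb[0], fixed + activation_gb[1]

print(full_finetune_gb(7))  # (92, 114) — the 7B example above
```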
Techniques That Reduce Training Memory
Gradient checkpointing (also called activation recomputation): Instead of storing all activations, recompute them during the backward pass. Reduces activation memory by 4-10× at the cost of ~33% more compute. Almost always worth it for large models.
LoRA / QLoRA fine-tuning: Rather than training all parameters, freeze the base model and train small low-rank adapter matrices. A QLoRA fine-tune of a 70B model can run on a single 80GB H100 — what would normally require 8+ GPUs.
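To see why LoRA is so cheap, count its trainable parameters: each adapted weight matrix gets two low-rank factors A (d × r) and B (r × d), so 2·d·r parameters per matrix. The sketch below assumes square d_model × d_model target matrices (e.g. the attention projections) and a Llama-2-7B-like shape; real models vary, so treat it as an order-of-magnitude estimate.

```python
def lora_trainable_params(n_layers: int, d_model: int, rank: int,
                          n_target_matrices: int = 4) -> int:
    """Trainable LoRA parameters: 2 * d_model * rank per adapted matrix."""
    return n_layers * n_target_matrices * 2 * d_model * rank

# 32 layers, d_model 4096, rank 16, adapting the q/k/v/o projections:
n = lora_trainable_params(32, 4096, 16)
print(n)  # 16777216 — ~16.8M params, roughly 0.24% of a 7B base model
```

With so few trainable parameters, the gradient and optimizer-state multipliers from the previous section apply only to the adapters, not the frozen base weights — which is the entire trick.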
ZeRO optimizer (DeepSpeed): Shards the optimizer state, gradients, and optionally weights across multiple GPUs. ZeRO-3 can train models 4-8× larger than would fit on a single GPU, at the cost of more inter-GPU communication.
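ZeRO-3's effect is easy to estimate: full model state (BF16 weights and gradients plus FP32 master weights and Adam moments) is roughly 16 bytes per parameter, and ZeRO-3 divides it evenly across GPUs. This sketch ignores activations and framework overhead, so real per-GPU usage is higher.

```python
def zero3_per_gpu_gb(params_billions: float, n_gpus: int) -> float:
    """Per-GPU model-state memory under ZeRO-3.

    ~16 bytes/param total (2 weights + 2 grads + 4 master + 8 Adam moments),
    sharded across all GPUs. Activations are NOT included.
    """
    return params_billions * 16 / n_gpus

# 13B model sharded across an 8-GPU node:
print(zero3_per_gpu_gb(13, 8))  # 26.0 GB per GPU — fits 80GB cards easily
```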
VRAM Requirements by Use Case
Local Development / Experimentation
Recommended: 24-48GB VRAM (RTX 4090, or 2× RTX 4090)
For running 7B-13B models at inference, experimenting with LoRA fine-tuning, and prototyping inference pipelines. The RTX 4090 at 24GB is the sweet spot for developers — cheap enough to put in a workstation, powerful enough for meaningful work.
Production Inference: Small-Medium Models (≤30B)
Recommended: NVIDIA L40S (48GB) or A100 80GB
The L40S has Ada Lovelace architecture with 48GB of GDDR6 memory (not HBM) — strong inference throughput, lower cost than H100, and 48GB fits 13B models at FP16 or 30B at INT8 comfortably.
Production Inference: Large Models (70B+)
Recommended: NVIDIA H100/H200 SXM5 or AMD MI300X
The choice comes down to precision requirements. If you need FP16 quality, MI300X's 192GB makes it the most memory-efficient single-GPU option. If you are comfortable with INT8 quantization, an H100 at 80GB serves a 70B model with a typical 1-3% quality drop (model-dependent).
LLM Training from Scratch (1B-70B)
Recommended: 8× H100 SXM5 node (640GB total) or 8× MI300X (1,536GB total)
Full pretraining requires significant cluster infrastructure. At ~16 bytes/param of model state, a single 8-GPU H100 node (640GB total) can hold roughly a 40B model with ZeRO-3; larger models need CPU/NVMe offloading or additional nodes. For 70B models, multi-node clusters with gradient checkpointing and ZeRO-3 are the common configuration.
Fine-tuning (LoRA/QLoRA)
Recommended: 1-2× H100, A100, or MI300X
QLoRA fine-tuning is remarkably memory-efficient. A single A100 80GB can fine-tune models up to roughly 70B parameters with QLoRA (the 4-bit base weights alone are ~35GB). For most teams, this is the most practical entry point into large model customization.
The Quantization Tradeoff: Speed vs Quality
Quantization reduces memory and increases throughput at the cost of some output quality. The practical tradeoffs:
| Precision | Memory vs FP16 | Quality Impact | Recommended For |
|---|---|---|---|
| FP16/BF16 | 1× (baseline) | None (baseline) | Training, quality-sensitive inference |
| FP8 | 0.5× | Minimal (<1% perplexity change) | Training with Transformer Engine; Ada/Hopper/Blackwell hardware |
| INT8 (LLM.int8) | 0.5× | Low (1-3% quality drop) | Production inference where memory is the bottleneck |
| INT4 (GPTQ/AWQ) | 0.25× | Moderate (3-8% quality drop) | Edge deployment, memory-extremely-constrained inference |
| INT4 (QLoRA training) | 0.25× (weights only) | Low for fine-tuning | Memory-efficient fine-tuning of large models |
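To make the quality column concrete, here is a toy symmetric per-tensor INT8 scheme: scale by max |x| / 127, round, and dequantize. This is only a sketch of the core idea — production schemes such as LLM.int8(), GPTQ, and AWQ quantize per channel or per group and handle outlier values far more carefully.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (toy sketch)."""
    scale = max(abs(v) for v in values) / 127
    q = [round(v / scale) for v in values]  # integers in [-127, 127]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.5, 0.003]
q, s = quantize_int8(weights)
restored = dequantize_int8(q, s)
# Every restored value sits within half a quantization step of the original:
print(max(abs(a - b) for a, b in zip(weights, restored)) <= s / 2)  # True
```

The error bound (half a step, scale/2) is what shrinks 4× when you move from INT4's 16 levels per sign to INT8's 128 — the memory/quality tradeoff in the table made explicit.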
Quick Decision Guide
Match your model size and use case to a starting point:
- Inference, ≤7B: Any modern GPU with 16GB+ VRAM. RTX 4080 or better.
- Inference, 13B at FP16: 1× L40S (48GB), A100 (80GB), or H100 (80GB).
- Inference, 70B at FP16: 1× MI300X (192GB) — the only single-GPU option at full precision.
- Inference, 70B at INT8: 1× H100 SXM5 (80GB) — tight fit, use paged attention.
- Fine-tuning (LoRA), any model up to 70B: 1× A100 or H100 with QLoRA.
- Full fine-tuning, 7B: 2× H100 or A100 (80GB) with ZeRO-2/3, or 1× MI300X.
- Pretraining, 70B+: 64-GPU H100 or MI300X cluster minimum.
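The inference entries in this guide reduce to a simple capacity check: weights plus the 10-20% KV-cache/runtime overhead from earlier (the low end, 1.1×, is used here), divided by per-GPU memory. This is planning arithmetic only — real deployments also weigh interconnect, throughput targets, and tensor-parallel sharding efficiency.

```python
import math

def gpus_needed(params_billions: float, bytes_per_param: float,
                gpu_gb: float, overhead: float = 1.1) -> int:
    """Minimum GPU count to hold a model for inference, with runtime overhead."""
    need_gb = params_billions * bytes_per_param * overhead
    return math.ceil(need_gb / gpu_gb)

print(gpus_needed(70, 2, 192))  # 1 — 70B FP16 on MI300X
print(gpus_needed(70, 2, 80))   # 2 — 70B FP16 on H100
print(gpus_needed(7, 2, 24))    # 1 — 7B FP16 on RTX 4090
```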
For a personalized recommendation based on your specific model and budget, use our GPU Finder tool. For detailed GPU-by-GPU specs and memory capacity, browse our GPU specification pages or compare two GPUs head-to-head on the GPU Comparator.
Try Our GPU Tools
Compare GPUs, calculate TCO, and get AI-powered recommendations.