AI Training · 2026-04-16 · 12 min read

How Much GPU VRAM Do You Need for AI in 2026? Complete Guide

A practical guide to GPU VRAM requirements for LLM training, fine-tuning, inference, and image generation in 2026. Includes memory calculators, quantization tradeoffs, and GPU recommendations by model size.

The most common question we get from teams standing up their first AI infrastructure: "How much GPU memory do we actually need?" It sounds simple. It is not. VRAM requirements depend on model size, precision, training vs inference, batch size, sequence length, optimizer state, and whether you are using gradient checkpointing. This guide gives you the formulas, the rough rules of thumb, and the GPU recommendations for every major use case.

The Memory Budget: What Actually Consumes VRAM

GPU memory is consumed by four distinct categories, and the balance between them changes completely depending on whether you are training or running inference:

For Inference

  1. Model weights — The dominant cost. Fixed at load time.
  2. KV cache — Attention key/value states for generated tokens. Scales with batch size × sequence length × layers.
  3. Activations — Intermediate values during the forward pass. Usually small for inference.
  4. Framework overhead — PyTorch/CUDA runtime state. Usually 1-2GB.
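The KV-cache term above is the one that surprises people, and it is just arithmetic. A minimal sketch (the Llama-2-7B-style dimensions in the example are illustrative assumptions, not measured values):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    """Bytes held by attention K/V states: 2 tensors (K and V) per layer,
    one head_dim-vector per token per KV head, at the given precision."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_val

# Illustrative Llama-2-7B-style shape: 32 layers, 32 KV heads, head_dim 128
gb = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                    seq_len=4096, batch=1) / 1024**3
print(f"{gb:.1f} GiB")  # 2.0 GiB for one 4096-token sequence at FP16
```

Models using grouped-query attention have far fewer KV heads, which shrinks this cache substantially — one reason long-context serving leans on GQA and paged attention.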

For Training

  1. Model weights — Same as inference.
  2. Optimizer state — Adam stores two additional per-parameter states (momentum and variance), typically in FP32. For BF16 weights, that adds roughly 4× the weight memory.
  3. Gradients — One copy of weights again.
  4. Activations — Much larger than inference; every intermediate value must be stored for the backward pass (unless using gradient checkpointing).

Model Weight Memory: The Rule of Thumb

For the model weights themselves, the calculation is straightforward:

  • FP32: 4 bytes × number of parameters
  • FP16 / BF16: 2 bytes × number of parameters
  • FP8: 1 byte × number of parameters
  • INT8: 1 byte × number of parameters
  • INT4 (GPTQ/AWQ): 0.5 bytes × number of parameters
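The rules of thumb above reduce to a single multiplication, which is worth encoding once so you stop redoing it on napkins:

```python
# Bytes per parameter at each precision (from the rules of thumb above)
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billions, precision):
    """Weight memory in decimal GB: params × bytes-per-param.
    (1e9 params × N bytes / 1e9 bytes-per-GB cancels neatly.)"""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_gb(7, "fp16"))   # 14.0
print(weight_gb(70, "int4"))  # 35.0
```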

Quick reference for popular model sizes:

| Model Size | FP16/BF16 | INT8 | INT4 | Minimum GPU |
| --- | --- | --- | --- | --- |
| 7B parameters | ~14GB | ~7GB | ~3.5GB | RTX 4090 (24GB) at FP16 |
| 13B parameters | ~26GB | ~13GB | ~6.5GB | L40S (48GB) at FP16; H100 (80GB) comfortably |
| 30B parameters | ~60GB | ~30GB | ~15GB | 2× H100 at FP16; 1× H100 at INT8 |
| 70B parameters | ~140GB | ~70GB | ~35GB | 2× H100 or 1× MI300X at FP16; 1× H100 at INT8 |
| 180B parameters | ~360GB | ~180GB | ~90GB | 5× H100 at FP16; 3× MI300X at FP16 |
| 405B parameters | ~810GB | ~405GB | ~202GB | 11× H100; 5× MI300X; 5× B200 at FP16 |

Note: Add 10-20% for KV cache and framework overhead in inference scenarios.

Training Memory: The Multiplier Effect

Training requires far more memory than inference for the same model. Here is the breakdown for a 7B parameter model in BF16:

  • Model weights (BF16): 14GB
  • Gradients (BF16): 14GB
  • Adam optimizer state (FP32): 56GB (2 moments × 4 bytes × 7B params)
  • Activations (varies): 8-30GB depending on batch size and sequence length
  • Total for full fine-tune: 92-114GB — requires 2× H100 or 1× MI300X

This is why "a GPU with 24GB can run a 7B model for inference" does not mean "a GPU with 24GB can fine-tune a 7B model." For training, you are looking at 6-8× the memory of the weights alone.
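The 7B breakdown above can be turned into a quick estimator. Activation memory is a user-supplied guess here, since it depends on batch size, sequence length, and checkpointing:

```python
def full_finetune_gb(params_b, activations_gb):
    """Memory for a full fine-tune with BF16 weights and standard Adam."""
    weights   = params_b * 2        # BF16 weights: 2 bytes/param
    grads     = params_b * 2        # BF16 gradients: 2 bytes/param
    optimizer = params_b * 2 * 4    # Adam: 2 FP32 moments, 4 bytes each
    return weights + grads + optimizer + activations_gb

print(full_finetune_gb(7, activations_gb=8))   # 92 GB (low end)
print(full_finetune_gb(7, activations_gb=30))  # 114 GB (high end)
```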

Techniques That Reduce Training Memory

Gradient checkpointing (also called activation recomputation): Instead of storing all activations, recompute them during the backward pass. Reduces activation memory by 4-10× at the cost of ~33% more compute. Almost always worth it for large models.

LoRA / QLoRA fine-tuning: Rather than training all parameters, freeze the base model and train small low-rank adapter matrices. A QLoRA fine-tune of a 70B model can run on a single 80GB H100 — what would normally require 8+ GPUs.

ZeRO optimizer (DeepSpeed): Shards the optimizer state, gradients, and optionally weights across multiple GPUs. ZeRO-3 can train models 4-8× larger than would fit on a single GPU, at the cost of more inter-GPU communication.
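The ZeRO sharding arithmetic is easy to sketch. A rough per-GPU estimate under the three stages, assuming BF16 weights/gradients and FP32 Adam moments (activations and communication buffers excluded):

```python
def per_gpu_train_state_gb(params_b, n_gpus, stage):
    """Rough per-GPU memory for weights + grads + Adam state under
    ZeRO stages 0-3. Each stage shards one more component."""
    weights, grads, optim = params_b * 2, params_b * 2, params_b * 8
    if stage >= 1: optim   /= n_gpus   # ZeRO-1: shard optimizer state
    if stage >= 2: grads   /= n_gpus   # ZeRO-2: also shard gradients
    if stage >= 3: weights /= n_gpus   # ZeRO-3: also shard weights
    return weights + grads + optim

print(per_gpu_train_state_gb(7, 8, stage=0))  # 84 GB/GPU (unsharded)
print(per_gpu_train_state_gb(7, 8, stage=3))  # 10.5 GB/GPU
```

The unsharded 84GB/GPU for a 7B model does not fit an 80GB H100; with ZeRO-3 across 8 GPUs, the same state costs 10.5GB/GPU, leaving room for activations.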

VRAM Requirements by Use Case

Local Development / Experimentation

Recommended: 24-48GB VRAM (RTX 4090, or 2× RTX 4090)

For running 7B-13B models at inference, experimenting with LoRA fine-tuning, and prototyping inference pipelines. The RTX 4090 at 24GB is the sweet spot for developers — cheap enough to put in a workstation, powerful enough for meaningful work.

Production Inference: Small-Medium Models (≤30B)

Recommended: NVIDIA L40S (48GB) or A100 80GB

The L40S is an Ada Lovelace card with 48GB of GDDR6 memory (not HBM) — strong throughput for inference at a lower cost than H100, and 48GB fits 13B models at FP16 or 30B at INT8 comfortably.

Production Inference: Large Models (70B+)

Recommended: NVIDIA H100/H200 SXM5 or AMD MI300X

The choice comes down to precision requirements. If you need FP16 quality, the MI300X's 192GB makes it the most memory-efficient single-GPU option. If you are comfortable with INT8 quantization, an 80GB H100 can serve 70B with a small quality drop (typically 1-3%, model-dependent).

LLM Training from Scratch (1B-70B)

Recommended: 8× H100 SXM5 node (640GB total) or 8× MI300X (1,536GB total)

Full pretraining requires significant cluster infrastructure. A single 8-GPU H100 node can pretrain models up to ~100B parameters with ZeRO-3. For 70B models, an 8-GPU H100 node with gradient checkpointing and ZeRO-3 is a common configuration.

Fine-tuning (LoRA/QLoRA)

Recommended: 1-2× H100, A100, or MI300X

QLoRA fine-tuning is remarkably memory-efficient. A single A100 80GB can fine-tune models up to 180B parameters with QLoRA. For most teams, this is the most practical entry point into large model customization.
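To see why QLoRA is so memory-efficient, a back-of-envelope estimate helps. The adapter size (0.2B trainable parameters) and 20GB activation budget below are illustrative assumptions, not measured values:

```python
def qlora_gb(params_b, adapter_params_b=0.2, activations_gb=20):
    """Back-of-envelope QLoRA footprint: the base model is frozen at
    4-bit, so gradients and Adam state exist only for the tiny adapters."""
    base     = params_b * 0.5            # frozen 4-bit base weights
    adapters = adapter_params_b * 2      # trainable BF16 LoRA weights
    grads    = adapter_params_b * 2      # gradients for adapters only
    optim    = adapter_params_b * 8      # Adam FP32 moments, adapters only
    return base + adapters + grads + optim + activations_gb

print(round(qlora_gb(70), 1))  # 57.4 GB — a 70B QLoRA fits one 80GB GPU
```

Almost the entire budget is the quantized base model plus activations; the trainable state that dominates a full fine-tune shrinks to a rounding error.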

The Quantization Tradeoff: Speed vs Quality

Quantization reduces memory and increases throughput at the cost of some output quality. The practical tradeoffs:

| Precision | Memory vs FP16 | Quality Impact | Recommended For |
| --- | --- | --- | --- |
| FP16/BF16 | 1× (baseline) | None (baseline) | Training, quality-sensitive inference |
| FP8 | 0.5× | Minimal (<1% perplexity change) | Training with Transformer Engine; Hopper/Blackwell only |
| INT8 (LLM.int8()) | 0.5× | Low (1-3% quality drop) | Production inference where memory is the bottleneck |
| INT4 (GPTQ/AWQ) | 0.25× | Moderate (3-8% quality drop) | Edge deployment, extremely memory-constrained inference |
| INT4 (QLoRA training) | 0.25× (weights only) | Low for fine-tuning | Memory-efficient fine-tuning of large models |
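As a sanity check against these tradeoffs, a tiny helper can pick the highest-quality precision whose weights fit a given VRAM budget. The 10% headroom for KV cache and runtime overhead is an assumption; FP8 is omitted because it matches INT8's footprint but requires Hopper or Blackwell hardware:

```python
PRECISIONS = [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]  # best quality first

def best_precision(params_b, vram_gb, headroom=1.10):
    """Highest-quality precision whose weights (× an assumed 10% headroom
    for KV cache and runtime) fit in vram_gb; None if nothing fits."""
    for name, bytes_per in PRECISIONS:
        if params_b * bytes_per * headroom <= vram_gb:
            return name
    return None

print(best_precision(70, 192))  # fp16 — a 70B model on an MI300X
print(best_precision(70, 80))   # int8 — a 70B model on an H100
```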

Quick Decision Guide

Match your model size and use case to a starting point:

  • Inference, ≤7B: Any modern GPU with 16GB+ VRAM. RTX 4080 or better.
  • Inference, 13B at FP16: 1× L40S (48GB), A100 (80GB), or H100 (80GB).
  • Inference, 70B at FP16: 1× MI300X (192GB) — the only single-GPU option at full precision.
  • Inference, 70B at INT8: 1× H100 SXM5 (80GB) — tight fit, use paged attention.
  • Fine-tuning (LoRA), any model up to 70B: 1× A100 or H100 with QLoRA.
  • Full fine-tuning, 7B: 4× A100 or 2× MI300X with ZeRO-3.
  • Pretraining, 70B+: 64-GPU H100 or MI300X cluster minimum.

For a personalized recommendation based on your specific model and budget, use our GPU Finder tool. For detailed GPU-by-GPU specs and memory capacity, browse our GPU specification pages or compare two GPUs head-to-head on the GPU Comparator.

