How Much GPU VRAM Do You Need for AI in 2026? Complete Guide
A practical guide to GPU VRAM requirements for LLM training, fine-tuning, inference, and image generation in 2026. Includes memory calculators, quantization tradeoffs, and GPU recommendations by model size.
The most common question we get from teams standing up their first AI infrastructure: "How much GPU memory do we actually need?" It sounds simple. It is not. VRAM requirements depend on model size, precision, training vs inference, batch size, sequence length, optimizer state, and whether you are using gradient checkpointing. This guide gives you the formulas, the rough rules of thumb, and the GPU recommendations for every major use case.
The Memory Budget: What Actually Consumes VRAM
GPU memory is consumed by four distinct categories, and the balance between them changes completely depending on whether you are training or running inference:
For Inference
- Model weights — The dominant cost. Fixed at load time.
- KV cache — Attention key/value states for generated tokens. Scales with batch size × sequence length × layers.
- Activations — Intermediate values during the forward pass. Usually small for inference.
- Framework overhead — PyTorch/CUDA runtime state. Usually 1-2GB.
For Training
- Model weights — Same as inference.
- Optimizer state — Adam stores two additional FP32 tensors the size of the weights (momentum and variance) — 8 bytes per parameter, or 4× your BF16 weight memory.
- Gradients — One copy of weights again.
- Activations — Much larger than inference; every intermediate value must be stored for the backward pass (unless using gradient checkpointing).
Model Weight Memory: The Rule of Thumb
For the model weights themselves, the calculation is straightforward:
- FP32: 4 bytes × number of parameters
- FP16 / BF16: 2 bytes × number of parameters
- FP8: 1 byte × number of parameters
- INT8: 1 byte × number of parameters
- INT4 (GPTQ/AWQ): 0.5 bytes × number of parameters
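The rule of thumb above is simple enough to write as a one-line calculator (using decimal gigabytes, matching the rounded figures in the table below):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Model weight memory in GB: parameter count x bytes per parameter.

    bytes_per_param: 4.0 (FP32), 2.0 (FP16/BF16), 1.0 (FP8/INT8), 0.5 (INT4).
    """
    return params_billions * 1e9 * bytes_per_param / 1e9

# 7B model at FP16 vs INT4:
print(weight_memory_gb(7, 2.0))   # 14.0
print(weight_memory_gb(7, 0.5))   # 3.5
```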
Quick reference for popular model sizes:
| Model Size | FP16/BF16 | INT8 | INT4 | Minimum GPU |
|---|---|---|---|---|
| 7B parameters | ~14GB | ~7GB | ~3.5GB | RTX 4090 (24GB) at FP16 |
| 13B parameters | ~26GB | ~13GB | ~6.5GB | L40S (48GB) at FP16 comfortably |
| 30B parameters | ~60GB | ~30GB | ~15GB | 1× H100 (80GB) at FP16; 1× L40S (48GB) at INT8 |
| 70B parameters | ~140GB | ~70GB | ~35GB | 2× H100 at FP16; 1× H100 at INT8; 1× MI300X at FP16 |
| 180B parameters | ~360GB | ~180GB | ~90GB | 5× H100 at FP16; 3× MI300X at FP16 |
| 405B parameters | ~810GB | ~405GB | ~203GB | 11× H100; 5× MI300X; 5× B200 |
Note: Add 10-20% for KV cache and framework overhead in inference scenarios.
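That 10-20% overhead is dominated by the KV cache, which you can estimate directly from the scaling described earlier (2 tensors per layer, K and V, per token). The example below assumes a Llama-2-7B-style configuration — 32 layers, 32 KV heads, head dimension 128 — purely for illustration; models using grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: two tensors (K and V) per layer, per token."""
    total_bytes = 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / 1e9

# Llama-2-7B-style config, batch 1, 4096-token context, FP16:
print(round(kv_cache_gb(1, 4096, 32, 32, 128), 2))  # 2.15
```

Note how the cache grows linearly with both batch size and sequence length — serving 16 concurrent 4K-token requests on this config would consume ~34GB for the cache alone.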
Training Memory: The Multiplier Effect
Training requires far more memory than inference for the same model. Here is the breakdown for a 7B parameter model in BF16:
- Model weights (BF16): 14GB
- Gradients (BF16): 14GB
- Adam optimizer state (FP32): 56GB (2 moments × 4 bytes × 7B params)
- Activations (varies): 8-30GB depending on batch size and sequence length
- Total for full fine-tune: 92-114GB — requires 2× H100 or 1× MI300X
This is why "a GPU with 24GB can run a 7B model for inference" does not mean "a GPU with 24GB can fine-tune a 7B model." For training, you are looking at 6-8× the memory of the weights alone.
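The 7B breakdown above can be generalized into a rough estimator. It assumes BF16 weights and gradients with FP32 Adam state (8 bytes/param), matching the figures in the list; activations are passed as a (low, high) range since they depend on batch size and sequence length.

```python
def full_finetune_gb(params_billions: float,
                     activation_gb: tuple = (8, 30)) -> tuple:
    """Rough full-fine-tune memory range in GB for BF16 + FP32-Adam training.

    Per parameter: 2 bytes (weights) + 2 bytes (gradients) + 8 bytes (Adam
    momentum and variance in FP32) = 12 bytes, i.e. 12 GB per billion params.
    """
    fixed = params_billions * (2 + 2 + 8)
    return fixed + activation_gb[0], fixed + activation_gb[1]

print(full_finetune_gb(7))  # (92, 114) — the 7B example above
```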
Techniques That Reduce Training Memory
Gradient checkpointing (also called activation recomputation): Instead of storing all activations, recompute them during the backward pass. Reduces activation memory by 4-10× at the cost of ~33% more compute. Almost always worth it for large models.
LoRA / QLoRA fine-tuning: Rather than training all parameters, freeze the base model and train small low-rank adapter matrices. A QLoRA fine-tune of a 70B model can run on a single 80GB H100 — what would normally require 8+ GPUs.
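To see why LoRA is so cheap, count its trainable parameters: each adapted weight matrix gets two low-rank factors A (d × r) and B (r × d), so 2·d·r parameters per matrix. The sketch below assumes square d_model × d_model target matrices (e.g. the attention projections) and a Llama-2-7B-like shape; real models vary, so treat it as an order-of-magnitude estimate.

```python
def lora_trainable_params(n_layers: int, d_model: int, rank: int,
                          n_target_matrices: int = 4) -> int:
    """Trainable LoRA parameters: 2 * d_model * rank per adapted matrix."""
    return n_layers * n_target_matrices * 2 * d_model * rank

# 32 layers, d_model 4096, rank 16, adapting the q/k/v/o projections:
n = lora_trainable_params(32, 4096, 16)
print(n)  # 16777216 — ~16.8M params, roughly 0.24% of a 7B base model
```

With so few trainable parameters, the gradient and optimizer-state multipliers from the previous section apply only to the adapters, not the frozen base weights — which is the entire trick.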
ZeRO optimizer (DeepSpeed): Shards the optimizer state, gradients, and optionally weights across multiple GPUs. ZeRO-3 can train models 4-8× larger than would fit on a single GPU, at the cost of more inter-GPU communication.
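ZeRO-3's effect is easy to estimate: full model state (BF16 weights and gradients plus FP32 master weights and Adam moments) is roughly 16 bytes per parameter, and ZeRO-3 divides it evenly across GPUs. This sketch ignores activations and framework overhead, so real per-GPU usage is higher.

```python
def zero3_per_gpu_gb(params_billions: float, n_gpus: int) -> float:
    """Per-GPU model-state memory under ZeRO-3.

    ~16 bytes/param total (2 weights + 2 grads + 4 master + 8 Adam moments),
    sharded across all GPUs. Activations are NOT included.
    """
    return params_billions * 16 / n_gpus

# 13B model sharded across an 8-GPU node:
print(zero3_per_gpu_gb(13, 8))  # 26.0 GB per GPU — fits 80GB cards easily
```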
VRAM Requirements by Use Case
Local Development / Experimentation
Recommended: 24-48GB VRAM (RTX 4090, or 2× RTX 4090)
For running 7B-13B models at inference, experimenting with LoRA fine-tuning, and prototyping inference pipelines. The RTX 4090 at 24GB is the sweet spot for developers — cheap enough to put in a workstation, powerful enough for meaningful work.
Production Inference: Small-Medium Models (≤30B)
Recommended: NVIDIA L40S (48GB) or A100 80GB
The L40S has Ada Lovelace architecture with 48GB of GDDR6 memory (not HBM) — strong inference throughput, lower cost than H100, and 48GB fits 13B models at FP16 or 30B at INT8 comfortably.
Production Inference: Large Models (70B+)
Recommended: NVIDIA H100/H200 SXM5 or AMD MI300X
The choice comes down to precision requirements. If you need FP16 quality, MI300X's 192GB makes it the most memory-efficient single-GPU option. If you are comfortable with INT8 quantization, an H100 at 80GB serves a 70B model with a typical 1-3% quality drop (model-dependent).
LLM Training from Scratch (1B-70B)
Recommended: 8× H100 SXM5 node (640GB total) or 8× MI300X (1,536GB total)
Full pretraining requires significant cluster infrastructure. At ~16 bytes/param of model state, a single 8-GPU H100 node (640GB total) can hold roughly a 40B model with ZeRO-3; larger models need CPU/NVMe offloading or additional nodes. For 70B models, multi-node clusters with gradient checkpointing and ZeRO-3 are the common configuration.
Fine-tuning (LoRA/QLoRA)
Recommended: 1-2× H100, A100, or MI300X
QLoRA fine-tuning is remarkably memory-efficient. A single A100 80GB can fine-tune models up to roughly 70B parameters with QLoRA (the 4-bit base weights alone are ~35GB). For most teams, this is the most practical entry point into large model customization.
The Quantization Tradeoff: Speed vs Quality
Quantization reduces memory and increases throughput at the cost of some output quality. The practical tradeoffs:
| Precision | Memory vs FP16 | Quality Impact | Recommended For |
|---|---|---|---|
| FP16/BF16 | 1× (baseline) | None (baseline) | Training, quality-sensitive inference |
| FP8 | 0.5× | Minimal (<1% perplexity change) | Training with Transformer Engine; Ada/Hopper/Blackwell hardware |
| INT8 (LLM.int8) | 0.5× | Low (1-3% quality drop) | Production inference where memory is the bottleneck |
| INT4 (GPTQ/AWQ) | 0.25× | Moderate (3-8% quality drop) | Edge deployment, memory-extremely-constrained inference |
| INT4 (QLoRA training) | 0.25× (weights only) | Low for fine-tuning | Memory-efficient fine-tuning of large models |
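To make the quality column concrete, here is a toy symmetric per-tensor INT8 scheme: scale by max |x| / 127, round, and dequantize. This is only a sketch of the core idea — production schemes such as LLM.int8(), GPTQ, and AWQ quantize per channel or per group and handle outlier values far more carefully.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (toy sketch)."""
    scale = max(abs(v) for v in values) / 127
    q = [round(v / scale) for v in values]  # integers in [-127, 127]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.5, 0.003]
q, s = quantize_int8(weights)
restored = dequantize_int8(q, s)
# Every restored value sits within half a quantization step of the original:
print(max(abs(a - b) for a, b in zip(weights, restored)) <= s / 2)  # True
```

The error bound (half a step, scale/2) is what shrinks 4× when you move from INT4's 16 levels per sign to INT8's 128 — the memory/quality tradeoff in the table made explicit.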
Quick Decision Guide
Match your model size and use case to a starting point:
- Inference, ≤7B: Any modern GPU with 16GB+ VRAM. RTX 4080 or better.
- Inference, 13B at FP16: 1× L40S (48GB), A100 (80GB), or H100 (80GB).
- Inference, 70B at FP16: 1× MI300X (192GB) — the only single-GPU option at full precision.
- Inference, 70B at INT8: 1× H100 SXM5 (80GB) — tight fit, use paged attention.
- Fine-tuning (LoRA), any model up to 70B: 1× A100 or H100 with QLoRA.
- Full fine-tuning, 7B: 2× H100 or A100 (80GB) with ZeRO-2/3, or 1× MI300X.
- Pretraining, 70B+: 64-GPU H100 or MI300X cluster minimum.
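The inference entries in this guide reduce to a simple capacity check: weights plus the 10-20% KV-cache/runtime overhead from earlier (the low end, 1.1×, is used here), divided by per-GPU memory. This is planning arithmetic only — real deployments also weigh interconnect, throughput targets, and tensor-parallel sharding efficiency.

```python
import math

def gpus_needed(params_billions: float, bytes_per_param: float,
                gpu_gb: float, overhead: float = 1.1) -> int:
    """Minimum GPU count to hold a model for inference, with runtime overhead."""
    need_gb = params_billions * bytes_per_param * overhead
    return math.ceil(need_gb / gpu_gb)

print(gpus_needed(70, 2, 192))  # 1 — 70B FP16 on MI300X
print(gpus_needed(70, 2, 80))   # 2 — 70B FP16 on H100
print(gpus_needed(7, 2, 24))    # 1 — 7B FP16 on RTX 4090
```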
For a personalized recommendation based on your specific model and budget, use our GPU Finder tool. For detailed GPU-by-GPU specs and memory capacity, browse our GPU specification pages or compare two GPUs head-to-head on the GPU Comparator.
Try Our GPU Tools
Compare GPUs, calculate TCO, and get AI-powered recommendations.