Data Center · 2026-04-17 · 15 min read

V100 to H100 Upgrade in 2026: Real TCO Numbers and When to Switch

Running NVIDIA V100 clusters? Here is exactly when upgrading to H100 pays off, with detailed performance comparisons, real cloud pricing, and a 3-year TCO model.

The V100 was the defining GPU of the 2018–2022 era. If you run machine learning infrastructure today, there is a good chance some portion of your fleet is still V100 — either owned hardware or cloud instances you have been reluctant to move off because the economics looked acceptable and migration carries risk.

This guide is for people who need to make a concrete decision: stay on V100, upgrade to H100, or consider something in between. I am going to give you real numbers, not marketing comparisons.

V100 vs H100: The Actual Performance Gap

The spec sheet says H100 SXM5 delivers 3,958 FP8 TFLOPS (with sparsity) and 989 dense FP16 TFLOPS (Tensor Core). V100 SXM2 delivers 125 FP16 TFLOPS (Tensor Core). On dense FP16 alone, that is an approximately 8× raw compute gap.

In practice, the gap on real workloads is closer to 4–6×, not 8×, because:

  • Real workloads are not purely compute-bound — they mix compute, memory access, and communication
  • H100's software optimizations (FlashAttention 2, TensorRT-LLM, FP8 kernels) need to be enabled to realize peak throughput
  • Memory bandwidth also constrains performance: H100's 3,350 GB/s vs V100's 900 GB/s gives a 3.7× bandwidth advantage
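To illustrate why the realized gap lands below the 8× spec ratio, here is a simple two-resource model that splits runtime between compute-bound and bandwidth-bound work. The 60/40 split below is an assumption for illustration, not a measurement:

```python
def effective_speedup(compute_ratio: float, bw_ratio: float,
                      compute_fraction: float) -> float:
    """Harmonic-mean speedup when a fraction of runtime is compute-bound
    and the rest is memory-bandwidth-bound (a simplified roofline view)."""
    bw_fraction = 1.0 - compute_fraction
    # Each component of runtime shrinks by its own hardware ratio.
    return 1.0 / (compute_fraction / compute_ratio + bw_fraction / bw_ratio)

# H100 vs V100: ~7.9x dense FP16 compute, ~3.7x memory bandwidth.
# Assuming 60% of runtime is compute-bound:
print(round(effective_speedup(7.9, 3.7, 0.6), 1))  # ~5.4
```

A 5.4× estimate lands squarely in the 4–6× range observed on real workloads; shifting the split toward bandwidth-bound work pulls the number lower.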

Benchmark: LLaMA 3 70B Training

| GPU | Throughput (tokens/sec, 8 GPUs) | Training time (300B tokens) | Cost (Lambda Labs) |
|---|---|---|---|
| V100 SXM2 32GB ×8 | ~8,000 | ~435 days | ~$0.64/hr/GPU |
| A100 SXM4 80GB ×8 | ~28,000 | ~124 days | ~$1.80/hr/GPU |
| H100 SXM5 80GB ×8 | ~55,000 | ~63 days | ~$2.49/hr/GPU |

Key takeaway: H100 trains LLaMA 70B in 63 days vs V100's 435 days — a 6.9× speedup. But at 3.9× the hourly cost, the cost per token trained is actually 1.8× better on H100. More speed and better cost efficiency.
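The 1.8× cost-per-token figure can be reproduced directly from the table above (the helper name is ours; prices and throughputs are from the benchmark):

```python
def dollars_per_million_tokens(hourly_per_gpu: float, gpus: int,
                               tokens_per_sec: float) -> float:
    """Training cost per million tokens from cluster price and throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return (hourly_per_gpu * gpus) / tokens_per_hour * 1_000_000

v100 = dollars_per_million_tokens(0.64, 8, 8_000)   # ~$0.18 per M tokens
h100 = dollars_per_million_tokens(2.49, 8, 55_000)  # ~$0.10 per M tokens
print(round(v100 / h100, 1))  # ~1.8
```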

Benchmark: Inference — Llama 3 8B (batch size 32)

| GPU | Tokens/sec | $/million tokens |
|---|---|---|
| V100 SXM2 32GB | ~620 | ~$0.29 |
| A100 80GB | ~2,100 | ~$0.24 |
| H100 SXM5 | ~4,800 | ~$0.14 |
| L40S (48GB) | ~3,200 | ~$0.12 |

H100 produces 7.7× more tokens per second than V100, at 3.9× the hourly rate — giving 2× better cost efficiency per token. L40S is even better for inference at roughly $0.12/million tokens.

When Should You Actually Upgrade?

Not everyone should rush to upgrade. Here is a decision framework:

Upgrade NOW if:

  • You are training models with more than 7B parameters regularly — V100's 32GB becomes a bottleneck requiring heavy tensor/pipeline parallelism
  • Your training jobs take more than 2 weeks — the compute efficiency gap means you are paying more total dollars on V100 for the same result
  • You need BF16 training (not supported on V100 — Volta lacks native BF16 Tensor Cores)
  • You are using FlashAttention 2, FP8, or any Hopper-specific kernel
  • Your V100 hardware is past 4 years old and starting to see DRAM errors

Stay on V100 if:

  • You run inference on models ≤3B parameters — V100 handles this adequately
  • Your workloads are not transformer-based (CNNs, RNNs) and do not use Tensor Core operations
  • You are cloud-only and already getting V100 spot at <$0.40/hr — the math may still work for short experiments
  • Migration risk is high (custom CUDA kernels, specific CUDA version dependencies) and stability is paramount

Consider A100 as a middle step if:

  • Budget is tight but V100 is clearly the bottleneck
  • A100 spot pricing on Lambda/CoreWeave runs around $1.00–1.40/hr — substantially less than H100
  • A100 adds BF16 native support, 80GB VRAM (vs V100's 32GB), and 3× the FP16 Tensor Core throughput
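The decision framework above can be sketched as a small helper. The thresholds come straight from the bullets; the function name and parameter names are ours:

```python
def upgrade_recommendation(model_params_b: float, job_weeks: float,
                           needs_bf16_or_fp8: bool, hw_age_years: float,
                           budget_tight: bool) -> str:
    """Rough decision helper mirroring the upgrade criteria above."""
    needs_upgrade = (model_params_b > 7        # >7B params: 32GB is a bottleneck
                     or job_weeks > 2          # long jobs: efficiency gap dominates
                     or needs_bf16_or_fp8      # Volta lacks BF16/FP8
                     or hw_age_years > 4)      # aging hardware, DRAM errors
    if not needs_upgrade:
        return "stay on V100"
    return "A100 80GB" if budget_tight else "H100 SXM5"

print(upgrade_recommendation(70, 8, True, 5, budget_tight=False))  # H100 SXM5
```

Real decisions involve migration risk and spot-price dynamics that a five-line function cannot capture, but it makes the branch points explicit.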

The 3-Year TCO Model

Let us model 8 GPUs running at 80% utilization for 3 years on cloud (Lambda Labs pricing):

| Scenario | GPU | Hourly cost (8 GPUs) | Annual compute cost | 3-year total | Effective TFLOPS/$ over 3yr |
|---|---|---|---|---|---|
| Stay on V100 | 8× V100 @ $0.64 | $5.12/hr | $35,900 | $107,700 | 1.0× (baseline) |
| Upgrade to A100 | 8× A100 @ $1.80 | $14.40/hr | $100,900 | $302,600 | 2.1× |
| Upgrade to H100 | 8× H100 @ $2.49 | $19.92/hr | $139,500 | $418,500 | 3.2× |
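The annual figures follow from hourly price × hours in a year × utilization. A quick check (small differences from the table are rounding):

```python
HOURS_PER_YEAR = 8760
UTILIZATION = 0.80

def annual_cost(hourly_8gpu: float) -> float:
    """Annual cloud spend for an 8-GPU node at 80% utilization."""
    return hourly_8gpu * HOURS_PER_YEAR * UTILIZATION

for name, rate in [("V100", 5.12), ("A100", 14.40), ("H100", 19.92)]:
    yearly = annual_cost(rate)
    print(f"{name}: ${yearly:,.0f}/yr, ${yearly * 3:,.0f} over 3yr")
```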

Pure cost: V100 is cheapest. But the real question is cost per unit of work done. If H100 trains a model 6× faster, you spend 3.9× more per hour but get 6× more done — meaning you can complete 3 years of V100 work in under 6 months on H100.

For research teams with deadlines, the time-to-result advantage of H100 often justifies the cost premium even when the dollar/TFLOP looks worse.

Migration Considerations

Code compatibility

Standard PyTorch/CUDA code runs on H100 without modification. Exceptions:

  • Custom CUDA kernels that target Volta (SM70) need a recompile targeting Hopper (SM90)
  • Any code that uses torch.cuda.amp with float16 but not bfloat16 may need adjustment — H100 prefers BF16
  • Custom INT8 kernels designed for Turing may need updates for Hopper's INT8 Tensor Core instructions
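One practical guard for the float16/bfloat16 point above: select the autocast dtype from the device's compute capability rather than hard-coding it. In a real script the tuple would come from `torch.cuda.get_device_capability()`; the helper itself is plain Python and the logic is a simplification:

```python
def pick_autocast_dtype(capability: tuple[int, int]) -> str:
    """Choose a mixed-precision dtype by compute capability.
    Volta (SM70) lacks BF16 Tensor Cores; Ampere (SM80) and later have them."""
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"

print(pick_autocast_dtype((7, 0)))  # V100 -> float16
print(pick_autocast_dtype((9, 0)))  # H100 -> bfloat16
```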

Library versions

H100 requires CUDA 11.8 minimum; CUDA 12.x is recommended. Ensure your PyTorch version supports CUDA 12. The switch from CUDA 10/11 on older V100 deployments to CUDA 12 on H100 is the most common migration headache.
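A minimal guard for the CUDA 11.8 floor mentioned above. This parses only the version string; in practice you would feed it `torch.version.cuda`:

```python
def cuda_supports_hopper(version: str) -> bool:
    """True if the CUDA toolkit version string meets the 11.8 Hopper floor."""
    major, minor = (int(x) for x in version.split(".")[:2])
    return (major, minor) >= (11, 8)

print(cuda_supports_hopper("11.7"))  # False
print(cuda_supports_hopper("12.1"))  # True
```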

Checkpoint compatibility

Model checkpoints saved on V100 load on H100 without issues. Optimizer states are also portable across GPU generations. The only complication is FP8 checkpoints — if you switch to FP8 training on H100, those checkpoints cannot be loaded on V100.

Decision Summary

For most teams doing active AI development in 2026, V100 is no longer the right primary training platform. The 4–7× throughput disadvantage creates real competitive and time pressure. The upgrade path that makes sense depends on your budget:

  • Max performance: H100 SXM5 — best cost-per-result for LLM training
  • Budget upgrade: A100 80GB — 3× V100 throughput at 2.8× the cost, native BF16
  • Inference only: L40S — beats V100 inference throughput at roughly the same cost

Use our TCO Calculator to model your specific workload and budget, or compare V100 directly against H100 on the V100 vs H100 comparison page.

Tags: V100, H100, upgrade, TCO, NVIDIA, data center, infrastructure
