
Best GPU for DeepSeek R1 & V3 in 2026

DeepSeek R1 and V3 are the largest open-weight models in production use: both are 671B-parameter MoE models (the 685B figure often quoted for V3 counts the multi-token-prediction weights). Their weights alone take ~671GB in FP8, before KV cache, which demands multi-GPU configurations. Here's the ranked guide for production inference cost and performance.

TL;DR

For DeepSeek R1 671B: 4× MI300X is the cheapest production option at ~$3.52/M tokens. 8× H100 works but costs ~$7.90/M tokens. For the distilled 70B variant (DeepSeek-R1-Distill-Llama-70B), a single MI300X at $3.49/hr is the best deal.

TOP 5 GPUS RANKED

#1

AMD Instinct MI300X

AMD · TOP PICK

Best value for full DeepSeek R1 671B inference

Memory

192GB HBM3

FP8 TFLOPS

2,614 TFLOPS

TDP

750W

Cloud Cost

~$3.49/hr

Pros

  • 4× MI300X (768GB) runs DeepSeek R1 671B in FP8 — cheapest production config
  • $13.96/hr for 4 GPUs vs $19.92/hr for 8× H100
  • ~1,050 tok/s on R1 671B → $3.52/M tokens (vs $7.90/M on 8× H100)
  • Single MI300X runs DeepSeek-R1-Distill-70B at full FP16 — no quantization

Cons

  • MoE support on ROCm requires a recent vLLM build (0.4+) with MI300X-specific fixes
  • Expert routing kernels for DeepSeek MoE slightly slower than CUDA
#2

NVIDIA H100 SXM5

NVIDIA

Best ecosystem support — proven with DeepSeek

Memory

80GB HBM3

FP8 TFLOPS

3,958 TFLOPS

TDP

700W

Cloud Cost

~$2.49/hr

Pros

  • Most battle-tested vLLM deployment for DeepSeek (community scripts widely available)
  • 8× H100 (640GB) runs R1 671B with INT4 AWQ quantization
  • TensorRT-LLM DeepSeek support for maximum throughput
  • Best availability and spot pricing (Lambda, RunPod, CoreWeave)

Cons

  • 8× H100 costs $19.92/hr vs $13.96/hr for 4× MI300X
  • INT4 quantization required for R1 on 8× H100 — slight quality loss vs FP8 on MI300X
  • ~700 tok/s on R1 671B → $7.90/M tokens (2.2× more expensive than MI300X)
#3

NVIDIA B200

NVIDIA

Fastest DeepSeek inference — limited availability

Memory

192GB HBM3e

FP8 TFLOPS

4,500 TFLOPS

TDP

1,000W

Cloud Cost

~$6.99/hr (CoreWeave)

Pros

  • 4× B200 (768GB) runs R1 671B FP8 — same as MI300X config but faster
  • ~3,000 tok/s on R1 671B (vs 1,050 for MI300X) — nearly 3× the throughput
  • Excellent TensorRT-LLM DeepSeek support
  • Best if SLA requires low latency

Cons

  • 4× B200 costs ~$27.96/hr, the highest hourly spend in this list (though at ~3,000 tok/s that works out to ~$2.60/M tokens, the cheapest per token at scale)
  • Very limited cloud availability in early 2026
  • High TDP (4,000W for 4 GPUs) requires dedicated infrastructure
#4

AMD Instinct MI355X

AMD

Single-GPU option for full R1 with aggressive quantization

Memory

288GB HBM3e

FP8 TFLOPS

4,610 TFLOPS

TDP

1,400W

Cloud Cost

~$5.50/hr

Pros

  • 288GB HBM3e, the largest single-GPU memory available, holds R1 671B with aggressive sub-4-bit quantization (INT4 alone is ~336GB)
  • Single-GPU R1 671B is possible at sub-4-bit, with no multi-GPU interconnect to manage
  • 2× MI355X (576GB) fits INT4 (~336GB) with ample KV-cache headroom; full FP8 (~671GB) needs a third card
  • Best for teams wanting minimal cluster complexity

Cons

  • 1,400W TDP — high infrastructure requirements
  • Limited cloud availability
  • Sub-4-bit quantization needed for single-GPU R1 degrades reasoning quality noticeably vs FP8
#5

NVIDIA H200 SXM

NVIDIA

Best for DeepSeek distilled 70B models

Memory

141GB HBM3e

FP8 TFLOPS

3,958 TFLOPS

TDP

700W

Cloud Cost

~$4.50/hr

Pros

  • 141GB fits DeepSeek-R1-Distill-70B FP16 on a single GPU
  • 4.8 TB/s bandwidth → faster inference than H100 for 70B
  • Full CUDA ecosystem for DeepSeek distilled variants
  • Better than H100 for large-batch serving of 70B models

Cons

  • 141GB is nowhere near enough for full R1 671B (even INT4 weights are ~336GB)
  • Higher hourly cost than an H100, which can already serve the 70B distilled model in FP8

KEY FACTORS TO CONSIDER

DeepSeek R1 671B memory requirements

DeepSeek R1 is a 671B parameter MoE model with 37B active parameters per forward pass. At one byte per parameter, FP8 or INT8 weights take ~671GB; FP16 doubles that to ~1.34TB; INT4 (AWQ) halves it to ~336GB. KV cache comes on top of all of these. That means 4× MI300X (768GB) or 4× B200 (768GB) can serve the full model in FP8, while 8× H100 (640GB) and 2× MI355X (576GB) both need INT4.
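
As a sanity check, these figures fall out of a one-line calculation. The sketch below is a simplification that ignores KV cache and runtime overhead (often tens of GB in practice):

```python
# Back-of-envelope weight memory for DeepSeek R1 671B at each precision.
# Simplification: memory = parameters x bytes per parameter; KV cache and
# runtime overhead are ignored.

TOTAL_PARAMS = 671e9  # all parameters, including every MoE expert

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = TOTAL_PARAMS * nbytes / 1e9
    print(f"{precision}: ~{gb:,.0f} GB of weights")

# FP16: ~1,342 GB | FP8: ~671 GB | INT8: ~671 GB | INT4: ~336 GB
```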

Distilled models are roughly 10× cheaper per token

DeepSeek released distilled variants: R1-Distill-Llama-70B and R1-Distill-Qwen-32B. These are regular dense transformers (not MoE) trained to mimic R1's reasoning. A single H100 at $2.49/hr serves the 70B distilled model at ~1,800 tok/s → ~$0.38/M tokens (quantized to FP8; full FP16 needs two H100s, one H200, or one MI300X). If you don't need the full 671B model, the distilled 70B is dramatically cheaper.
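
Every per-token figure in this guide comes from the same arithmetic. A small sketch, using the prices and throughputs quoted above (actual numbers vary with batch size, context length, and utilization):

```python
# $/M tokens = hourly price / millions of tokens generated per hour.

def usd_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    millions_per_hour = tokens_per_second * 3600 / 1e6
    return usd_per_hour / millions_per_hour

# (hourly price, sustained tok/s) as quoted in this article
configs = {
    "4x MI300X, R1 671B FP8":    (13.96, 1050),
    "8x H100, R1 671B INT4":     (19.92, 700),
    "4x B200, R1 671B FP8":      (27.96, 3000),
    "1x H100, R1-Distill-70B":   (2.49, 1800),
}

for name, (price, tps) in configs.items():
    print(f"{name}: ${usd_per_million_tokens(price, tps):.2f}/M tokens")
```

Note that at the rounded 1,050 tok/s the MI300X row evaluates to ~$3.69/M; the $3.52/M quoted throughout implies sustained throughput closer to 1,100 tok/s.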

vLLM support for MoE models

vLLM 0.4+ supports DeepSeek's MoE architecture including expert parallelism across multiple GPUs. CUDA performance on H100 is most mature. ROCm/MI300X support is solid in vLLM 0.5+ with DeepSeek-specific optimizations. TensorRT-LLM provides further speedups on NVIDIA but requires more setup time.
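
For orientation, here is a minimal vLLM sketch (offline Python API, vLLM 0.5+) for the full FP8 model. The model ID is DeepSeek's official Hugging Face repo, but the parallelism degree and context cap are illustrative assumptions, not a tuned production config:

```python
# Minimal offline-inference sketch with vLLM's Python API.
# Assumes a node whose pooled VRAM exceeds ~671GB of FP8 weights,
# e.g. 4x MI300X or 4x B200; on 8x H100 use an INT4 AWQ checkpoint instead.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # official FP8 weights
    tensor_parallel_size=4,           # shard weights/experts across the 4 GPUs
    trust_remote_code=True,           # DeepSeek ships custom model code
    max_model_len=8192,               # cap context to bound KV-cache memory
)

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], params)
print(outputs[0].outputs[0].text)
```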

Quantization tradeoffs for R1 671B

FP8 on 4× MI300X or 4× B200 (768GB pooled): highest quality. INT8 is no smaller than FP8 (~671GB either way), so it does not fit on 8× H100 (640GB) and buys nothing where FP8 already fits. INT4 AWQ (~336GB) fits on 8× H100 with headroom at a slight quality cost (~1–2% on benchmarks). Community benchmarks show FP8 outperforms INT4 on complex reasoning tasks, which matters specifically for R1's chain-of-thought outputs.
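
Loading a quantized build in vLLM is the same call with a different checkpoint. In the sketch below the repo name is a placeholder, not an official artifact; substitute whichever community AWQ export you have validated:

```python
from vllm import LLM

# INT4 AWQ variant sized for 8x H100 (640GB pooled VRAM).
llm = LLM(
    model="your-org/DeepSeek-R1-AWQ",  # placeholder: a community AWQ export (~336GB)
    quantization="awq",                # route through vLLM's AWQ kernels
    tensor_parallel_size=8,
    trust_remote_code=True,
    max_model_len=8192,
)
```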

FREQUENTLY ASKED QUESTIONS

How many GPUs do I need to run DeepSeek R1 671B?

Minimum: 4× AMD MI300X (192GB each = 768GB total) in FP8, 4× NVIDIA B200 (192GB each = 768GB total) in FP8, or 8× NVIDIA H100 80GB (640GB total) with INT4 AWQ quantization. 2× MI355X (576GB) also fits INT4 (~336GB); full FP8 (~671GB) needs a third MI355X. A single MI355X (288GB) can run R1 only with sub-4-bit quantization.
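
A rough way to sanity-check these counts (a sketch; the 10% headroom factor for KV cache and runtime overhead is an assumption, and real tensor-parallel deployments usually round up to a power of two):

```python
import math

def min_gpus(weight_gb: float, vram_gb: float, headroom: float = 1.10) -> int:
    """Smallest GPU count whose pooled VRAM holds the weights plus ~10% headroom."""
    return math.ceil(weight_gb * headroom / vram_gb)

print(min_gpus(671, 192))  # FP8 on MI300X or B200 (192GB) -> 4
print(min_gpus(336, 80))   # INT4 on H100 (80GB) -> 5, rounded up to 8 in practice
print(min_gpus(671, 288))  # FP8 on MI355X (288GB) -> 3
```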

What is the cheapest cloud setup to run DeepSeek R1?

4× AMD MI300X on Lambda Labs at $13.96/hr with ~1,050 tokens/sec delivers $3.52/million tokens — the cheapest production option. Compared to 8× H100 at $19.92/hr and ~700 tok/s ($7.90/M tokens), the MI300X setup saves ~55% per token for DeepSeek R1 671B inference.

Can I run DeepSeek R1 on a single GPU?

Not the full 671B model on any currently available single GPU. The closest is the AMD MI355X (288GB), which still needs aggressive sub-4-bit quantization (INT4 weights alone are ~336GB), and quantizing that hard significantly impacts the model's chain-of-thought reasoning quality. The DeepSeek-R1-Distill-Llama-70B variant runs on a single MI300X (192GB), H200 (141GB), or two H100s.

Is AMD or NVIDIA better for DeepSeek inference?

AMD MI300X is better for cost per token on the full R1 671B model due to 192GB VRAM enabling FP8 serving on 4 GPUs vs H100's need for 8 GPUs with INT4. NVIDIA is better if you need maximum throughput (H100 with TensorRT-LLM or B200 if available) or if you're running smaller distilled DeepSeek variants where H100's per-hour cost advantage matters.
