RAG Infrastructure Guide

GPU Sizing for RAG

How much VRAM your RAG system actually needs — by document scale, concurrent users, context length, and model choice. With concrete GPU configs and monthly cost estimates.

VRAM Breakdown by Component

Add these together to estimate your total VRAM requirement.

| Component | VRAM | Notes |
| --- | --- | --- |
| LLM (Llama 3.1 8B, FP16) | ~16GB | Weights only; add KV cache separately. |
| LLM (Llama 3.1 8B, FP8) | ~8GB | Half the weight memory, near-identical quality. |
| LLM (Llama 3.1 70B, FP16) | ~140GB | Needs 2× H100 or 1× MI300X. |
| LLM (Llama 3.1 70B, FP8) | ~70GB | Fits on a single H100 80GB with some headroom. |
| Embedding model (bge-large-en-v1.5, 335M) | ~1.3GB | Can run on the same GPU as the LLM. |
| Embedding model (text-embedding-3-large) | 0GB local | Served via the OpenAI API; no GPU needed. |
| KV cache per concurrent user (8K ctx, 8B LLM) | ~0.5–1GB | Scales linearly with context length; lower bound assumes an FP8 KV cache. |
| KV cache per concurrent user (8K ctx, 70B LLM) | ~3–5GB | GQA reduces this vs. MHA models. |
| KV cache per concurrent user (32K ctx, 70B LLM) | ~12–20GB | Long-context RAG is expensive in KV cache. |
| Framework overhead (vLLM, Python) | ~2–4GB | Fixed overhead regardless of model. |
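
The KV-cache rows follow directly from model architecture: per token, the cache stores keys and values for every layer and KV head, so bytes per token = 2 × layers × KV heads × head dim × bytes per value. Below is a minimal sketch of that arithmetic, using the published Llama 3.1 configurations (80 layers / 8 KV heads / head dim 128 for 70B; 32 layers / 8 KV heads / head dim 128 for 8B) and assuming an FP16 KV cache. The table's per-user figures sit a little above these raw numbers because they include allocator headroom.

```python
# Back-of-the-envelope VRAM sizing for a RAG serving stack.
# Architecture numbers are the published Llama 3.1 configs; adjust for other models.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Per-user KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / 1e9

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head dim 128
print(kv_cache_gb(80, 8, 128, 8_192))    # ~2.7GB/user at 8K, FP16 KV cache
print(kv_cache_gb(80, 8, 128, 32_768))   # ~10.7GB/user at 32K

# Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128
print(kv_cache_gb(32, 8, 128, 4_096))    # ~0.5GB/user at 4K

def total_vram_gb(weights_gb: float, users: int, kv_per_user_gb: float,
                  embed_gb: float = 1.3, overhead_gb: float = 3.0) -> float:
    """Sum the table's components: weights + KV cache for all users + embedding model + framework."""
    return weights_gb + users * kv_per_user_gb + embed_gb + overhead_gb

# Standard tier: 70B FP8 weights, 25 users at 8K context
print(total_vram_gb(weights_gb=70, users=25,
                    kv_per_user_gb=kv_cache_gb(80, 8, 128, 8_192)))  # ≈141GB, close to the Standard tier below
```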

GPU Configs by Scale

Starter (Dev / Prototype)

Documents: < 50K
Concurrent Users: 1–5
LLM: Llama 3.1 8B FP8
Context: 4K
Recommended GPU: 1× L40S (48GB) or 1× A10G (Lambda: ~$1.40/hr)
VRAM Breakdown: 8B FP8 (8GB) + 5× KV cache (2.5GB) + embedding (1.3GB) + overhead (3GB) ≈ 15GB
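
If you serve this tier with vLLM, the sizing above maps onto a few engine arguments. This is a sketch under stated assumptions: the model name, memory fraction, and FP8 support depend on your vLLM version, GPU, and checkpoint, so treat the values as illustrative.

```python
# Sketch of a vLLM engine sized for the Starter tier: Llama 3.1 8B, FP8, 4K context.
# Values are illustrative; check your vLLM version for exact option behavior.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",          # halves weight memory (~16GB -> ~8GB)
    max_model_len=4096,          # 4K context keeps per-user KV cache around 0.5GB
    gpu_memory_utilization=0.90, # leave headroom for the embedding model on the same GPU
    # tensor_parallel_size=2,    # only needed for the multi-GPU 70B tiers below
)

out = llm.generate(
    ["Answer using the retrieved context: ..."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(out[0].outputs[0].text)
```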

Standard (Small Production)

Documents: 50K–500K
Concurrent Users: 10–25
LLM: Llama 3.1 70B FP8
Context: 8K
Recommended GPU: 1× MI300X or 1× H100 80GB at reduced concurrency (Lambda: $2.49–3.49/hr)
VRAM Breakdown: 70B FP8 (70GB) + 25× KV cache (75GB) ≈ 145GB, which needs an MI300X (192GB) at full concurrency; a single H100 80GB only works with far fewer simultaneous users

Scale (Medium Production)

Documents: 500K–5M
Concurrent Users: 50–100
LLM: Llama 3.1 70B FP8
Context: 16K
Recommended GPU: 2× H100 or 1× MI300X + 1× H100 (Lambda: $4.98–6.98/hr)
VRAM Breakdown: 70B FP8 (70GB) + KV cache for 100 users at 16K (~150GB in practice, since paged KV cache only allocates what each request actually uses) → distributed across at least 2 GPUs

Enterprise (Large Production)

Documents: 5M+
Concurrent Users: 200+
LLM: Llama 3.1 70B or 405B FP8
Context: 32K
Recommended GPU: 4–8× H100 or 4× MI300X, plus a managed vector DB (Lambda: $9.96–27.96/hr)
VRAM Breakdown: Distributed inference cluster; the vector DB (Weaviate/Pinecone) runs on separate CPU infrastructure

Vector Database by Scale

| Database | Max Vectors | GPU Need | Best For |
| --- | --- | --- | --- |
| pgvector | < 1M | No GPU needed | Postgres extension. Best for < 500K docs. CPU-only ANN search. |
| Chroma | < 2M | No GPU needed | Simple local deployment. Good for dev and small production. |
| Qdrant | < 50M | CPU (GPU optional) | High-performance, Rust-based. Optional GPU search acceleration. |
| Weaviate | < 500M | GPU for embedding inference | Managed cloud or self-hosted. Built-in embedding model serving. |
| Pinecone | Billions | Fully managed | Managed cloud only. No infra to manage. Premium pricing. |
| Milvus | Billions | GPU-accelerated search | Open source. GPU-accelerated similarity search at scale. |
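
For the smaller tiers, retrieval stays entirely on CPU. Below is a minimal, illustrative Qdrant sketch; the collection name and in-memory mode are placeholders, and 1024 is the embedding dimension of bge-large-en-v1.5.

```python
# Minimal CPU-only vector search with Qdrant; no GPU is involved in retrieval.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # embedded mode for dev; point at a server in production

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # bge-large-en-v1.5 dims
)

# Upsert pre-computed chunk embeddings (the vectors come from the embedding model on the GPU).
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.0] * 1024, payload={"text": "example chunk"})],
)

# Retrieve the top-k chunks for a query embedding.
hits = client.search(collection_name="docs", query_vector=[0.0] * 1024, limit=5)
print([h.payload["text"] for h in hits])
```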

Architecture Decisions

Context length is the biggest VRAM multiplier

Going from 4K to 32K context per user increases KV cache 8×. For 50 concurrent users with Llama 70B, 4K context already needs roughly 90–125GB of KV cache; 32K context pushes that to 700GB–1TB. Use the shortest context that serves your use case.
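
To make the multiplier concrete, the sketch below applies the same per-token KV formula from the component table across context lengths for 50 users of a 70B GQA model. It prints the raw FP16 figures, which land somewhat below the prose estimates because real deployments add allocator headroom.

```python
# KV cache grows linearly with context length, so 4K -> 32K is an 8x multiplier.
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

users = 50
for ctx in (4_096, 8_192, 16_384, 32_768):
    per_user = kv_cache_gb(80, 8, 128, ctx)  # Llama 3.1 70B config
    print(f"{ctx // 1024}K context: {per_user:.1f}GB/user, {users * per_user:.0f}GB for {users} users")
# 4K:  ~1.3GB/user,  ~67GB total
# 8K:  ~2.7GB/user, ~134GB total
# 16K: ~5.4GB/user, ~268GB total
# 32K: ~10.7GB/user, ~537GB total
```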

MI300X is ideal for single-GPU RAG at scale

MI300X's 192GB VRAM fits Llama 3.1 70B FP8 (70GB) + 25 concurrent users at 8K context (100GB KV cache) on a single GPU — avoiding multi-GPU tensor parallelism for inference. Lambda Labs price: $3.49/hr.

Split embedding and generation services in production

In production, run your embedding model on a small dedicated GPU (L4 at $0.68/hr) and your LLM on H100/MI300X. This prevents embedding batch jobs from evicting LLM KV cache and allows independent scaling.
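
A minimal sketch of that split, with the embedding model pinned to its own GPU via sentence-transformers and generation reached over vLLM's OpenAI-compatible HTTP endpoint. The model names, device index, and host URL are placeholders.

```python
# Sketch: embedding service on a dedicated small GPU, generation on a separate vLLM server.
# Model names, device index, and the server URL are illustrative.
from sentence_transformers import SentenceTransformer
import requests

# Embedding model on its own GPU (e.g. an L4), so batch embedding jobs
# never compete with the LLM's KV cache.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda:0")

def embed(texts: list[str]) -> list[list[float]]:
    return embedder.encode(texts, normalize_embeddings=True).tolist()

def generate(prompt: str) -> str:
    # vLLM exposes an OpenAI-compatible /v1/completions endpoint on the LLM host.
    resp = requests.post(
        "http://llm-host:8000/v1/completions",
        json={"model": "meta-llama/Llama-3.1-70B-Instruct", "prompt": prompt, "max_tokens": 256},
        timeout=60,
    )
    return resp.json()["choices"][0]["text"]
```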

Vector DB is almost never the bottleneck

For < 5M documents, CPU-based vector search (Qdrant, Chroma) retrieves in < 50ms. The LLM generation takes 500ms–5s. GPU-accelerate the LLM first, not the vector DB.

FAQs

How much GPU VRAM do I need for a RAG system?

Minimum VRAM = LLM weights + KV cache for all concurrent users + embedding model + framework overhead. For Llama 3.1 8B (FP8) serving 10 concurrent users at 4K context: 8GB (weights) + 5GB (KV cache) + 1.3GB (embeddings) + 3GB (overhead) ≈ 17GB. A single L40S (48GB) or H100 (80GB) handles this easily. For Llama 3.1 70B with 50 concurrent users at 8K context, you need 140–200GB total VRAM (2× H100 or 1× MI300X).
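
The same arithmetic can be inverted to ask how many concurrent users a given card supports. A sketch using the guide's rough component estimates (not measured values):

```python
# Given a GPU's VRAM, back-solve the concurrent-user budget after weights, embeddings,
# and framework overhead are accounted for. All inputs are the guide's rough estimates.
def max_concurrent_users(gpu_vram_gb: float, weights_gb: float, kv_per_user_gb: float,
                         embed_gb: float = 1.3, overhead_gb: float = 3.0) -> int:
    free = gpu_vram_gb - weights_gb - embed_gb - overhead_gb
    return max(0, int(free // kv_per_user_gb))

# Llama 3.1 8B FP8 (8GB weights), 4K context (~0.5GB KV/user) on one L40S 48GB:
print(max_concurrent_users(48, weights_gb=8, kv_per_user_gb=0.5))    # ~71 users

# Llama 3.1 70B FP8 (70GB weights), 8K context (~4GB KV/user) on one MI300X 192GB:
print(max_concurrent_users(192, weights_gb=70, kv_per_user_gb=4.0))  # ~29 users
```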

Should the LLM and embedding model run on the same GPU?

For small-scale RAG (< 20 concurrent users), yes — the embedding model is tiny (1–2GB) and easily shares GPU memory with the LLM. At production scale, it's better to have a dedicated embedding inference service (separate L4 or T4 GPU for ~$0.40/hr) to avoid contention. Alternatively, use OpenAI or Cohere for embeddings via API if latency requirements allow.

What is the best GPU for a RAG system serving 100 concurrent users?

For 100 concurrent users at 8K context using Llama 3.1 70B FP8: KV cache alone needs ~350–500GB (100 users × 3.5–5GB each). This requires 2× MI300X (384GB total) minimum with careful memory management, or 4× H100 (320GB) with tensor parallelism. A simpler option: use a smaller model (8B FP8 at 0.5GB KV cache per user), which fits 100 concurrent users on a single MI300X (192GB).

Do I need a GPU for the vector database in RAG?

For most RAG deployments, no. Vector similarity search (ANN) runs efficiently on CPUs for collections under 10M vectors. GPU-accelerated vector search (Milvus, Qdrant GPU) becomes beneficial above 50M+ vectors where GPU parallelism meaningfully reduces search latency. The LLM generation is the primary GPU-intensive step in RAG — the vector retrieval is usually CPU-bound.