GPU Sizing for RAG
How much VRAM your RAG system actually needs — by document scale, concurrent users, context length, and model choice. With concrete GPU configs and monthly cost estimates.
VRAM Breakdown by Component
Add these components together to estimate your total VRAM requirement; a small sizing sketch follows the list below.
LLM (Llama 3.1 8B, FP16): ~16GB. Weights only; add KV cache separately.
LLM (Llama 3.1 8B, FP8): ~8GB. Half the weight memory, near-identical quality.
LLM (Llama 3.1 70B, FP16): ~140GB. Needs 2× H100 or 1× MI300X.
LLM (Llama 3.1 70B, FP8): ~70GB. Fits on a single H100 80GB with some headroom.
Embedding model (bge-large-en-v1.5, 335M): ~1.3GB. Can run on the same GPU as the LLM.
Embedding model (text-embedding-3-large): no local VRAM. Served only via the OpenAI API, so no GPU is needed.
KV cache per concurrent user (8K ctx, 8B LLM): ~0.5GB. Scales with batch size and context length.
KV cache per concurrent user (8K ctx, 70B LLM): ~3–5GB. GQA reduces this versus MHA models.
KV cache per concurrent user (32K ctx, 70B LLM): ~12–20GB. Long-context RAG costs more.
Framework overhead (vLLM, Python): ~2–4GB. Fixed overhead regardless of model.
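The arithmetic behind these estimates is simple addition. Here is a minimal Python sketch of it, using the dev/prototype numbers from the next section (8B FP8, 5 concurrent users at 4K context); the function name and default values are illustrative and come from the figures in this list, not from measurements.

```python
# Rough VRAM sizing sketch: sums the component estimates from the list above.
# All figures are approximations from this guide, not measured values.

def estimate_vram_gb(weights_gb: float,
                     kv_per_user_gb: float,
                     concurrent_users: int,
                     embedding_gb: float = 1.3,   # bge-large-en-v1.5
                     overhead_gb: float = 3.0) -> float:
    """Total VRAM = LLM weights + KV cache for all users + embedding model + framework overhead."""
    return weights_gb + kv_per_user_gb * concurrent_users + embedding_gb + overhead_gb

# Dev/prototype example: Llama 3.1 8B FP8, 5 users at 4K context (~0.5GB KV cache each).
print(estimate_vram_gb(weights_gb=8, kv_per_user_gb=0.5, concurrent_users=5))  # ≈ 15GB
```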
GPU Configs by Scale
Dev / Prototype
Documents: < 50K
Concurrent Users: 1–5
LLM: Llama 3.1 8B FP8
Context: 4K
Recommended GPU: 1× L40S (48GB) or 1× A10G (Lambda: ~$1.40/hr)
VRAM Breakdown: 8B FP8 (8GB) + 5× KV (2.5GB) + embed (1.3GB) + overhead (3GB) ≈ 15GB
Small Production
Documents: 50K–500K
Concurrent Users: 10–25
LLM: Llama 3.1 70B FP8
Context: 8K
Recommended GPU: 1× H100 80GB or 1× MI300X (Lambda: $2.49–3.49/hr)
VRAM Breakdown: 70B FP8 (70GB) + 25× KV (75GB); needs the MI300X (192GB) at full utilization
Medium Production
Documents: 500K–5M
Concurrent Users: 50–100
LLM: Llama 3.1 70B FP8
Context: 16K
Recommended GPU: 2× H100 or 1× MI300X + 1× H100 (Lambda: $4.98–6.98/hr)
VRAM Breakdown: 70B FP8 (70GB) + 100× 16K KV (~150GB); distributed across at least 2 GPUs
Large Production
Documents: 5M+
Concurrent Users: 200+
LLM: Llama 3.1 70B or 405B FP8
Context: 32K
Recommended GPU: 4–8× H100 or 4× MI300X, plus a managed vector DB (Lambda: $9.96–27.96/hr)
VRAM Breakdown: distributed inference cluster; vector DB (Weaviate/Pinecone) runs on separate CPU infrastructure
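Turning the hourly Lambda rates above into monthly figures is straightforward: GPU count (estimated VRAM divided by per-GPU capacity, rounded up) times the hourly rate times hours per month. The sketch below is an illustration only; the rates and capacities are the ones quoted in this guide and will drift over time, and real deployments need headroom beyond a raw ceiling division.

```python
import math

# Illustrative cost/count arithmetic using the on-demand rates quoted above.
# Rates and VRAM capacities are assumptions from this guide; check current pricing.
GPU_SPECS = {
    "H100 80GB": {"vram_gb": 80, "usd_per_hr": 2.49},
    "MI300X":    {"vram_gb": 192, "usd_per_hr": 3.49},
}

def gpus_needed(total_vram_gb: float, gpu: str) -> int:
    """Minimum GPU count to hold the estimated VRAM footprint (ignores parallelism overhead)."""
    return math.ceil(total_vram_gb / GPU_SPECS[gpu]["vram_gb"])

def monthly_cost(gpu: str, count: int, hours_per_month: float = 730) -> float:
    """On-demand monthly cost assuming 24/7 operation."""
    return count * GPU_SPECS[gpu]["usd_per_hr"] * hours_per_month

# Small production example: ~150GB total (70B FP8 weights + 25 users of 8K KV cache).
n = gpus_needed(150, "MI300X")
print(n, round(monthly_cost("MI300X", n)))  # 1 GPU, ≈ $2,548/month
```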
Vector Database by Scale
pgvector: < 1M vectors, no GPU needed. Postgres extension; best for < 500K docs; CPU-only ANN search.
Chroma: < 2M vectors, no GPU needed. Simple local deployment; good for dev and small production.
Qdrant: < 50M vectors, CPU (GPU optional). High-performance Rust-based engine; GPU search acceleration is optional.
Weaviate: < 500M vectors, GPU for embedding inference. Managed cloud or self-hosted; built-in embedding model serving.
Pinecone: billions of vectors, fully managed. Managed cloud only; no infra to manage; premium pricing.
Milvus: billions of vectors, GPU-accelerated search. Open source; GPU-accelerated similarity search at scale.
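At the CPU-only end of this table, a minimal pgvector setup looks like the sketch below. It assumes Postgres with the pgvector extension available, the psycopg and pgvector Python packages, and a hypothetical chunks table holding 1024-dimensional bge-large-en-v1.5 embeddings; treat it as a sketch, not a production schema.

```python
import numpy as np
import psycopg                           # psycopg 3
from pgvector.psycopg import register_vector

# Minimal pgvector sketch: store chunk embeddings, search with a CPU-only HNSW index.
# Connection string, table name, and 1024-dim embeddings (bge-large-en-v1.5) are assumptions.
conn = psycopg.connect("postgresql://localhost/rag", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets us pass numpy arrays as vector parameters

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1024)
    )
""")
# HNSW index for approximate nearest-neighbor search; runs entirely on the CPU.
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_hnsw ON chunks USING hnsw (embedding vector_cosine_ops)"
)

def top_k(query_embedding, k: int = 5):
    """Return the k chunks closest to the query embedding by cosine distance (<=> operator)."""
    vec = np.asarray(query_embedding, dtype=np.float32)
    return conn.execute(
        "SELECT id, content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (vec, k),
    ).fetchall()
```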
Architecture Decisions
Context length is the biggest VRAM multiplier
Going from 4K to 32K context per user increases the KV cache 8×. For 50 concurrent users with Llama 3.1 70B, 4K context needs roughly 75–125GB of KV cache; 32K context needs roughly 600GB–1TB (applying the per-user figures above). Use the shortest context that serves your use case.
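The per-user figures above follow directly from the model architecture. The sketch below uses Llama 3.1 70B's configuration (80 layers, 8 KV heads via GQA, head dimension 128) with FP16 KV entries; real serving stacks add paged-attention fragmentation and activation buffers on top, so treat the result as a lower bound.

```python
def kv_cache_gb(context_tokens: int,
                num_layers: int = 80,      # Llama 3.1 70B
                num_kv_heads: int = 8,     # GQA: 8 KV heads instead of 64 query heads
                head_dim: int = 128,
                bytes_per_value: int = 2) -> float:  # FP16; use 1 for FP8 KV cache
    """Per-user KV cache: 2 (K and V) x layers x KV heads x head dim x bytes x tokens."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9

# One user of Llama 3.1 70B at different context lengths (FP16 KV, before server overhead):
for ctx in (4096, 8192, 32768):
    print(ctx, round(kv_cache_gb(ctx), 2), "GB")
# 4096 ≈ 1.3GB, 8192 ≈ 2.7GB, 32768 ≈ 10.7GB per user
```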
MI300X is ideal for single-GPU RAG at scale
MI300X's 192GB VRAM fits Llama 3.1 70B FP8 (70GB) + 25 concurrent users at 8K context (100GB KV cache) on a single GPU — avoiding multi-GPU tensor parallelism for inference. Lambda Labs price: $3.49/hr.
Split embedding and generation services in production
In production, run your embedding model on a small dedicated GPU (L4 at $0.68/hr) and your LLM on H100/MI300X. This prevents embedding batch jobs from evicting LLM KV cache and allows independent scaling.
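One way to wire up the split is to expose both services behind OpenAI-compatible endpoints (vLLM provides one for the LLM, and many embedding servers do the same) and point two clients at them. The host names, ports, and model identifiers below are placeholders, not a prescribed deployment.

```python
from openai import OpenAI

# Two independent OpenAI-compatible endpoints: a small GPU serves embeddings,
# a large GPU (or GPUs) serves generation. Hosts, ports, and model names are placeholders.
embed_client = OpenAI(base_url="http://embedding-host:8001/v1", api_key="unused")
llm_client = OpenAI(base_url="http://llm-host:8000/v1", api_key="unused")

def retrieve_and_answer(question: str, search_fn) -> str:
    """Embed the query on the embedding service, retrieve, then generate on the LLM service."""
    emb = embed_client.embeddings.create(
        model="bge-large-en-v1.5", input=question
    ).data[0].embedding
    context = "\n\n".join(search_fn(emb))  # search_fn: your vector DB lookup
    resp = llm_client.chat.completions.create(
        model="llama-3.1-70b-instruct",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```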
Vector DB is almost never the bottleneck
For < 5M documents, CPU-based vector search (Qdrant, Chroma) retrieves in < 50ms. The LLM generation takes 500ms–5s. GPU-accelerate the LLM first, not the vector DB.
FAQs
How much GPU VRAM do I need for a RAG system?
Minimum VRAM = LLM weights + KV cache for all concurrent users + embedding model + framework overhead. For Llama 3.1 8B (FP8) serving 10 concurrent users at 4K context: 8GB (weights) + 5GB (KV cache) + 1.3GB (embeddings) + 3GB (overhead) ≈ 17GB. A single L40S (48GB) or H100 (80GB) handles this easily. For Llama 3.1 70B (FP8) with 25 concurrent users at 8K context, you need roughly 140–200GB of total VRAM (2× H100 or 1× MI300X).
Should the LLM and embedding model run on the same GPU?
For small-scale RAG (< 20 concurrent users), yes — the embedding model is tiny (1–2GB) and easily shares GPU memory with the LLM. At production scale, it's better to have a dedicated embedding inference service (separate L4 or T4 GPU for ~$0.40/hr) to avoid contention. Alternatively, use OpenAI or Cohere for embeddings via API if latency requirements allow.
What is the best GPU for a RAG system serving 100 concurrent users?
For 100 concurrent users at 8K context using Llama 3.1 70B FP8: the KV cache alone needs ~350–500GB (100 users × 3.5–5GB each). That requires at least 2× MI300X (384GB total) with careful memory management, or 4–8× H100 (320–640GB) with tensor parallelism. A simpler option: use a smaller model (8B FP8 at ~0.5GB KV cache per user), which fits 100 concurrent users on a single MI300X (192GB).
Do I need a GPU for the vector database in RAG?
For most RAG deployments, no. Vector similarity search (ANN) runs efficiently on CPUs for collections under 10M vectors. GPU-accelerated vector search (Milvus, Qdrant GPU) becomes beneficial above roughly 50M vectors, where GPU parallelism meaningfully reduces search latency. LLM generation is the primary GPU-intensive step in RAG; vector retrieval is usually CPU-bound.