HBM3e Memory Explained: Why It Matters More Than TFLOPS for AI Workloads
A technical deep-dive into High Bandwidth Memory, why it has become the defining spec for AI GPUs, and how to evaluate memory subsystems when comparing accelerators.
If you have been following the GPU market over the past two years, you have noticed that every product announcement leads with memory specs. NVIDIA announces "288GB HBM3e at 8 TB/s." AMD counters with "256GB HBM3e at 6.4 TB/s." The headlines focus on these numbers more than TFLOPS, and there is a good reason for that shift.
Memory has become the bottleneck for most AI workloads. Understanding why — and understanding what the memory specs actually mean for your workload — is essential for making informed GPU decisions. This article explains the technology and its practical implications.
What Is HBM and Why Does It Exist?
High Bandwidth Memory is a type of DRAM that is physically stacked in layers and connected to the GPU die via a silicon interposer. This is fundamentally different from the GDDR memory used in consumer GPUs, where memory chips sit on the PCB around the GPU and communicate over relatively narrow data buses.
The stacking approach enables two critical advantages. First, you can pack many more memory cells in a small area — HBM3e stacks 8 to 12 DRAM dies per package, reaching 24GB to 36GB per stack. Second, the connection between the memory and the GPU uses thousands of parallel wires on the interposer (a 1024-bit interface per stack), which enables bandwidth that is physically impossible with traditional PCB-mounted memory.
An H100 SXM5 has 5 active HBM3 stacks providing 80GB and 3,350 GB/s aggregate bandwidth. A B300 Ultra has 8 HBM3e stacks providing 288GB and 8,000 GB/s. Each generation of HBM increases both per-stack capacity and per-pin data rate, and GPU vendors add more stacks to further scale bandwidth and capacity.
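The aggregate numbers fall out of a simple product: stacks × interface width × per-pin data rate. A quick sketch in Python — note that the per-pin rates below are assumptions back-solved from the quoted aggregate bandwidths (GPUs often clock HBM below the standard's ceiling), not vendor-published figures:

```python
def hbm_bandwidth_gbs(stacks: int, pin_rate_gbps: float,
                      interface_bits: int = 1024) -> float:
    """Aggregate HBM bandwidth in GB/s.

    Each stack exposes a 1024-bit interface; total throughput is
    stacks * interface_bits * pin_rate / 8 bits-per-byte.
    """
    return stacks * interface_bits * pin_rate_gbps / 8

# H100 SXM5: 5 active HBM3 stacks at roughly 5.2 Gbps/pin (an assumed
# rate, below the 6.4 Gbps HBM3 ceiling) -> close to the quoted 3,350 GB/s
print(round(hbm_bandwidth_gbs(5, 5.2)))   # ~3328 GB/s
```

Running the same formula with 8 stacks at a rate near the HBM3e maximum reproduces the B300 Ultra's ~8,000 GB/s figure.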
Why Memory Bandwidth Is the Bottleneck for AI
The Arithmetic Intensity Problem
Modern AI accelerators can perform enormous numbers of floating-point operations per second — a B300 Ultra hits 2,250 TFLOPS in FP16. But each of those operations needs data to work on, and that data must be read from HBM.
The ratio of compute operations to memory accesses is called "arithmetic intensity." Matrix multiplications in neural networks have high arithmetic intensity — each element read from memory participates in many multiply-accumulate operations. This is why GPUs excel at matrix math; they can reuse data from fast on-chip caches across many compute operations.
However, not all operations in a neural network have high arithmetic intensity. The attention mechanism in transformers — which is the dominant architecture for language models, vision models, and increasingly everything else — has fundamentally lower arithmetic intensity. Computing attention requires reading the entire key-value cache, performing a relatively small amount of computation (softmax and weighted sum), and writing the result. For long sequence lengths, this operation becomes almost entirely memory-bandwidth bound.
Quantifying the Bottleneck
Consider serving a 70B parameter model in FP16 for inference. For a single request stream, every generated token requires reading the entire model's weights (140GB at 2 bytes per parameter) from HBM. How fast can you generate tokens?
On an H100 SXM5 (3,350 GB/s): 140GB / 3.35 TB/s = 41.8ms per token ≈ 24 tokens/second
On an H200 SXM (4,800 GB/s): 140GB / 4.8 TB/s = 29.2ms per token ≈ 34 tokens/second
On a B300 Ultra (8,000 GB/s): 140GB / 8.0 TB/s = 17.5ms per token ≈ 57 tokens/second
These are theoretical maximums (real throughput is lower due to attention computation, KV cache reads, and other overhead), but they illustrate the point: memory bandwidth directly determines inference speed for large models. The B300 Ultra generates tokens 2.4x faster than the H100 not because it has more TFLOPS, but because it can read model weights from memory 2.4x faster.
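The per-GPU estimates above come from one division, which makes it easy to screen any GPU spec sheet. A minimal sketch of that calculation, using the article's figures:

```python
def bandwidth_bound_tokens_per_s(weight_gb: float, bandwidth_gbs: float) -> float:
    """Upper bound on single-stream decode throughput: every token
    requires one full read of the model weights from HBM, so the
    ceiling is simply bandwidth divided by weight size."""
    return bandwidth_gbs / weight_gb

# 70B model in FP16 = 140GB of weights
for name, bw in [("H100 SXM5", 3350), ("H200 SXM", 4800), ("B300 Ultra", 8000)]:
    print(f"{name}: {bandwidth_bound_tokens_per_s(140, bw):.0f} tokens/s")
# H100 SXM5: 24 tokens/s
# H200 SXM: 34 tokens/s
# B300 Ultra: 57 tokens/s
```

Batching changes the picture — weight reads are amortized across concurrent requests — but the single-stream ceiling is the number that governs per-user latency.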
This is why I tell clients to look at GB/s per dollar, not TFLOPS per dollar, when evaluating GPUs for inference workloads.
HBM3 vs HBM3e: What Changed
HBM3e is an evolution of HBM3 with two primary improvements: higher per-pin data rate (9.6 Gbps vs 6.4 Gbps for HBM3) and higher per-stack capacity (24GB vs 16GB for 8-high stacks, or 36GB for 12-high stacks). The "e" stands for "extended" — it is not a new architecture, but a process refinement that enables faster signaling and denser stacking.
The practical impact is significant. An H100 SXM5 with HBM3 offers 80GB at 3,350 GB/s. An H200 SXM with HBM3e offers 141GB at 4,800 GB/s — 76% more capacity and 43% more bandwidth. The H200 is architecturally identical to the H100 (same Hopper GPU die), but the HBM3e upgrade alone improves inference throughput by 30-40% on bandwidth-bound workloads.
Memory Capacity vs Memory Bandwidth: Which Matters More?
They matter for different reasons, and the answer depends on your workload.
Memory capacity determines what models you can run without partitioning. A 405B model in FP4 (half a byte per parameter) requires approximately 200GB just for weights. If your GPU has less than 200GB, you must split the model across multiple GPUs (tensor parallelism), which introduces communication overhead and complexity. More memory per GPU = simpler deployment = fewer GPUs needed = lower cost.
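Sizing this is straightforward: parameters × bytes per parameter, plus headroom for KV cache and runtime buffers. A rough sketch — the 1.2x overhead factor is a rule of thumb of mine, not a vendor figure:

```python
import math

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_footprint_gb(params_billions: float, dtype: str) -> float:
    # 1e9 params * bytes/param, reported in GB (1 GB = 1e9 bytes,
    # matching how vendors quote HBM capacity)
    return params_billions * BYTES_PER_PARAM[dtype]

def min_gpus(params_billions: float, dtype: str, gpu_capacity_gb: float,
             overhead: float = 1.2) -> int:
    # overhead covers KV cache, activations, and framework buffers --
    # an assumed 1.2x rule of thumb
    needed = weight_footprint_gb(params_billions, dtype) * overhead
    return math.ceil(needed / gpu_capacity_gb)

print(weight_footprint_gb(70, "fp16"))   # 140.0 GB of weights
print(min_gpus(70, "fp16", 80))          # 80GB GPUs: 3 needed
```

The `min_gpus` estimate is exactly where per-GPU capacity pays off: the same 70B FP16 model fits on a single 288GB GPU with room to spare for KV cache.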
Memory bandwidth determines how fast you can serve those models. Once the model fits in memory, inference speed is limited by how quickly the GPU can read weights during each forward pass. Higher bandwidth = more tokens per second = lower latency = better user experience (or fewer GPUs needed for the same throughput target).
For training, both matter but in different ways. Capacity determines how much of the model state (weights, gradients, optimizer states, activations) fits on each GPU, which influences your parallelism strategy. Bandwidth affects the speed of the attention computation and gradient updates.
If I had to pick one to prioritize: for inference, prioritize bandwidth. For training, prioritize capacity (because insufficient capacity forces you into complex parallelism strategies that reduce overall efficiency).
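The capacity pressure in training comes from state that inference never carries. A sketch using the widely cited mixed-precision Adam accounting (16 bytes per parameter, before activations) — treat the breakdown as the standard rule of thumb rather than a measurement of any particular framework:

```python
def training_state_gb(params_billions: float) -> float:
    """Per-parameter state for the common mixed-precision Adam recipe:
    FP16 weights (2B) + FP16 gradients (2B) + FP32 master weights (4B)
    + Adam first and second moments (4B + 4B) = 16 bytes/param.
    Activations come on top of this."""
    return params_billions * 16

# 70B model: 1,120 GB of optimizer/weight/gradient state alone --
# far beyond any single GPU, which is why parallelism strategy is
# dictated by capacity, not bandwidth
print(training_state_gb(70))   # 1120.0
```

This is why a capacity jump from 80GB to 288GB per GPU can shrink the number of ways a training job must be sharded, which in turn reduces communication overhead.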
How to Evaluate Memory Specs When Comparing GPUs
Here is the framework I use when comparing GPU memory subsystems:
1. Bandwidth per TFLOP (GB/s per TFLOP). This ratio tells you whether the memory system can keep up with the compute units. A GPU with high TFLOPS but low bandwidth will be starved for data on bandwidth-bound workloads. The B300 Ultra at 8,000 / 2,250 = 3.6 GB/s per TFLOP. The H100 at 3,350 / 1,979 = 1.7 GB/s per TFLOP. The B300 Ultra has 2.1x better bandwidth-to-compute balance.
2. Capacity per dollar. How much memory do you get for your money? The MI325X at 256GB for ~$20,000 = 12.8 GB per $1,000. The B300 Ultra at 288GB for ~$40,000 = 7.2 GB per $1,000. AMD offers 78% more memory per dollar.
3. Bandwidth per dollar. The MI325X at 6,400 GB/s for ~$20,000 = 320 GB/s per $1,000. The B300 Ultra at 8,000 GB/s for ~$40,000 = 200 GB/s per $1,000. Again, AMD offers better bandwidth-per-dollar by 60%.
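All three ratios are easy to script once you have the specs in one place. A minimal sketch using the article's figures — the MI325X FP16 TFLOPS value is my assumption (roughly its dense FP16 rating), and the prices are the article's rough street estimates:

```python
GPUS = {
    # capacity GB, bandwidth GB/s, FP16 TFLOPS, approx. price USD
    "MI325X":     dict(cap=256, bw=6400, tflops=1300, price=20_000),
    "B300 Ultra": dict(cap=288, bw=8000, tflops=2250, price=40_000),
}

def ratios(g: dict) -> dict:
    return {
        "GB/s per TFLOP": g["bw"] / g["tflops"],     # bandwidth/compute balance
        "GB per $1k":     g["cap"] / (g["price"] / 1000),
        "GB/s per $1k":   g["bw"] / (g["price"] / 1000),
    }

for name, g in GPUS.items():
    print(name, {k: round(v, 1) for k, v in ratios(g).items()})
```

Extending the table is just a matter of adding rows, which makes it easy to re-run the comparison whenever a vendor publishes new specs or prices move.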
These ratios explain why AMD's value proposition is strong for inference farms where you are optimizing for throughput per dollar rather than peak per-GPU performance.
Compare memory specs across all data center GPUs on our GPU Comparison Tool.