GPU & AI Hardware Glossary
Plain-English definitions for every GPU and AI infrastructure term — from TFLOPS and HBM3e to ZeRO and FlashAttention.
BF16
Precision: Brain Float 16 — the standard training precision for large language models, with the same range as FP32.
BF16 (bfloat16) is a 16-bit floating-point format designed by Google Brain. It has the same 8-bit exponent as FP32 (giving the same dynamic range) but only 7 mantissa bits (vs 23 in FP32). For neural network training, dynamic range matters more than precision — gradients span many orders of magnitude, and overflow/underflow causes training instability. BF16 avoids this problem while cutting memory in half vs FP32. Modern LLM training runs almost entirely in BF16, with FP32 used only for optimizer state in mixed-precision setups.
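Because BF16 keeps FP32's 8-bit exponent and simply drops the low 16 mantissa bits, an FP32 value can be converted by bit truncation. A minimal stdlib sketch of that relationship:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate an FP32 value to BF16 by keeping its top 16 bits.

    BF16 is the upper half of the IEEE-754 FP32 layout (1 sign bit,
    8 exponent bits, 7 of FP32's 23 mantissa bits), which is why the
    two formats share the same dynamic range.
    """
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16  # drop the low 16 mantissa bits

def bf16_bits_to_fp32(bits16: int) -> float:
    """Re-expand BF16 bits to FP32 by zero-padding the mantissa."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

# 1.0 survives exactly; pi loses mantissa precision but keeps its scale
assert bf16_bits_to_fp32(fp32_to_bf16_bits(1.0)) == 1.0
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))  # → 3.140625
```

(Production code rounds rather than truncates, but the bit layout is the same.)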
CUDA
Software: NVIDIA's parallel computing platform and programming model for GPU workloads.
CUDA (Compute Unified Device Architecture) is both a hardware architecture and a software development platform from NVIDIA. It allows developers to write programs that run on NVIDIA GPUs using C/C++ extensions. The CUDA ecosystem includes the runtime, compiler (nvcc), and a vast library of hand-optimized routines: cuBLAS (matrix math), cuDNN (neural network primitives), TensorRT (inference optimization), and NCCL (distributed communications). The depth and maturity of this ecosystem — built over 15+ years — is NVIDIA's primary competitive advantage over AMD in AI.
FP8
Precision: 8-bit floating-point format for AI training and inference — half the size of FP16, with minimal quality loss.
FP8 is an 8-bit floating-point number format introduced with NVIDIA Hopper (H100). There are two variants: E4M3 (4 exponent bits, 3 mantissa) for activations and E5M2 (5 exponent bits, 2 mantissa) for gradients. FP8 requires half the memory of FP16 and doubles Tensor Core throughput. The H100 delivers 1,979 FP8 TFLOPS vs 989 FP16 TFLOPS. The tradeoff is reduced numerical precision, which requires careful scaling (handled automatically by the Transformer Engine) to avoid training instability. For inference, FP8 quantization delivers near-FP16 quality at half the memory footprint.
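The two variants trade mantissa bits for exponent bits, which shows up directly in their largest representable values. A sketch computing those maxima from the bit widths, following the OCP FP8 convention that E5M2 reserves its top exponent for inf/NaN while E4M3 reserves only the all-ones mantissa pattern:

```python
def fp8_max_normal(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a miniature float format.

    ieee_like=True  (E5M2): the whole top exponent code is reserved for
                            inf/NaN, as in IEEE-754 half precision.
    ieee_like=False (E4M3): per the OCP FP8 spec, only the all-ones
                            mantissa at the top exponent encodes NaN,
                            trading inf away for extra range.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias            # top code reserved
        max_frac = (2 ** man_bits - 1) / 2 ** man_bits
    else:
        max_exp = (2 ** exp_bits - 1) - bias            # top code usable
        max_frac = (2 ** man_bits - 2) / 2 ** man_bits  # all-ones mantissa = NaN
    return (1 + max_frac) * 2.0 ** max_exp

print(fp8_max_normal(4, 3, ieee_like=False))  # E4M3 → 448.0
print(fp8_max_normal(5, 2, ieee_like=True))   # E5M2 → 57344.0
```

E5M2's much larger range (±57,344 vs ±448) is why it suits gradients, which can spike, while E4M3's extra mantissa bit suits activations.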
FP16
Precision: 16-bit floating-point — standard format for inference, with smaller range than BF16 but higher precision.
FP16 (half-precision) uses 5 exponent bits and 10 mantissa bits. It has less dynamic range than BF16, which can cause overflow during training without careful loss scaling. For inference, FP16 is the standard baseline precision that preserves model quality well. The rule of thumb: FP16 inference requires 2GB of VRAM per billion parameters. Compared to BF16, FP16 has higher numerical precision but is more prone to training instability, which is why BF16 has largely replaced FP16 for training workloads.
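The 2GB-per-billion-parameters rule of thumb is easy to apply; the sketch below adds a ~20% headroom factor for KV cache and activations, which is an assumption on our part rather than part of the rule:

```python
def inference_vram_gb(params_billion: float, bytes_per_param: float,
                      overhead: float = 1.2) -> float:
    """Rough serving-VRAM estimate: weight bytes plus ~20% headroom for
    KV cache and activations (the overhead factor is an assumption)."""
    return params_billion * bytes_per_param * overhead

print(inference_vram_gb(70, 2))  # 70B at FP16: ~168 GB including headroom
```

Weights alone are 140 GB (70 × 2), which is why 70B FP16 inference needs two 80GB GPUs.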
FlashAttention
Software: An algorithm that computes attention faster and with less memory by fusing operations and using tiling.
FlashAttention (Tri Dao et al., 2022) rewrites the attention computation to be IO-aware — it avoids reading and writing the large N×N attention matrix to HBM by computing attention in tiles that fit in GPU shared memory (SRAM). The result is 2-4× faster attention with memory that scales linearly (rather than quadratically) with sequence length. FlashAttention-2 and FlashAttention-3 (targeting Hopper's hardware) are now standard in virtually all LLM training and inference frameworks. Without FlashAttention, training with sequences longer than 2K tokens is impractically slow and memory-intensive.
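To see why materializing the N×N matrix is the problem, compute its size for a 32-head model (batch 1, one layer, FP16 — illustrative numbers, not any specific model):

```python
def naive_attn_matrix_gib(seq_len: int, n_heads: int, bytes_per_el: int = 2) -> float:
    """Memory for the full N x N attention score matrix across all heads
    (one layer, batch size 1). FlashAttention never writes this matrix to
    HBM; it streams tiles through on-chip SRAM, so its extra memory is
    O(N) rather than O(N^2)."""
    return seq_len ** 2 * n_heads * bytes_per_el / 2 ** 30

print(naive_attn_matrix_gib(2_048, 32))   # → 0.25 GiB
print(naive_attn_matrix_gib(32_768, 32))  # → 64.0 GiB: quadratic blow-up
```

A 16× longer sequence costs 256× the memory — hence long-context training being impractical without tiling.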
Gradient Checkpointing
Training: A memory-saving technique that recomputes activations during the backward pass instead of storing them.
During neural network training, the forward pass generates intermediate activations that must be stored for use during the backward pass. For large models, these activations can consume more memory than the model weights themselves. Gradient checkpointing (activation recomputation) discards most activations after the forward pass and recomputes them on-demand during backpropagation. This reduces activation memory by 4-10× at the cost of ~33% more compute (each activation is computed twice). It is almost universally used for training models larger than 7B parameters.
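The memory/compute tradeoff can be estimated with simple arithmetic. This sketch uses the common rule of thumb that the backward pass costs roughly 2× the forward pass, so one extra recompute forward adds about a third:

```python
def checkpoint_tradeoff(n_layers: int, every: int) -> tuple[float, float]:
    """Idealized gradient-checkpointing tradeoff (a back-of-envelope
    sketch, not a profiler): keep one activation every `every` layers
    and recompute the rest during backprop."""
    stored = n_layers // every
    memory_reduction = n_layers / stored       # vs. storing every activation
    # fwd (1x) + recompute-fwd (~1x) + bwd (~2x), over fwd (1x) + bwd (~2x)
    compute_overhead = (1 + 1 + 2) / (1 + 2)
    return memory_reduction, compute_overhead

mem, comp = checkpoint_tradeoff(n_layers=80, every=10)
print(mem, round(comp, 2))  # → 10.0 1.33
```

Checkpointing every 10th layer of an 80-layer model cuts activation memory ~10× for ~33% extra compute, matching the range quoted above.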
HBM3e
Memory: High Bandwidth Memory 3e — the fastest GPU memory standard used in H200, MI300X, and B200.
HBM (High Bandwidth Memory) is a type of DRAM stacked directly on the same silicon package as the GPU die, connected via a very wide memory bus. HBM3e is an enhanced version of the HBM3 standard: it offers up to 9.2 Gbps per pin and delivers aggregate memory bandwidths of 4,800–8,000 GB/s in current data center GPUs. Compare this to GDDR6X (consumer GPUs), which maxes out around 1,000 GB/s. The bandwidth difference is critical for AI workloads because transformer models are largely memory-bandwidth-bound during inference.
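Aggregate bandwidth is just pins × per-pin data rate. A sketch using an assumed H200-style configuration (6 HBM3e stacks with a 1024-bit bus each — the stack count and bus width here are illustrative assumptions):

```python
def hbm_bandwidth_gbs(stacks: int, bus_bits_per_stack: int, gbps_per_pin: float) -> float:
    """Aggregate HBM bandwidth in GB/s: total pins x per-pin rate / 8 bits."""
    return stacks * bus_bits_per_stack * gbps_per_pin / 8

# 6 stacks x 1024-bit bus at the 9.2 Gbps/pin HBM3e ceiling
print(hbm_bandwidth_gbs(6, 1024, 9.2))  # → ~7,066 GB/s theoretical
```

Shipping parts typically run the pins below the standard's ceiling — the H200's 4,800 GB/s corresponds to roughly 6.25 Gbps per pin on this configuration.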
InfiniBand
Interconnect: High-speed network interconnect for connecting multiple GPU nodes in a cluster.
InfiniBand is the dominant networking fabric for large GPU clusters. While NVLink connects GPUs within the same server node, InfiniBand connects nodes together. NDR InfiniBand (Next Data Rate) provides 400 Gbps per port — 4× the bandwidth of 100GbE Ethernet. HDR provides 200 Gbps. For large-scale distributed training (32+ nodes), InfiniBand bandwidth and latency directly affect how efficiently GPUs can synchronize gradients. NVIDIA's NCCL library is optimized for InfiniBand topologies.
INT8
Precision: 8-bit integer quantization for inference — halves memory vs FP16 with 1-3% quality tradeoff.
INT8 quantization represents model weights as 8-bit integers, reducing memory by 2× vs FP16. This is the most common production quantization level — it cuts VRAM requirements in half while delivering acceptable quality for most tasks. A 70B model that requires 140GB at FP16 fits in 70GB at INT8, enabling deployment on a single H100 SXM5. Libraries like bitsandbytes (LLM.int8) and NVIDIA's TensorRT handle INT8 quantization automatically. The quality impact is model-dependent but typically 1-3% on standard benchmarks.
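The core idea is mapping each weight to the nearest of 255 evenly spaced integer levels. A minimal symmetric per-tensor sketch — real libraries like bitsandbytes use finer-grained, outlier-aware schemes, so treat this as an illustration only:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor INT8 quantization: one FP scale factor maps
    the largest-magnitude weight to +/-127."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q: list[int], scale: float) -> list[float]:
    """Recover approximate FP weights: each value is off by at most scale/2."""
    return [qi * scale for qi in q]

w = [0.02, -1.27, 0.64, 0.005]
q, s = quantize_int8(w)
print(q)                      # small integers in [-127, 127]
print(dequantize_int8(q, s))  # close to the originals, within one scale step
```

The rounding error (at most half a quantization step per weight) is the source of the 1-3% benchmark degradation quoted above.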
KV Cache
Inference: Memory used to store attention key and value states during LLM token generation.
During autoregressive LLM inference, every generated token attends to all previous tokens. The KV cache stores the key and value tensors for all previous tokens so they do not need to be recomputed. KV cache memory scales as: 2 × num_layers × num_kv_heads × head_dim × sequence_length × batch_size × precision_bytes (num_kv_heads equals num_heads for standard multi-head attention, but is smaller under grouped-query attention). For a Llama 3 70B model (80 layers, 8 KV heads, head dimension 128) generating 4K tokens with batch size 32 in FP16, the KV cache consumes approximately 40GB — half the GPU's memory budget on an 80GB H100. Managing KV cache efficiently (paged attention, flash attention) is one of the primary challenges in LLM serving.
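A sketch of the scaling formula with Llama 3 70B's grouped-query attention configuration (80 layers, 8 KV heads, head dimension 128 — note it is the KV-head count, not the 64 query heads, that enters the cache size):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_el: int = 2) -> int:
    """KV cache size: 2 (one K and one V tensor) x layers x kv_heads
    x head_dim x cached tokens x bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el

gib = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=32) / 2**30
print(gib)  # → 40.0 GiB in FP16 — half of an H100's 80 GB
```

Without grouped-query attention (64 KV heads instead of 8), the same workload would need 320 GiB — which is why modern architectures shrink the KV-head count.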
LoRA
Fine-tuning: Low-Rank Adaptation — an efficient fine-tuning method that trains small adapter matrices instead of all parameters.
LoRA (Hu et al., 2021) is a parameter-efficient fine-tuning method that freezes the pretrained model weights and injects trainable low-rank decomposition matrices into each transformer layer. Instead of training all 70 billion parameters of a 70B model, LoRA trains only the adapter matrices (typically 0.1-1% of parameters). This reduces training memory by 4-8× and allows fine-tuning very large models on limited GPU memory. QLoRA extends this by also quantizing the frozen base model to 4-bit, enabling 70B fine-tuning on a single 80GB GPU.
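The parameter savings follow directly from the low-rank shapes: a frozen d×d weight W gains a trainable update B·A with A of shape r×d and B of shape d×r. A sketch with illustrative dimensions (d_model 8192 is typical of 70B-class models; rank 16 is a common choice, both assumptions here):

```python
def lora_fraction(d_model: int, rank: int) -> float:
    """Trainable fraction for one d x d weight adapted as W + B @ A:
    the adapters hold 2*d*r parameters vs d*d frozen ones."""
    return (2 * d_model * rank) / (d_model * d_model)

print(f"{lora_fraction(8192, 16):.2%}")  # → 0.39% of that matrix's parameters
```

Doubling the rank doubles the trainable fraction, which is the knob that trades adapter capacity against memory.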
Memory Bandwidth
Memory: How fast data moves between GPU memory and compute cores, measured in GB/s.
Memory bandwidth determines how quickly the GPU can read model weights and write results. For LLM inference, bandwidth is often more important than raw TFLOPS, because token generation requires reading all model weights from memory for each token. A GPU with 2× the bandwidth can generate tokens 2× faster for memory-bound workloads, even if compute TFLOPS are identical. The H100 SXM5 has 3,350 GB/s; the MI300X has 5,300 GB/s; the B200 has 8,000 GB/s.
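This gives a simple roofline ceiling for single-stream decoding: every generated token must stream all the weights from HBM once, so tokens/s can never exceed bandwidth divided by weight bytes. A sketch (ignoring KV-cache reads and batching, which change the picture):

```python
def max_tokens_per_sec(weights_gb: float, bandwidth_gbs: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed:
    tokens/s <= memory bandwidth / bytes of weights read per token."""
    return bandwidth_gbs / weights_gb

# A 70B model in FP16 is 140 GB of weights
print(round(max_tokens_per_sec(140, 3350), 1))  # H100 SXM ceiling: ~23.9 tok/s
print(round(max_tokens_per_sec(140, 8000), 1))  # B200 ceiling:     ~57.1 tok/s
```

Real serving gets far more aggregate throughput by batching many requests, amortizing each weight read across the whole batch.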
MFU
Performance: Model FLOPS Utilization — the fraction of theoretical GPU TFLOPS actually used by a workload.
MFU measures how efficiently a workload uses available GPU compute. A GPU rated at 1,000 TFLOPS running at 40% MFU is effectively delivering 400 TFLOPS of useful work. MFU below 50% is common for LLM training; elite clusters achieve 55-65%. The gap between theoretical and actual is caused by memory bandwidth bottlenecks, inter-GPU communication overhead, kernel launch latency, and suboptimal batch sizes. MFU is the best single metric for evaluating training efficiency — higher MFU means you are getting more value from your GPU investment.
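MFU is commonly computed from the standard approximation of ~6 FLOPs per parameter per training token (forward plus backward, ignoring attention FLOPs). A sketch with assumed throughput numbers, not measurements:

```python
def mfu(params_billion: float, tokens_per_sec: float, peak_tflops: float) -> float:
    """MFU via the ~6 FLOPs/param/token rule for transformer training."""
    achieved_tflops = 6 * params_billion * 1e9 * tokens_per_sec / 1e12
    return achieved_tflops / peak_tflops

# 70B model at 1,300 tokens/s/GPU on a 989-TFLOPS part (illustrative numbers)
print(f"{mfu(70, 1_300, 989):.1%}")  # → 55.2%
```

Inverting the formula is equally useful: hitting 55% MFU on that GPU requires sustaining ~546 useful TFLOPS.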
Model Parallelism
Distributed Training: Splitting a model across multiple GPUs when it is too large to fit in one GPU's memory.
Model parallelism distributes model parameters across GPUs to handle models larger than a single GPU's memory. There are several types: tensor parallelism (splitting individual layers across GPUs, used within a node with NVLink), pipeline parallelism (assigning different layers to different GPUs, used across nodes), and sequence parallelism (splitting the sequence dimension of activations). Most large-scale LLM training uses a combination — typically 8-way tensor parallelism within a node and pipeline parallelism across nodes.
NVLink
Interconnect: NVIDIA's proprietary high-speed GPU-to-GPU interconnect, enabling fast data sharing within a node.
NVLink connects multiple GPUs in the same server node at bandwidths far exceeding PCIe. NVLink 4 (Hopper H100) provides 900 GB/s bidirectional bandwidth per GPU; fifth-generation NVLink (Blackwell B200/B300) provides 1,800 GB/s. This matters for distributed training because all-reduce gradient synchronization requires moving large tensors between GPUs — higher bandwidth means faster gradient sync and higher scaling efficiency. PCIe 5.0 x16, by contrast, provides only 128 GB/s bidirectional. GPUs in the same NVLink fabric can share memory directly, enabling larger effective memory pools.
NVSwitch
Hardware: NVIDIA's switch chip that provides all-to-all NVLink connectivity between all GPUs in a node.
NVSwitch is a dedicated switching chip that creates a non-blocking all-to-all NVLink fabric between all GPUs in an HGX/DGX node. Without NVSwitch, each GPU can only directly connect to a limited number of peers. With NVSwitch, any GPU can communicate with any other GPU at full NVLink bandwidth, dramatically improving collective operation efficiency (all-reduce, all-gather). An 8-GPU H100 HGX node uses 4× NVSwitch chips to provide 900 GB/s bandwidth from each GPU to any other GPU simultaneously.
NCCL
Software: NVIDIA Collective Communications Library — handles all-reduce and other GPU-to-GPU operations in distributed training.
NCCL (NVIDIA Collective Communications Library) implements collective operations — all-reduce, all-gather, broadcast, reduce-scatter — optimized for NVIDIA GPU topologies. During data-parallel training, gradient synchronization is implemented as an all-reduce operation across all GPUs; NCCL orchestrates this communication using the fastest available interconnect (NVLink within a node, InfiniBand across nodes). PyTorch's distributed training backend (torch.distributed) uses NCCL by default for NVIDIA GPUs. AMD's equivalent is RCCL (ROCm Collective Communications Library).
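The cost of an all-reduce is easy to estimate for the classic ring algorithm (one of the algorithms NCCL can choose): each GPU sends 2·(N−1)/N of the tensor size, split between a reduce-scatter pass and an all-gather pass. A sketch:

```python
def ring_allreduce_gb_per_gpu(tensor_gb: float, n_gpus: int) -> float:
    """Bytes each GPU transmits in a ring all-reduce: a reduce-scatter
    pass plus an all-gather pass, each moving (N-1)/N of the tensor."""
    return 2 * (n_gpus - 1) / n_gpus * tensor_gb

# Gradients for a 70B model in BF16 = 140 GB, synced across 8 GPUs
traffic = ring_allreduce_gb_per_gpu(140, 8)
print(traffic)                    # → 245.0 GB sent per GPU
print(round(traffic / 900, 2))   # ~0.27 s lower bound at NVLink's 900 GB/s
```

Note the per-GPU traffic is nearly independent of N, which is why ring all-reduce scales well — but also why interconnect bandwidth, not GPU count, bounds sync time.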
PCIe
Interconnect: PCI Express — the standard system bus connecting GPUs to the host CPU, used in PCIe GPU variants.
PCIe (Peripheral Component Interconnect Express) is the universal expansion bus standard used in servers. PCIe 4.0 × 16 provides 64 GB/s bidirectional bandwidth; PCIe 5.0 doubles this to 128 GB/s. For GPU-to-CPU data transfers (loading data, checkpointing), PCIe bandwidth is often the bottleneck. For GPU-to-GPU communication in multi-GPU servers, PCIe is much slower than NVLink — a key reason why PCIe GPUs scale less efficiently in distributed training than SXM form factor GPUs with NVLink.
Quantization
Software: Reducing the numerical precision of model weights to decrease memory usage and increase throughput.
Quantization converts high-precision weights (FP16, BF16) to lower-precision representations (INT8, INT4, FP8) to reduce memory footprint and improve inference speed. Popular approaches include post-training quantization (PTQ) methods like GPTQ and AWQ, which analyze the weight distribution to minimize accuracy loss, and quantization-aware training (QAT). INT4 quantization (GPTQ, AWQ) reduces memory by 4× vs FP16 with 3-8% quality degradation depending on the model and task. For inference deployment, quantization is often the primary lever for fitting larger models on available hardware.
ROCm
Software: AMD's open-source GPU computing platform, the primary alternative to CUDA for AI workloads.
ROCm (Radeon Open Compute) is AMD's answer to CUDA — an open-source platform for GPU computing that supports AI frameworks like PyTorch and JAX on AMD hardware. ROCm includes HIP (a C++ runtime API and kernel language that closely mirrors CUDA, making most ports largely mechanical), rocBLAS (matrix operations), MIOpen (neural network primitives), and RCCL (collective communications). ROCm 6.x in 2026 achieves near-parity for standard PyTorch training workloads but still lags on inference optimization (no TensorRT equivalent) and custom kernel support. See our CUDA vs ROCm comparison for a full breakdown.
Scaling Efficiency
Performance: How efficiently training throughput increases as you add more GPUs.
Perfect scaling efficiency (100%) means 8 GPUs train exactly 8× faster than 1 GPU. In practice, communication overhead (gradient synchronization, all-reduce operations) reduces this. A well-tuned 8-GPU H100 node achieves 87-94% scaling efficiency for LLM training. Scaling efficiency degrades as you add more nodes, because cross-node InfiniBand bandwidth is lower than intra-node NVLink bandwidth. At 64-node scale (512 GPUs), efficient setups achieve 75-85% scaling, meaning 512 physical GPUs deliver the equivalent compute of roughly 384-435 ideal GPUs.
SXM
Hardware: Server PCI Express Module — NVIDIA's high-power form factor for data center GPUs with NVLink support.
SXM is a GPU form factor designed for dense server deployments. SXM GPUs mount directly to a baseboard (HGX/DGX/MGX) rather than plugging into a standard PCIe slot. This enables higher power delivery (700W+ vs ~350W for PCIe cards) and direct NVLink connections to other GPUs on the same baseboard. SXM form factor H100 GPUs deliver significantly higher performance than H100 PCIe — higher TDP, more memory bandwidth, and NVLink vs PCIe for multi-GPU communication. Most serious AI training infrastructure uses SXM GPUs.
TFLOPS
Performance: Tera floating-point operations per second — the primary measure of GPU compute throughput.
One TFLOPS equals one trillion floating-point operations per second. GPU TFLOPS ratings are quoted at a specific precision (FP32, FP16, BF16, FP8). Higher-precision formats like FP32 have lower TFLOPS ratings than lower-precision FP8, because each higher-precision operation requires more silicon area. A GPU rated at 989 FP16 TFLOPS (NVIDIA H100) can perform 989 trillion FP16 multiply-add operations per second under ideal conditions. Real-world throughput (measured as Model FLOPS Utilization, or MFU) is always lower than the theoretical peak.
Tensor Core
Hardware: Specialized compute units in NVIDIA GPUs designed for matrix multiply-accumulate operations.
Tensor Cores, introduced with NVIDIA Volta (2017), are dedicated hardware units that perform matrix multiply-accumulate (MMA) operations at significantly higher throughput than standard CUDA cores. Modern Tensor Cores (Hopper, Blackwell generation) support multiple precision formats including FP64, TF32, FP16, BF16, FP8, and INT8. The TFLOPS ratings advertised for AI workloads (e.g., 989 FP16 TFLOPS for H100) refer specifically to Tensor Core throughput. AMD's equivalent is the Matrix Core unit in their CDNA architecture.
Transformer Engine
Hardware: NVIDIA hardware and software for accelerating transformer model training using FP8 precision.
The Transformer Engine, introduced with the H100, enables automatic FP8 mixed-precision training for transformer architectures. It dynamically adjusts scaling factors between FP8 and higher-precision formats to maintain training stability. In practice, FP8 training via the Transformer Engine delivers 1.5–2× throughput improvement over BF16 training with minimal accuracy degradation. However, using the Transformer Engine requires either NVIDIA's model libraries (NeMo, Megatron-LM) or custom FP8 training code — it is not automatic with standard PyTorch. Blackwell's B300 expands this to FP4 for even higher throughput.
TDP
Hardware: Thermal Design Power — the maximum sustained power draw of a GPU, measured in watts.
TDP defines how much power a GPU consumes under sustained load and, consequently, how much heat it generates. Data center GPUs range from 300W (older/smaller GPUs) to 1,000W (NVIDIA B300 Ultra). TDP determines cooling requirements (air vs liquid), power supply sizing, and facility power costs. An H100 SXM5 at 700W running 24/7 for a year consumes ~6,132 kWh, costing ~$550/year at $0.09/kWh — before system overhead. For large clusters, power cost is often the second largest TCO component after hardware.
TCO
Business: Total Cost of Ownership — the full 3-year cost of a GPU deployment including hardware, power, and operations.
TCO analysis accounts for every cost over a multi-year deployment horizon: hardware purchase, server infrastructure, networking, power consumption, cooling, colocation or data center space, and personnel. For GPU infrastructure, power is often underestimated — a 64-GPU H100 cluster running at 70% utilization costs approximately $340,000 in power over 3 years at standard colocation rates. TCO analysis frequently reveals that a more expensive GPU with better performance-per-watt delivers lower 3-year TCO than a cheaper but less efficient option.
vLLM
Software: An open-source high-throughput LLM inference serving library using paged attention.
vLLM (Virtual LLM) is a popular open-source framework for serving large language models efficiently. Its key innovation is PagedAttention — a technique borrowed from OS virtual memory management that stores KV cache in non-contiguous memory pages, dramatically reducing memory fragmentation. vLLM achieves 2-24× higher throughput than naive HuggingFace inference by batching requests and managing KV cache efficiently. It is the most widely used LLM serving framework as of 2026 and supports both CUDA (NVIDIA) and ROCm (AMD) backends.
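The PagedAttention idea can be illustrated with a toy block allocator (a teaching sketch, not vLLM's actual implementation): KV cache is handed out in fixed-size blocks from a shared pool, so sequences of different lengths never fragment one large contiguous region, and finished sequences return their blocks immediately:

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: each sequence owns a 'block
    table' mapping its logical token positions to physical cache blocks."""

    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens cached so far

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_tokens == 0:        # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:   # sequence finished: recycle blocks
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token(seq_id=0)              # 40 tokens -> ceil(40/16) = 3 blocks
print(len(cache.tables[0]), len(cache.free))  # → 3 61
```

Contrast this with contiguous allocation, which must reserve space for each sequence's maximum possible length up front — the fragmentation PagedAttention eliminates.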
ZeRO Optimizer
Distributed Training: DeepSpeed's memory-efficient distributed training strategy that shards optimizer state across GPUs.
ZeRO (Zero Redundancy Optimizer) is a memory optimization strategy from Microsoft DeepSpeed that partitions the optimizer state, gradients, and model parameters across all GPUs in a data-parallel group. Stage 1 shards optimizer state (up to 4× memory reduction for mixed-precision Adam), Stage 2 adds gradient sharding (up to 8× total), Stage 3 adds parameter sharding (per-GPU memory then scales inversely with GPU count). ZeRO makes it possible to train very large models across many GPUs without each GPU needing to hold the full optimizer state. A ZeRO-3 configuration on 64 GPUs can hold model states roughly 64× larger than fit on a single GPU.
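The savings follow from the standard per-parameter accounting for mixed-precision Adam — 2 bytes of BF16 weights, 2 bytes of BF16 gradients, and 12 bytes of FP32 state (master weights, momentum, variance) — with each ZeRO stage sharding one more component. A sketch under that accounting:

```python
def zero_bytes_per_param_per_gpu(stage: int, n_gpus: int) -> float:
    """Per-GPU bytes per parameter for mixed-precision Adam under ZeRO.
    Accounting: 2 B BF16 weights + 2 B BF16 grads + 12 B FP32 optimizer
    state (master copy, momentum, variance)."""
    weights, grads, optim = 2, 2, 12
    if stage >= 1:
        optim /= n_gpus    # Stage 1: shard optimizer state
    if stage >= 2:
        grads /= n_gpus    # Stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus  # Stage 3: also shard parameters
    return weights + grads + optim

# Model-state memory per GPU for a 70B model across 64 GPUs
for stage in (0, 1, 2, 3):
    gb = 70e9 * zero_bytes_per_param_per_gpu(stage, n_gpus=64) / 1e9
    print(f"stage {stage}: {gb:.0f} GB per GPU")
```

Stage 0 (plain data parallelism) needs 1,120 GB per GPU — impossible — while Stage 3 needs only ~18 GB, which is how 70B training fits on 80GB parts (activations and buffers come on top of this).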
Ready to Compare GPUs?
Put these terms into practice — compare real GPU specs, benchmark data, and cloud pricing.