H100 vs MI300X — which GPU should I choose?

H100 if you need the broadest software ecosystem (CUDA, TensorRT, vLLM). MI300X if you need maximum VRAM (192GB vs 80GB) for large model inference. MI300X offers better $/TFLOP but NVIDIA's software stack is more mature.

How much VRAM do I need for LLM inference?

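A model needs roughly 2GB of VRAM per billion parameters for FP16 weights (70B model = ~140GB VRAM). INT8 quantization halves that to ~70GB, and INT4 brings it to ~35GB. These figures cover weights only; the KV cache and activations need additional memory. A single H100 (80GB) can hold a 70B model at INT8 with little room to spare; MI300X (192GB) runs it at full FP16.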
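The rule of thumb above can be sketched as a small calculation. This is a rough estimate for weights only; the `headroom_fraction` parameter is a hypothetical allowance for KV cache and activations, not a figure from this FAQ.

```python
# Bytes per parameter for the precisions named above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billion, precision):
    """Approximate VRAM (GB) needed for the model weights alone."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion, precision, gpu_vram_gb, headroom_fraction=0.0):
    """True if weights plus an assumed headroom fit in one GPU's VRAM.

    headroom_fraction is an illustrative assumption (e.g. 0.2 reserves
    20% on top of the weights for KV cache and activations).
    """
    needed = weight_vram_gb(params_billion, precision) * (1 + headroom_fraction)
    return needed <= gpu_vram_gb


print(weight_vram_gb(70, "fp16"))   # weights for a 70B model at FP16
print(fits(70, "int8", 80))         # H100, weights only
print(fits(70, "int8", 80, 0.2))    # H100 with 20% assumed headroom
print(fits(70, "fp16", 192))        # MI300X at FP16
```

With any realistic headroom, a 70B model at INT8 stops fitting on a single 80GB card, which is why the single-H100 case is tight in practice.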

Should we buy GPUs or use cloud GPU instances?

If running GPUs 60%+ of the time, on-premise ownership wins on 3-year TCO. Below 40% utilization, cloud is more cost-effective. Many enterprises use hybrid: owned hardware for baseline, cloud for peak demand.

What is the cheapest way to rent an H100 GPU?

As of 2026, H100 cloud pricing ranges from $2.23/hr (Lambda, RunPod spot) to $4+/hr (AWS, Azure on-demand). Reserved instances and spot pricing offer 30-60% savings. CoreWeave and Lambda typically offer the lowest rates.

What GPU is best for LLM training in 2026?

NVIDIA H200 SXM (141GB HBM3e) for proven clusters, B200 for next-gen 4-5x speedup over H100, or AMD MI300X (192GB) for budget-conscious teams. For JAX workloads, Google TPU v5p pods offer unmatched scale.


Data Center · 2026-05-01 · 15 min read

GPU Power & Cooling Planning for Enterprise Data Centers in 2026

How to plan power delivery and cooling for modern GPU clusters. B300 Ultra (1000W), MI355X (1400W), and H100 (700W) power requirements, cooling options, and facility upgrade costs.

The most common reason GPU deployment projects fail is not budget or technical complexity; it is underestimating the facility requirements. Modern high-performance GPUs draw between 700W and 1,400W each. Deploying 64 of them means removing 45–90kW of GPU heat alone, before server overhead, from a space no larger than a standard server room. At the upper end of that range, conventional air cooling is not enough.

This guide is for data center managers, facilities teams, and infrastructure architects planning GPU cluster deployments in 2026. We will cover power delivery, cooling options, costs, and practical recommendations based on real deployments.

Power Requirements by GPU

GPU	TDP per GPU (W)	8-GPU Node, GPUs only (W)	8-GPU Node (kW incl. overhead)
NVIDIA H100 SXM5	700	5,600	~7.5 kW
NVIDIA H200 SXM	700	5,600	~7.5 kW
AMD MI300X	750	6,000	~8.0 kW
NVIDIA B200	1,000	8,000	~10.5 kW
NVIDIA B300 Ultra	1,000	8,000	~10.5 kW
AMD MI355X	1,400	11,200	~14.5 kW

The "overhead" covers the server chassis, CPU, memory, storage, and networking, and typically adds 30–40% on top of GPU TDP. A single 8-GPU MI355X node draws roughly the same power as 20 standard 1U servers (about 725W each). Plan accordingly.
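The node figures in the table can be approximated with a few lines of arithmetic. The sketch below assumes overhead is applied as a flat fraction of total GPU TDP; the function name and the 30%/40% bounds used in the loop are illustrative choices drawn from the range stated above.

```python
# GPU TDP values (watts) copied from the table above.
GPU_TDP_W = {
    "H100 SXM5": 700,
    "H200 SXM": 700,
    "MI300X": 750,
    "B200": 1000,
    "B300 Ultra": 1000,
    "MI355X": 1400,
}


def node_power_kw(gpu_tdp_w, gpus_per_node=8, overhead=0.30):
    """Estimated node power in kW: GPU power plus a fractional overhead."""
    gpu_power_w = gpu_tdp_w * gpus_per_node
    return gpu_power_w * (1 + overhead) / 1000


for name, tdp in GPU_TDP_W.items():
    low = node_power_kw(tdp, overhead=0.30)
    high = node_power_kw(tdp, overhead=0.40)
    print(f"{name}: {low:.1f}-{high:.1f} kW per 8-GPU node")
```

The table's rounded values sit near the low end of each range, so treat them as a floor rather than a ceiling when sizing circuits.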

Cooling Options and When to Use Each

Air Cooling (Traditional CRAC/CRAH)

Air cooling works for deployments up to approximately 15–20kW per rack in well-designed facilities. This means you can air-cool a rack of H100 or H200 nodes (two 8-GPU nodes per rack at ~7.5kW each = 15kW/rack) in a modern data center with hot aisle/cold aisle containment.

For B200, B300 Ultra, or MI355X, air cooling a full rack becomes problematic. You either limit to one 8-GPU node per rack (wasteful) or accept increased cooling costs and potential thermal throttling during warm ambient conditions.

Use air cooling when: Deploying H100/H200 nodes in an existing facility, rack density ≤ 20kW, or when liquid cooling infrastructure investment is not justified by deployment size.

Rear-Door Heat Exchangers (RDHx)

RDHx units mount to the back of a rack and use chilled water to capture heat before it enters the room. They can handle 20–40kW per rack without requiring floor modifications — just chilled water supply to each rack. Cost: $5,000–$15,000 per rack for the RDHx unit, plus chilled water piping.

RDHx is the lowest-friction upgrade path for existing air-cooled data centers that need to add GPU density. It does not require changes to the server hardware itself and works with any GPU.

Use RDHx when: Upgrading an existing air-cooled facility for higher-density GPU nodes, 20–40kW per rack targets, budget-constrained facilities that cannot do a full liquid cooling build-out.

Direct Liquid Cooling (DLC)

Direct liquid cooling mounts cold plates directly on the GPUs and other hot components and circulates coolant through them, supplied either via manifolds inside the server chassis or via facility-level CDUs (Coolant Distribution Units). DLC handles 40–100kW+ per rack and is now standard for B200, B300 Ultra, and MI355X deployments.

NVIDIA and AMD now provide DLC-capable reference server designs for their flagship GPUs. NVIDIA's HGX B200/B300 modules support DLC via integrated coolant manifolds. AMD's MI355X OAM modules similarly support direct liquid cooling.

Facility cost for DLC: $30,000–$80,000 per rack for CDU, piping, and installation. At 20 racks, this comes to $600K–$1.6M in cooling infrastructure on top of GPU hardware costs, and it grows linearly from there. Budget for this explicitly; it is almost always underestimated.

Use DLC when: Deploying B200, B300 Ultra, or MI355X at any significant scale. Targeting rack densities above 40kW. Building a new GPU-optimized facility.

Immersion Cooling

Full immersion (servers submerged in dielectric fluid) supports the highest heat densities of any option here and can achieve a PUE of 1.02–1.05, meaning nearly all input power goes to compute. Infrastructure cost is $50,000–$150,000+ per tank, and server hardware must be compatible (standard servers can be immersed, but their plastics and fans must be stripped first).

Immersion is appropriate for hyperscale deployments (1,000+ GPUs) where the PUE savings and density benefits justify the infrastructure investment. For deployments under 500 GPUs, DLC with standard CDUs provides better economics.
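The density thresholds from the four cooling sections can be collected into one rough decision helper. This is a sketch under the stated thresholds only (air up to 20kW per rack, RDHx for 20–40kW, DLC above 40kW, immersion worth evaluating at 1,000+ GPUs); the function and its return values are hypothetical, and a real decision would also weigh budget and existing facility plumbing.

```python
def recommend_cooling(rack_kw, total_gpus):
    """Return (cooling_tier, consider_immersion) for a planned deployment.

    Thresholds follow the per-rack densities quoted in this guide.
    """
    if rack_kw <= 20:
        tier = "air"
    elif rack_kw <= 40:
        tier = "rdhx"
    else:
        tier = "dlc"
    # Immersion is only flagged as worth evaluating at hyperscale.
    consider_immersion = total_gpus >= 1000
    return tier, consider_immersion


# Two H100 nodes per rack at ~7.5 kW each.
print(recommend_cooling(15.0, total_gpus=64))
# Two MI355X nodes per rack at ~14.5 kW each.
print(recommend_cooling(29.0, total_gpus=64))
# Four MI355X nodes per rack in a 2,048-GPU build.
print(recommend_cooling(58.0, total_gpus=2048))
```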

Power Delivery Infrastructure

GPU clusters require three-phase power at high amperage. Planning notes:

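PDU sizing: A 32A three-phase PDU at 208V provides ~11.5kW (√3 × 208V × 32A) before any continuous-load derating. A single MI355X node needs ~14.5kW, which is more than one such PDU can deliver, so a 16A feed is far too small. Plan for at least two 32A/208V feeds per node, or a higher-capacity circuit such as 60A at 208V (~21.6kW) or 32A at 415V (~23kW).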
UPS sizing: Size your UPS for GPU TDP + 30% overhead + any network infrastructure. A 64-GPU H100 cluster (eight nodes at ~7.5kW each) requires roughly 60–70kW of UPS capacity once networking is included.
Generator backup: For production AI infrastructure, generator backup is not optional. Size for full cluster load. Diesel generator lead times in 2025–2026 are 16–20 weeks — order early.
Power factor correction: Modern GPU servers have power factors of 0.95–0.99, but verify with your facility team. Low power factor inflates apparent power (kVA) requirements and can overload circuits.
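The PDU capacity figure in the notes above follows from the standard three-phase power formula, P = √3 × V × I × PF. The sketch below is a planning aid, not an electrical design: the `derate` parameter is an assumption (0.8 reflects the continuous-load rule common in North American practice), and A/B feed redundancy is not modeled.

```python
import math


def three_phase_kw(volts_line_to_line, amps, power_factor=1.0, derate=1.0):
    """Usable power (kW) of a three-phase circuit.

    derate is an assumed continuous-load factor; confirm the right value
    with your electrician and local code.
    """
    watts = math.sqrt(3) * volts_line_to_line * amps * power_factor * derate
    return watts / 1000


def feeds_needed(load_kw, feed_kw):
    """Minimum number of identical feeds required to carry a load."""
    return math.ceil(load_kw / feed_kw)


pdu_kw = three_phase_kw(208, 32)
print(f"32A @ 208V: {pdu_kw:.1f} kW")
print("Feeds for a 14.5 kW node:", feeds_needed(14.5, pdu_kw))

derated_kw = three_phase_kw(208, 32, derate=0.8)
print(f"32A @ 208V, 80% derated: {derated_kw:.1f} kW")
```

Running the same function for other voltage and amperage combinations is a quick way to compare circuit options before requesting quotes.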

PUE Targets and Energy Cost Modeling

Power Usage Effectiveness (PUE) is total facility power divided by IT load power; a PUE of 1.3 means every watt of IT load costs another 0.3W in cooling and other overhead. Target PUE by cooling technology:

Air cooling (modern facility): 1.3–1.5 PUE
Air cooling with hot aisle containment: 1.2–1.35 PUE
RDHx: 1.15–1.25 PUE
DLC: 1.05–1.15 PUE
Immersion: 1.02–1.05 PUE

Energy cost example for 64× H100 SXM5 cluster over 3 years at $0.10/kWh:

IT load (GPU TDP only): 64 × 700W × 8,760 hrs/yr × 3 yrs ≈ 1,177 MWh ≈ $117,700
With 1.3 PUE: total facility energy ≈ $153,100
With 1.1 PUE (DLC): total facility energy ≈ $129,500, a saving of about $23,500 over 3 years
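The following sketch reproduces the arithmetic above to within rounding, so you can substitute your own cluster size, electricity rate, and PUE. It assumes a constant load at full GPU TDP all year, which overstates real consumption for clusters that idle; the function name is illustrative.

```python
HOURS_PER_YEAR = 8760


def energy_cost_usd(it_load_kw, pue, years, usd_per_kwh):
    """Total facility energy cost for a constant IT load at a given PUE."""
    facility_kwh = it_load_kw * pue * HOURS_PER_YEAR * years
    return facility_kwh * usd_per_kwh


it_load_kw = 64 * 700 / 1000  # 64 GPUs at 700 W each, GPU TDP only

it_only = energy_cost_usd(it_load_kw, pue=1.0, years=3, usd_per_kwh=0.10)
air = energy_cost_usd(it_load_kw, pue=1.3, years=3, usd_per_kwh=0.10)
dlc = energy_cost_usd(it_load_kw, pue=1.1, years=3, usd_per_kwh=0.10)

print(f"IT load only: ${it_only:,.0f}")
print(f"PUE 1.3:      ${air:,.0f}")
print(f"PUE 1.1:      ${dlc:,.0f}")
print(f"Savings:      ${air - dlc:,.0f}")
```

To include server overhead, pass the full node power (for example 8 × 7.5kW) as `it_load_kw` instead of GPU TDP alone.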

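At larger scales (256+ GPUs), DLC can pay back its infrastructure investment within 18–30 months, but energy savings alone rarely get there at $0.10/kWh: the 64-GPU example saves under $8,000 per year against $30,000–$80,000 per rack of DLC cost. The faster payback depends on also counting the density gains (fewer racks, less floor space) and any higher local energy prices.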

Practical Checklist Before Ordering GPUs

Confirm available power capacity per rack and total facility in kW
Measure or calculate current PUE and headroom in cooling capacity
Confirm three-phase power availability at required amperage
Get quotes for cooling upgrades (RDHx or DLC) before finalizing GPU model choice
Verify that your network switches support the bandwidth required (400G InfiniBand for B300/MI355X)
Add 20% power headroom to all capacity calculations for thermal transients and future expansion
Plan for liquid cooling drain and fill procedures in your datacenter runbooks
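The capacity and headroom items in the checklist reduce to a simple comparison that is worth scripting before the facility review. This is a minimal sketch: the function names are hypothetical, and the 20% default comes from the headroom item above.

```python
def required_capacity_kw(node_count, node_kw, headroom=0.20):
    """Power capacity to provision, including fractional headroom."""
    return node_count * node_kw * (1 + headroom)


def has_capacity(available_kw, node_count, node_kw, headroom=0.20):
    """True if the facility can carry the cluster plus headroom."""
    return available_kw >= required_capacity_kw(node_count, node_kw, headroom)


# Eight H100 nodes at ~7.5 kW each (a 64-GPU cluster).
needed = required_capacity_kw(node_count=8, node_kw=7.5)
print(f"Provision at least {needed:.0f} kW")
print("Fits in 80 kW:", has_capacity(80, 8, 7.5))
print("Fits in 65 kW:", has_capacity(65, 8, 7.5))
```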

The GPU hardware is often the most straightforward part of an enterprise GPU deployment. The facility work — power, cooling, networking — determines whether the deployment succeeds on schedule and within budget. Start the facility assessment before you order the GPUs.
