GPU vs. CPU vs. TPU: The Hardware Decision That Drives Everything Else

The Question

Walk into any enterprise AI planning meeting and the hardware conversation follows a predictable script: someone says "we need H100s," the procurement team gets a quote, and the discussion moves to budget. The reasoning behind the selection is rarely deeper than brand familiarity — NVIDIA is the name associated with AI, H100 is NVIDIA's current flagship, therefore the decision is made.

This is not entirely wrong. NVIDIA's H100 is genuinely the best general-purpose AI accelerator available for most large-scale training workloads. But it is not always the right choice, and for a growing number of enterprise use cases — particularly inference-heavy production deployments, cost-sensitive applications, or workloads running inside specific cloud ecosystems — it is not even in the running.

The hardware decision has downstream consequences that persist for 2–3 years: it determines your cost structure, your cloud portability, your software stack compatibility, your vendor support relationships, and the ceiling on performance for every AI application you build on top of it. Making it based on brand recognition rather than workload analysis is one of the most expensive mistakes an enterprise AI program can make.

The available options have also expanded substantially. AMD's MI300X has reached competitive parity with the H100 for inference workloads. AWS Trainium and Inferentia offer compelling economics within the AWS ecosystem. Google's TPU v5p and v5e are genuinely excellent for training on Google Cloud. Apple Silicon's M3 Ultra has emerged as a capable option for on-device and edge inference. The market is no longer a single-vendor decision.

Hardware selection locks in your cost structure, your cloud portability, and your performance ceiling for 2–3 years — and most enterprises make the decision based on NVIDIA brand familiarity rather than workload analysis.

Why This Matters Now

In October 2024, a well-documented case emerged when a mid-size technology company publicly shared its infrastructure cost analysis after migrating a portion of its inference workload from NVIDIA H100 instances on AWS to AMD MI300X instances through a colocation provider. The numbers were stark: inference cost per million tokens dropped by 41%, with latency characteristics that were equivalent or marginally better for their specific model size (a fine-tuned 34B parameter model). The migration required six weeks of engineering time to validate software compatibility and retune serving configurations.

The case attracted attention because it was one of the first publicly documented enterprise migrations demonstrating real-world MI300X economics at scale. AMD's competitive position in AI accelerators had been discussed in analyst circles since the MI300X launch in late 2023, but documented production deployments with cost data were sparse. That changed through 2025 as AMD secured significant deployments at cloud providers and hyperscalers.

Simultaneously, the NVIDIA B200 launched in 2024 with specifications that reshaped the inference economics calculus at the top end: 192GB HBM3e memory (versus 80GB on the H100), 4.5x H100 inference throughput in NVIDIA's own benchmarks, and substantially higher memory bandwidth — critical for serving large models where memory bandwidth, not raw compute, is the binding constraint. Early B200 deployments through 2025 showed particularly strong results for enterprises running 70B+ parameter models in production.

The combination of genuine AMD competition, new NVIDIA generations, and expanding custom silicon options means that the hardware evaluation process that made sense in 2022 — evaluate NVIDIA, pick H100 — is no longer sufficient for well-run enterprise AI programs.

What the CURVE™ Data Shows

The 2026 Stackcurve AI Infrastructure CURVE™ Report evaluated hardware accelerators across five dimensions: raw performance for training and inference workloads, memory capacity and bandwidth, software ecosystem maturity, total cost of ownership at enterprise scale, and multi-cloud/portability characteristics.

NVIDIA retains leadership in training workloads and software ecosystem depth. The CUDA software stack, the breadth of frameworks with first-class NVIDIA GPU support (PyTorch, JAX, TensorFlow, Triton Inference Server, TensorRT-LLM, vLLM), and NVIDIA's NIM microservices catalog represent a software moat that AMD, Google, and AWS have not yet matched. For organizations running diverse AI workloads — training, fine-tuning, and inference across multiple model architectures — NVIDIA's ecosystem breadth remains a genuine procurement advantage.

AMD's MI300X has achieved a strong second position specifically in inference-optimized deployments. Its 192GB HBM3 memory capacity (matching the B200 on capacity, if not bandwidth) allows the full 16-bit weights of a 70B-class model to reside on a single GPU, which avoids splitting the model across devices and the inter-GPU communication that comes with it. ROCm software maturity has improved substantially, and major open-source inference serving frameworks such as vLLM and SGLang now support MI300X with production-quality stability. TensorRT-LLM remains NVIDIA-only and has no ROCm equivalent.

Google's TPU v5p and v5e score exceptionally for training workloads within the Google Cloud ecosystem, with the important caveat that TPU portability is very limited: TPUs are available only on Google Cloud, and workloads tuned for them (typically written in JAX, TensorFlow, or PyTorch/XLA) usually need significant re-engineering and retuning to reach comparable performance on other accelerator types.

AWS Trainium 2 and Inferentia 2 score well specifically on cost efficiency within AWS-committed deployments. The performance-per-dollar metrics are compelling for high-volume inference and fine-tuning within AWS, but the software ecosystem is narrower than NVIDIA's or AMD's.

The full vendor rankings are in the 2026 Stackcurve AI Infrastructure CURVE™ Report — free to download.

The Gap Most Buyers Miss

The hardware decision is almost always made as a single question — "which accelerator?" — when it should be made as a workload matrix. Different AI workload types have different hardware requirements, and most enterprise AI programs run multiple workload types simultaneously.

Training large models from scratch

This is the most resource-intensive workload type and the one for which NVIDIA's position is most defensible. Large-scale training requires maximum memory bandwidth, fast GPU-to-GPU interconnects (NVLink within nodes, NDR InfiniBand or high-speed Ethernet between nodes), and mature distributed training framework support. NVIDIA H100 SXM (80GB HBM3, 3.35TB/s memory bandwidth, NVLink 4.0 at 900GB/s bidirectional) and B200 (192GB HBM3e, 8TB/s memory bandwidth, NVLink 5.0) are the reference hardware for this workload. Google TPU v5p pods are competitive for JAX-native training at very large scale.

Fine-tuning pretrained models

Fine-tuning a 70B parameter model in 16-bit precision (BF16 or FP16) requires approximately 140GB of GPU memory for model weights alone. Gradients and optimizer state add to that: a standard AdamW training run typically needs 420–560GB of total GPU memory before activations are counted. That is more than any single accelerator discussed here can hold, so full fine-tuning means sharding across GPUs with DeepSpeed ZeRO-3 or FSDP: a full 8x H100 80GB node (640GB), or three to four 192GB accelerators (B200 or AMD MI300X). For parameter-efficient fine-tuning methods (LoRA, QLoRA), memory requirements drop substantially, and A100 80GB or, for smaller models, even A10G instances become viable.
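The memory arithmetic above can be reproduced with a short back-of-the-envelope calculation. The sketch below is an estimate under stated assumptions, not a sizing tool: the function names are illustrative, the upper figure assumes 16-bit weights, 16-bit gradients, and two 16-bit AdamW moments per parameter, the lower figure assumes an 8-bit optimizer state, and activations are left out entirely.

```python
import math

def full_finetune_memory_gb(params_billion, weight_bytes=2, grad_bytes=2, optimizer_bytes=4):
    """Estimate GPU memory in GB for weights + gradients + AdamW state.

    Activations, buffers, and fragmentation are NOT included, so treat
    the result as a floor rather than a sizing target.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param

def min_gpus(total_gb, gpu_memory_gb):
    """Smallest GPU count whose combined memory covers total_gb."""
    return math.ceil(total_gb / gpu_memory_gb)

weights_only = full_finetune_memory_gb(70, grad_bytes=0, optimizer_bytes=0)  # 140
high = full_finetune_memory_gb(70)                    # 560: two 16-bit AdamW moments
low = full_finetune_memory_gb(70, optimizer_bytes=2)  # 420: assumes 8-bit optimizer state

print(weights_only, low, high)
print(min_gpus(high, 80), min_gpus(high, 192))  # GPU-count floor for 80GB and 192GB parts
```

Under these assumptions the state alone calls for at least seven 80GB GPUs or three 192GB GPUs at the upper estimate. Real runs need headroom for activations on top of that.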

High-throughput production inference

Inference for large models (70B+) is typically memory-bandwidth-bound, not compute-bound, particularly during token-by-token generation at small batch sizes. The compute units spend more time waiting for model weights to arrive from memory than they spend doing arithmetic. This makes memory bandwidth the key metric, not TFLOPS. NVIDIA B200 (8TB/s), AMD MI300X (5.3TB/s), and H100 SXM (3.35TB/s) are the current leading options. AMD MI300X's large memory capacity (192GB) means the full 16-bit weights of a 70B model fit in a single GPU, which has additional latency advantages for certain serving architectures.
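A rough way to see why bandwidth dominates: if generating each token requires streaming every weight from memory once, the decode rate for a single sequence cannot exceed bandwidth divided by model size. The sketch below applies that bound to the bandwidth figures quoted above. It is a simplified ceiling, assuming batch size 1, 16-bit weights, and no KV-cache traffic, and the function name is illustrative.

```python
def decode_tokens_per_s(model_gb, bandwidth_tb_s):
    """Upper bound on single-sequence decode rate when every generated
    token requires reading all model weights from memory once."""
    return bandwidth_tb_s * 1000 / model_gb  # TB/s -> GB/s, then divide by GB per token

MODEL_GB = 140  # 70B parameters at 2 bytes each
BANDWIDTH_TB_S = {"B200": 8.0, "MI300X": 5.3, "H100 SXM": 3.35}

ceilings = {
    name: round(decode_tokens_per_s(MODEL_GB, bw), 1)
    for name, bw in BANDWIDTH_TB_S.items()
}
print(ceilings)  # {'B200': 57.1, 'MI300X': 37.9, 'H100 SXM': 23.9}
```

The H100 figure is a per-GPU bandwidth comparison only, since 140GB of weights does not fit on one 80GB device. Batching raises aggregate throughput well beyond these per-sequence ceilings, so measured numbers on a real serving stack will differ.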

Cost-sensitive, high-volume inference

For applications where per-token inference cost is the primary constraint and model quality requirements are met by smaller models (7B–13B), AWS Inferentia 2, Google TPU v5e, and AMD MI300X (on AMD-partnered cloud providers) offer meaningfully lower cost per token than NVIDIA GPU instances. This tier is increasingly important as enterprises move from pilot to production and encounter the economics of serving millions of users.

Edge and on-device inference

Apple Silicon (M3 Ultra, M4 Max) has emerged as a genuinely capable option for on-device inference of models up to ~70B at moderate throughput. The unified memory architecture (up to 512GB on M3 Ultra) allows larger models than discrete GPU configurations of similar cost. For enterprises deploying AI capabilities on macOS devices or Apple Silicon servers, this is a legitimate consideration rather than a consumer curiosity.

Questions Your Buying Team Should Be Asking

1. Have we mapped our AI workloads to hardware requirements before issuing an RFP or requesting cloud pricing?

The workload-to-hardware mapping should precede procurement, not follow it. Define your workloads explicitly: which models, what parameter sizes, training vs. fine-tuning vs. inference, expected throughput, latency requirements, and utilization projections. Each workload type has different binding constraints — memory capacity, memory bandwidth, compute throughput, interconnect speed — and different hardware options are optimal for different constraint profiles. A procurement RFP issued without this analysis is optimizing for the wrong variable.
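One way to make the mapping concrete is to record each workload as structured data and attach a first-guess binding constraint to it. The sketch below is a deliberately coarse heuristic that mirrors the workload categories in this brief. The `Workload` class, its field names, the 70B threshold, and the sample entries are all illustrative assumptions, not a Stackcurve methodology.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    kind: str              # "training", "fine-tuning", or "inference"
    params_billion: float

def binding_constraint(w: Workload) -> str:
    """First-guess binding constraint, following the categories in this brief."""
    if w.kind == "training":
        return "interconnect and memory bandwidth"
    if w.kind == "fine-tuning":
        return "memory capacity"
    if w.kind == "inference":
        # Large models are bandwidth-bound; smaller ones are usually a cost question.
        return "memory bandwidth" if w.params_billion >= 70 else "cost per token"
    raise ValueError(f"unknown workload kind: {w.kind!r}")

# Hypothetical portfolio used only to show the shape of the matrix.
portfolio = [
    Workload("support-assistant", "inference", 70),
    Workload("ticket-classifier", "inference", 7),
    Workload("domain-adaptation", "fine-tuning", 34),
]
matrix = {w.name: binding_constraint(w) for w in portfolio}
for name, constraint in matrix.items():
    print(f"{name}: {constraint}")
```

A real matrix would add throughput, latency, and utilization targets per row. The point of the exercise is that one portfolio usually produces several different constraints, and therefore several candidate accelerators.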

2. What is our sensitivity to vendor lock-in, and have we priced it?

NVIDIA CUDA-based workloads can run on any cloud with NVIDIA GPU instances or on-premises NVIDIA hardware. AMD ROCm workloads run on AMD hardware across multiple clouds. Google TPU workloads run exclusively on Google Cloud. AWS Trainium/Inferentia workloads run exclusively on AWS. The portability cost is real: a workload written for TPU requires significant re-engineering to move to another accelerator. Price the lock-in risk explicitly — what is the cost of migration if the vendor's pricing or availability changes materially?
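Pricing lock-in can start as a simple expected-value calculation: estimate what a forced migration would cost, then weight it by how likely you think one is over the planning horizon. The sketch below shows the arithmetic only. The six-week duration echoes the migration case described earlier in this brief, while the team size, weekly cost, and probability are placeholder assumptions to be replaced with your own estimates.

```python
def expected_lockin_cost(migration_weeks, engineers, weekly_cost_per_engineer,
                         probability_of_forced_move):
    """Probability-weighted engineering cost of having to leave a platform."""
    migration_cost = migration_weeks * engineers * weekly_cost_per_engineer
    return migration_cost * probability_of_forced_move

# Placeholder inputs: 6 weeks, 3 engineers, 5,000 USD per engineer-week,
# and a 25% chance of a forced move within the planning horizon.
risk_usd = expected_lockin_cost(6, 3, 5_000, 0.25)
print(risk_usd)  # 22500.0
```

The figure leaves out revalidation, downtime, and performance retuning, all of which tend to be larger for TPU or Trainium workloads than for a move between NVIDIA clouds.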

3. Have we compared AMD MI300X pricing against H100 for our inference workloads specifically?

For inference-heavy production workloads, AMD MI300X has closed the performance gap with H100 substantially while maintaining a cost advantage in most procurement configurations. If your production workload is primarily inference (as opposed to training), a side-by-side benchmark of H100 and MI300X using your specific model, batch size, and sequence length configuration is worth the 2–3 weeks of engineering time. The cost difference at production scale can be material.
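Once a side-by-side benchmark has produced sustained throughput numbers, converting them to cost per million tokens is straightforward. The sketch below shows one way to do the conversion. The hourly prices and throughput values are placeholders chosen for round arithmetic, not measured or quoted figures, and the function names are illustrative.

```python
def cost_per_million_tokens(hourly_price_usd, tokens_per_second):
    """Serving cost per one million generated tokens at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

def relative_saving(baseline_cost, candidate_cost):
    """Fractional cost reduction of the candidate versus the baseline."""
    return (baseline_cost - candidate_cost) / baseline_cost

# Placeholder inputs only: substitute your own instance pricing and the
# throughput measured with your model, batch size, and sequence lengths.
baseline = cost_per_million_tokens(hourly_price_usd=3.60, tokens_per_second=1000)
candidate = cost_per_million_tokens(hourly_price_usd=2.70, tokens_per_second=1000)
print(f"{relative_saving(baseline, candidate):.0%} lower cost per million tokens")
```

Run the comparison at the utilization you expect in production. An accelerator that is cheaper per hour but idles more often can lose its advantage.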

4. What software frameworks does our team currently use, and are they certified on our target hardware?

Hardware compatibility with AI frameworks is not binary. PyTorch and JAX run on NVIDIA, AMD, and Google TPU — but performance optimization libraries (TensorRT-LLM, Flash Attention, operator fusion optimizations) have varying levels of support across hardware vendors. Before committing to AMD ROCm or AWS Trainium, validate that every library your team depends on has a tested, production-stable release for that hardware. ROCm support for vLLM and SGLang has reached production quality, but hardware-specific performance tuning libraries may require additional validation.
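A first, very small step in that validation is confirming that every dependency is at least installable and importable in an environment built for the target hardware. The sketch below checks this with the standard library only. The dependency list is hypothetical, and a passing check says nothing about performance or stability, which still need benchmark and soak testing.

```python
import importlib.util

def missing_libraries(required):
    """Return the top-level package names in `required` that cannot be
    found in the current Python environment."""
    return [name for name in required if importlib.util.find_spec(name) is None]

# Hypothetical dependency list: replace with the packages your team relies on.
REQUIRED = ["torch", "vllm", "sglang"]

missing = missing_libraries(REQUIRED)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Running it inside the vendor's recommended container image for the target accelerator gives a quick list of gaps before deeper testing begins.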

5. What is our total cost of ownership model over 3 years, including power, cooling, software, and support — not just hardware acquisition cost?

High-density GPU hardware (NVIDIA DGX H100 systems, B200 systems) requires significant power and cooling infrastructure: a single DGX H100 system draws up to 10.2kW under load, so a four-system rack requires 40kW+ of power delivery and liquid or high-density air cooling, and ten systems exceed 100kW. On-premises GPU infrastructure that does not account for datacenter readiness (power capacity, cooling upgrades, physical space) consistently delivers surprises in total cost. Cloud GPU pricing, while higher on a per-GPU-hour basis, bundles these infrastructure costs and should be modeled against the full on-premises TCO, not just hardware acquisition cost.
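The power side of the TCO model is simple enough to sketch directly. The example below starts from the 10.2kW per-system figure quoted above and adds an energy cost estimate. The electricity price, PUE (power usage effectiveness), and utilization are placeholder assumptions, and the function names are illustrative.

```python
DGX_H100_MAX_KW = 10.2  # per-system draw under load, as quoted above

def rack_power_kw(systems_per_rack, kw_per_system=DGX_H100_MAX_KW):
    """IT load of one rack at full draw."""
    return systems_per_rack * kw_per_system

def energy_cost_usd(it_load_kw, years, usd_per_kwh, pue, utilization=1.0):
    """Electricity cost over the period, with cooling overhead folded in via PUE."""
    hours = years * 365 * 24
    return it_load_kw * utilization * pue * hours * usd_per_kwh

four_system_rack = rack_power_kw(4)   # about 40.8 kW
ten_systems = rack_power_kw(10)       # about 102 kW

# Placeholder assumptions: 0.10 USD/kWh, PUE of 1.4, full utilization, 3 years.
three_year_energy = energy_cost_usd(four_system_rack, years=3, usd_per_kwh=0.10, pue=1.4)
print(f"{four_system_rack:.1f} kW per rack, {three_year_energy:,.0f} USD over 3 years")
```

Energy is only one line of the on-premises model. Facility upgrades, support contracts, and staffing belong in the same spreadsheet before the comparison against cloud pricing is fair.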

The Stackcurve Take

The accelerator market has moved from a single dominant vendor with no credible alternatives to a genuine multi-vendor competitive landscape within a period of approximately 24 months. AMD MI300X is a legitimate production choice for inference workloads. Cloud-native accelerators from AWS and Google deliver compelling economics within their respective ecosystems. The newest NVIDIA generation (B200) has reset the performance ceiling for organizations that require maximum throughput and can operate within the NVIDIA ecosystem.

This is good news for enterprise buyers — competition reduces cost and increases optionality. It is also an increase in decision complexity. The buyers who will extract the most value from this market are the ones who do the workload analysis first, understand their constraints at each layer of the stack, and evaluate hardware options against real performance requirements rather than brand familiarity.

The 2026 Stackcurve AI Infrastructure CURVE™ Report covers GPU, CPU, TPU, and custom silicon accelerator options with benchmark data, vendor assessments, and workload-matching guidance, and is free to download.

Stackcurve Advisory Briefs are independent research. No vendor pays for placement, tier assignment, or editorial influence. The CURVE™ methodology is disclosed in full at stackcurve.net/research/methodology.