Model Serving at Scale: The Infrastructure Failures Nobody Talks About

The Question

The demo ran perfectly. The Jupyter notebook, the internal Slack bot, the proof-of-concept that got the budget approved — all of them performed exactly as promised. The model was accurate, fast enough, and the stakeholders were impressed.

Then it went to production.

Within 72 hours, the first incident: a burst of concurrent users caused response times to spike from 800ms to 45 seconds. The on-call engineer restarted the serving container. It happened again the next morning. Two weeks later, a partial failure under load cascaded into a complete service outage that lasted six hours because nobody had implemented circuit breakers.

This story — or a close variant of it — is one of the most common AI infrastructure failure narratives that enterprise engineering teams share in post-incident reviews. The technical cause changes. The organizational cause is always the same: the team that built the model was not the team that understood distributed systems serving, and the organization did not bridge that gap before go-live.

Getting a model to work in a notebook is a data science problem. Getting a model to serve ten thousand concurrent users with sub-second P99 latency and 99.9% uptime is a distributed systems engineering problem — and those two problem domains require entirely different expertise and infrastructure.

Organizations that treat production serving as an extension of model development consistently hit the same set of production failures.

Why This Matters Now

In Q2 2024, a major European retailer launched an AI-powered product recommendation system built on a fine-tuned Llama 2 70B model. The system had passed all internal testing and performed well under synthetic load tests that simulated up to 500 concurrent users. The production launch coincided with a promotional event that drove traffic to 8,000 concurrent users within the first hour.

The inference server — a single-instance deployment without autoscaling — began queuing requests. Response latency climbed from 1.2 seconds to over two minutes. The client application had no timeout handling, so connections piled up. Memory pressure caused the inference server to begin returning 503 errors. The client application interpreted errors as retryable and began retrying aggressively. The retry storm amplified the load, completing the failure.

The incident lasted four hours and affected approximately 180,000 users during a high-revenue promotional window. The root cause analysis identified five distinct engineering gaps: no horizontal scaling, no continuous batching, no request queue depth limits, no client-side timeout enforcement, and no circuit breaker between the application tier and the inference tier.

Each of these gaps represents a known, well-documented production serving failure mode. None of them are obscure. All of them were absent because the team's serving infrastructure had been designed by the model development team — experts in training and evaluation — rather than engineers with distributed systems production experience.
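Of the five gaps, the request queue depth limit is probably the simplest to illustrate. The following is a minimal sketch, not the retailer's code or any framework's API: `AdmissionQueue` and `try_admit` are hypothetical names, and the depth of 2 is chosen only to keep the demonstration short. The idea is that a full queue rejects new work immediately (for example with a 429 or 503 and a Retry-After hint) instead of letting latency grow without bound.

```python
import queue


class AdmissionQueue:
    """Bounded request queue that sheds load instead of growing forever.

    Hypothetical helper for illustration; not part of any serving framework.
    """

    def __init__(self, max_depth):
        self._q = queue.Queue(maxsize=max_depth)

    def try_admit(self, request):
        """Return True if the request was queued, False if it was shed."""
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            # Caller should fail fast here rather than wait in line.
            return False

    def depth(self):
        return self._q.qsize()


admission = AdmissionQueue(max_depth=2)
results = [admission.try_admit(f"req-{i}") for i in range(3)]
print(results)  # the third request is rejected because the queue is full
```

The right depth depends on measured service time and the latency budget, so it is a number to derive from load tests rather than guess.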

This incident type became frequent enough through 2024 and 2025 that NVIDIA, Hugging Face, and the vLLM project all significantly expanded their production serving documentation and best practices guides. The gap between demo and production serving had become a widely recognized industry problem.

What the CURVE™ Data Shows

The 2026 Stackcurve AI Infrastructure CURVE™ Report evaluated model serving infrastructure across four dimensions: throughput efficiency, operational maturity, hardware optimization, and ease of enterprise deployment. Vendors and frameworks assessed include NVIDIA Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, Hugging Face Text Generation Inference (TGI), BentoML, and Seldon Core.

The CURVE™ analysis found significant differentiation in how serving frameworks handle the five primary production failure modes: memory fragmentation, batching efficiency, cold start management, failure cascade handling, and multi-node tensor parallelism. Organizations running data center GPUs (H100, A100) achieved dramatically different throughput depending on whether they used inference-optimized frameworks (TensorRT-LLM, vLLM) or generic serving infrastructure.

The report also benchmarked real-world throughput for Llama 3 70B and Mixtral 8x7B across serving frameworks at three concurrency levels (100, 1,000, and 10,000 concurrent requests), providing an empirical basis for framework selection at different scale requirements.

The full vendor rankings are in the 2026 Stackcurve AI Infrastructure CURVE™ Report — free to download.

The Gap Most Buyers Miss

The failure modes in production model serving are well understood by the infrastructure engineering community. They are not well understood by the data science and ML teams who typically own AI deployments through the proof-of-concept phase. Here is where production breaks, and why.

Memory Fragmentation Under Variable Sequence Length

LLM inference is unusual among deep learning workloads because input and output sequences vary dramatically in length. A naive inference server reserves a fixed memory block for each request, sized for the maximum possible sequence length. As requests complete at different times and new requests arrive with different lengths, GPU memory becomes fragmented: large blocks sit reserved but unused, and the server rejects new requests as if memory were exhausted even though a large share of it holds no live data.

vLLM's PagedAttention mechanism, published in 2023, addresses this problem by managing the KV cache in GPU memory as fixed-size pages, the same virtual memory technique operating systems use to limit fragmentation of main memory. Without PagedAttention or an equivalent mechanism, a serving deployment may effectively use only 40–60% of available GPU memory for active inference work.
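The effect of reserving for the worst case can be seen with a toy accounting model. This is a sketch under stated assumptions, not vLLM's implementation: the request lengths, the 2,048-token maximum, and the 16-token block size are all made-up illustration values, and the functions only count token slots rather than real GPU bytes.

```python
import math


def fixed_allocation_utilization(seq_lens, max_seq_len):
    """Share of reserved token slots in use when every request
    reserves max_seq_len slots up front."""
    reserved = len(seq_lens) * max_seq_len
    return sum(seq_lens) / reserved


def paged_allocation_utilization(seq_lens, block_size):
    """Share of reserved token slots in use when slots are handed out
    in fixed-size blocks on demand (the paging idea)."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return sum(seq_lens) / reserved


# Hypothetical mix of per-request token counts.
seq_lens = [120, 900, 45, 2048, 310, 64]

print(f"fixed-size reservation: {fixed_allocation_utilization(seq_lens, 2048):.0%}")
print(f"paged allocation:       {paged_allocation_utilization(seq_lens, 16):.0%}")
```

With paging, the only waste is the unused tail of each request's last block, which is why the utilization gap widens as the request mix becomes more variable.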

Batching Efficiency

Naive inference serves one request at a time, processing it start to finish before accepting the next request. This approach leaves GPU cores idle during memory transfer operations and wastes parallelization capacity. Continuous batching — implemented in vLLM, TGI, and TensorRT-LLM — interleaves multiple requests through the inference pipeline, achieving 5–10x throughput improvement over naive sequential serving. Organizations that deploy models without continuous batching are effectively wasting 80–90% of their GPU capacity at any given moment under concurrent load.
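A small simulation makes the scheduling difference concrete. This is a simplified sketch, not how vLLM, TGI, or TensorRT-LLM schedule internally: it assumes every decode step costs the same regardless of batch size, ignores prefill, and uses invented output lengths and a batch size of 4. It compares sequential serving, static batching (a batch runs until its longest request finishes), and continuous batching (a freed slot is refilled after every step).

```python
from collections import deque


def sequential_steps(output_lens):
    """One request at a time, start to finish."""
    return sum(output_lens)


def static_batching_steps(output_lens, batch_size):
    """Each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Free slots are refilled from the queue after every decode step."""
    waiting = deque(output_lens)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Requests with one step left finish now; the rest move forward.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical output lengths in tokens (each must be at least 1).
output_lens = [10, 200, 15, 12, 180, 8, 20, 9]

print("sequential:", sequential_steps(output_lens))
print("static:    ", static_batching_steps(output_lens, 4))
print("continuous:", continuous_batching_steps(output_lens, 4))
```

In this toy workload the total is bounded by the single longest request once slots are refilled continuously. Real gains depend on the length distribution and on how step cost grows with batch size, which the sketch deliberately leaves out.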

Cold Start Latency

A Llama 3 70B model at FP16 precision requires approximately 140 GB of storage and 60–120 seconds to load from disk into GPU memory. An inference server that does not pre-load models will impose that latency on the first request after any restart — and on any burst traffic event that triggers autoscaling of new replicas. Proper serving infrastructure pre-loads models during replica initialization and manages replica count proactively rather than reactively.
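The figures above follow from simple arithmetic, which is worth reproducing for your own model and storage. The sketch below assumes 2 bytes per parameter for FP16 and 1 GB = 10^9 bytes, and the function names are illustrative. It works backward from the 60–120 second load time to the sustained read rate that time implies.

```python
def weights_size_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def implied_throughput_gb_per_s(size_gb, load_seconds):
    """Sustained read rate needed to load size_gb in load_seconds."""
    return size_gb / load_seconds


size_gb = weights_size_gb(70, 2)  # FP16 stores 2 bytes per parameter

for seconds in (60, 120):
    rate = implied_throughput_gb_per_s(size_gb, seconds)
    print(f"{size_gb} GB in {seconds} s needs about {rate:.2f} GB/s sustained")
```

If the storage path feeding a new replica cannot sustain roughly that rate, cold starts will take longer than the range quoted here, which is one more reason to measure rather than assume.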

Cascading Failures and Retry Storms

Inference servers under memory or compute pressure begin returning errors. Client applications without proper backpressure handling interpret errors as transient and retry immediately. Retry volume amplifies load on an already-stressed server, accelerating the failure. This pattern — a retry storm — is one of the most common causes of complete AI service outages that begin as partial degradation. Circuit breakers (Hystrix pattern, implemented in service mesh or application code) and exponential backoff with jitter in client retry logic are the standard mitigations. Both are frequently absent in AI serving deployments built by data science teams.
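Exponential backoff with jitter is short enough to show in full. This is a minimal sketch of the "full jitter" variant, assuming a hypothetical `call_with_retries` wrapper and illustrative defaults (0.5 s base, 30 s cap, 4 retries). It catches every exception for brevity; production code would retry only errors known to be transient, such as a 503 or a timeout.

```python
import random
import time


def call_with_retries(fn, max_retries=4, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn, retrying on failure with exponential backoff and full jitter.

    The wait before retry k is uniform in [0, min(cap, base * 2**k)], so
    clients that failed together do not retry together.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the error
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


# Demonstration with a stand-in for an inference call that fails twice.
state = {"calls": 0}


def flaky_inference():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "ok"


recorded_delays = []
print(call_with_retries(flaky_inference, sleep=recorded_delays.append))
print(f"retried {len(recorded_delays)} times")
```

The `sleep` parameter is injected so the example runs instantly; by default it calls `time.sleep`. A bounded retry count matters as much as the jitter, because unlimited retries recreate the storm at a slower pace.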

Tensor Parallelism and Inter-Node Networking

Models too large to fit on a single GPU (70B+ parameters at FP16) require tensor parallelism — splitting the model across multiple GPUs for inference. This creates a hard dependency on inter-GPU communication bandwidth. On NVIDIA hardware, NVLink (within a node) or InfiniBand (between nodes) provides the necessary bandwidth. Standard Ethernet — even 100GbE — introduces communication overhead that exceeds compute latency and degrades end-to-end inference performance by 3–5x compared to NVLink. Cloud deployments must select instance types with appropriate interconnects (p4d.24xlarge, p5.48xlarge on AWS) rather than simply provisioning the GPU count required.
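A rough sizing calculation shows why the GPU count alone is not the whole story. The sketch below is an estimate under loud assumptions: it counts weights only, splits them evenly across GPUs, restricts the tensor parallel degree to 1, 2, 4, or 8, and reserves a hypothetical 30% of each GPU for KV cache and activations. None of these values come from a vendor specification.

```python
def weight_shard_gb(params_billion, bytes_per_param, tp_degree):
    """Approximate weight memory per GPU when split evenly across GPUs."""
    return params_billion * bytes_per_param / tp_degree


def smallest_tp_degree(params_billion, bytes_per_param, gpu_mem_gb,
                       kv_headroom_fraction=0.3, candidates=(1, 2, 4, 8)):
    """Smallest candidate degree whose weight shard fits the per-GPU budget.

    kv_headroom_fraction is an assumed share of GPU memory kept free for
    KV cache and activations. Returns None if no candidate fits.
    """
    budget_gb = gpu_mem_gb * (1 - kv_headroom_fraction)
    for tp in candidates:
        if weight_shard_gb(params_billion, bytes_per_param, tp) <= budget_gb:
            return tp
    return None


# 70B parameters at FP16 (2 bytes each) on 80 GB GPUs.
tp = smallest_tp_degree(70, 2, 80)
print(f"tensor parallel degree: {tp}")
print(f"weights per GPU: {weight_shard_gb(70, 2, tp)} GB")
```

Every GPU in that group exchanges partial results during each forward pass, so the interconnect between them sits on the critical path of every generated token.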

Questions Your Buying Team Should Be Asking

1. Is the team responsible for production serving the same team that built the model, and do they have distributed systems production experience?

The single most predictive indicator of production serving reliability is whether the team owning the deployment has operated distributed services under production load before. Data science expertise does not transfer to distributed systems engineering. If the model team owns production serving, assess whether they have the specific background — or whether platform engineering needs to own the serving layer.

2. Does our serving infrastructure implement continuous batching, and what throughput benchmark validates it?

Request a specific throughput benchmark from your serving team or vendor: requests per second at P95 latency under target concurrent load. If they cannot produce this number, the serving infrastructure has not been validated at scale. Ask specifically whether continuous batching is enabled and which framework implements it.
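The number being requested is easy to compute once per-request latencies have been recorded during a load test. The sketch below uses the nearest-rank percentile definition and synthetic latencies; `summarize` and its field names are hypothetical, and a real benchmark would take its samples from a load generator run at the target concurrency.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s, wall_clock_s):
    """Throughput and tail latency for one load-test run."""
    return {
        "requests_per_second": len(latencies_s) / wall_clock_s,
        "p95_s": percentile(latencies_s, 95),
        "p99_s": percentile(latencies_s, 99),
    }


# Synthetic sample: 100 requests completed over a 20 second run.
latencies_s = [i / 100 for i in range(1, 101)]
print(summarize(latencies_s, wall_clock_s=20.0))
```

Reporting throughput and tail latency together matters because either number alone can be improved at the expense of the other.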

3. What happens when the inference server starts returning errors? Is there a circuit breaker between the application tier and the serving tier?

Walk through the failure scenario explicitly. If the inference server begins returning 503 errors, what does the application tier do? Does it retry? With what backoff logic? Is there a circuit breaker that stops request flow to a failing replica? The absence of a clear answer to this question is a strong signal that the failure cascade scenario has not been engineered for.
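For teams unsure what a clear answer looks like, the core of a circuit breaker fits in a few dozen lines. This is a minimal, single-threaded sketch of the pattern, not Hystrix or any service mesh implementation: `CircuitBreaker`, `CircuitOpenError`, and the threshold and timeout defaults are illustrative assumptions, and a production version would need locking and per-replica state.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast and keep load off the struggling tier.
                raise CircuitOpenError("inference tier marked unhealthy")
            # Timeout elapsed: half-open, allow a trial request through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)
print(breaker.call(lambda: "ok"))
```

The application tier wraps each inference call in `breaker.call(...)` and treats `CircuitOpenError` as a signal to serve a fallback rather than to retry.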

4. How are new replicas initialized during autoscaling events, and what is the latency impact of cold starts?

Ask your team to demonstrate an autoscaling event: trigger a load spike that causes a new replica to be provisioned, and measure the latency experienced by requests during replica initialization. If cold start latency is not acceptable, ask how model pre-warming is implemented.

5. For models requiring tensor parallelism, what interconnect is provisioned between GPU nodes, and has end-to-end inference latency been benchmarked on that interconnect?

This question is relevant for any model whose weights and KV cache do not fit in a single GPU's memory (typically 70B+ parameters at FP16; a 13B model at FP32 needs about 52 GB for weights alone, which already exceeds a 40 GB GPU). Ensure the infrastructure team can demonstrate that NVLink or InfiniBand is in use and that latency benchmarks reflect the actual inter-node networking, not a single-node test.

The Stackcurve Take

The engineering community has known about these failure modes for years. vLLM was released in 2023 specifically to address memory fragmentation and batching inefficiency. Circuit breakers have been a production distributed systems standard since Netflix open-sourced the Hystrix library in 2012. Cold start management is a well-understood problem in serverless infrastructure, with roughly a decade of operational history.

The gap is not knowledge. It is organizational. AI model development teams are optimized for research velocity, experimentation, and model accuracy — not for production reliability engineering. When those teams carry a model from training through to production without a handoff to engineers with production distributed systems experience, the known failure modes manifest in predictable sequence.

The solution is not to retrain data scientists as infrastructure engineers. It is to treat production model serving as a first-class infrastructure engineering concern, staff it accordingly, and select serving frameworks (vLLM, NVIDIA Triton, TensorRT-LLM) with the operational maturity and documentation depth that enterprise production requirements demand.

The 2026 Stackcurve AI Infrastructure CURVE™ Report covers production model serving frameworks, throughput benchmarks, and operational maturity assessments across the leading serving infrastructure options.

Stackcurve Advisory Briefs are independent research. No vendor pays for placement, tier assignment, or editorial influence. The CURVE™ methodology is disclosed in full at stackcurve.net/research/methodology.