The Question
The first infrastructure decision in any enterprise LLM deployment is not hardware selection, model choice, or serving framework. It is a more fundamental question: where does the model live, who operates it, and under what contractual and technical controls does your data pass through it? The answer to that question determines your cost structure, your data privacy posture, your customization ceiling, and your operational burden for as long as the application runs in production.
Three deployment models are available: call a SaaS API (OpenAI, Anthropic, Google), use a managed cloud LLM service (Azure OpenAI, AWS Bedrock, Google Vertex AI), or self-host an open-source model (Llama 3, Mistral, Mixtral, Qwen, DeepSeek) on your own infrastructure. Each deployment model has a profile that suits specific organizational circumstances, and each has failure modes that appear when organizations deploy it in circumstances it was not designed for.
The deployment model decision is frequently made by default rather than by analysis. Organizations with existing Microsoft relationships default to Azure OpenAI Service. Organizations with strong open-source engineering cultures default to self-hosting. Organizations with minimal AI engineering capacity default to SaaS APIs. These defaults are not wrong — but they should be validated against the specific requirements of each use case, particularly as AI programs scale and the cost and risk profile of the deployment model becomes more significant.
The LLM deployment model decision is not primarily technical — it is a business decision about data control, vendor dependency, and cost structure that should be made before any infrastructure investment.
Why This Matters Now
In November 2024, a mid-size financial services firm discovered that its customer-facing LLM application — built on the OpenAI API and deployed through an internal product layer — had been sending complete customer conversation transcripts, including account inquiry details and partial authentication information, to OpenAI's inference endpoints. The API integration had been built without reviewing OpenAI's data processing terms, which at the time of deployment did not guarantee that API request data would not be used for model training under certain conditions (terms that OpenAI has since updated multiple times). The firm's legal team, which had not been involved in the initial deployment, required an emergency architectural review when the data flow was surfaced during a routine compliance audit.
The remediation required migrating to Azure OpenAI Service — which provides the same GPT models under a Microsoft enterprise agreement with explicit data processing commitments and private endpoint configuration — within 90 days, at a cost that included both migration engineering and accelerated procurement of Azure capacity. The business case for Azure OpenAI had been known prior to the initial deployment; the decision to use the OpenAI API directly had been made by the engineering team for development velocity reasons without legal or compliance review.
This case is representative of a pattern that has appeared repeatedly across industries as LLM applications moved from internal tools to customer-facing production: the deployment model decision was delegated to engineering teams who optimized for development speed, without organizational review of the data governance and compliance implications at production scale. The result is a category of remediation cost that is entirely avoidable with upfront decision governance.
The EU AI Act has added urgency to deployment model decision governance for organizations operating in European markets. Its obligations for general-purpose AI models began applying in August 2025, and most of its requirements for high-risk AI systems, including the data governance provisions, are scheduled to apply from August 2026. Data localization requirements, processing transparency obligations, and human oversight mandates all have direct implications for which deployment model is permissible for specific application types.
What the CURVE™ Data Shows
The 2026 Stackcurve AI Infrastructure CURVE™ Report evaluated LLM deployment models and the vendors that serve each tier: SaaS APIs (OpenAI, Anthropic, Google DeepMind), managed cloud services (Azure OpenAI Service, AWS Bedrock, Google Vertex AI, IBM watsonx), and self-hosted open-source infrastructure (Llama 3, Mistral, Mixtral, Qwen, DeepSeek R1, with serving infrastructure from vLLM, TensorRT-LLM, SGLang).
Anthropic's API and Azure OpenAI Service scored highest among SaaS and managed service tiers respectively, with Anthropic scoring particularly well on safety and alignment documentation (relevant for regulated industries) and Azure OpenAI scoring highest on enterprise compliance certification breadth (FedRAMP, ISO 27001, SOC 2, HIPAA BAA availability).
AWS Bedrock scored strongly for organizations already committed to AWS, offering the broadest model selection of any managed service (Anthropic Claude, Meta Llama, Mistral, Amazon Titan, Cohere, AI21) under a single enterprise agreement and security perimeter.
In the self-hosted tier, Meta's Llama 3 family (8B, 70B, 405B) scored highest on model quality relative to operational requirements, with strong ecosystem support from virtually every inference serving framework. DeepSeek R1 scored notably for reasoning-intensive workloads with performance that matched frontier models on specific benchmarks at substantially lower self-hosting cost. Mistral's models (Mistral 7B, Mixtral 8x7B, Mistral Large) scored well on European data sovereignty use cases due to Mistral AI's French headquarters and available European cloud partnerships.
The full vendor rankings are in the 2026 Stackcurve AI Infrastructure CURVE™ Report — free to download.
The Gap Most Buyers Miss
Each deployment model has a profile of genuine advantages and genuine failure modes. The gap is that most organizations discover the failure modes reactively — at production scale, under cost or compliance pressure — rather than proactively in the deployment model decision process.
SaaS API: The Hidden Costs of Simplicity
SaaS APIs (OpenAI API, Anthropic API, Google Gemini API) offer genuine advantages: zero infrastructure, immediate access to frontier model quality, no MLOps team required, and per-token pricing that works well for variable-volume early-stage applications. The failure modes are less visible.
Per-token pricing is linear: cost scales in direct proportion to usage, with no economies of scale. The arithmetic is worth doing explicitly. At $0.0025 per 1,000 tokens (OpenAI's GPT-4o list price for input tokens; output tokens are priced higher), 500 million tokens per month, a reasonable figure for a mid-size customer service or document analysis application, costs roughly $1,250 per month. A bill of approximately $125,000 per month at the same rate corresponds to about 50 billion tokens. At that level of spend, self-hosted alternatives frequently break even within 12 months of infrastructure investment. Organizations that prototype on SaaS APIs without modeling scale economics frequently encounter budget pressure as they enter production.
Rate limits are a second failure mode. SaaS API rate limits (requests per minute, tokens per minute) are set per usage tier and can become a hard constraint on application scalability. Tier increases must be requested from the API provider and are not guaranteed on any specific timeline.
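A quick capacity check makes this constraint concrete. The sketch below is illustrative only: the function name, the traffic figures, and the limits are hypothetical placeholders, not any provider's actual tier values.

```python
def rate_limit_headroom(peak_requests_per_min, avg_tokens_per_request,
                        rpm_limit, tpm_limit):
    """Return the fraction of each limit consumed at peak load.

    A value above 1.0 means the application would be throttled at peak.
    """
    peak_tokens_per_min = peak_requests_per_min * avg_tokens_per_request
    return {
        "rpm_utilization": peak_requests_per_min / rpm_limit,
        "tpm_utilization": peak_tokens_per_min / tpm_limit,
    }


# Hypothetical figures: 400 requests/min at peak, 1,500 tokens per request,
# against an assumed tier of 500 requests/min and 450,000 tokens/min.
usage = rate_limit_headroom(400, 1_500, rpm_limit=500, tpm_limit=450_000)
throttled = any(u > 1.0 for u in usage.values())
print(usage, "throttled at peak:", throttled)
```

In this example the request-rate limit has headroom while the token-rate limit does not, which is a common way for the constraint to go unnoticed until production traffic arrives.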
Data processing terms are a third failure mode — illustrated by the case above. These terms change over time, require active monitoring, and should be reviewed by legal and compliance before any production deployment involving non-public data.
Managed Cloud Service: The Premium for Compliance
Azure OpenAI Service, AWS Bedrock, and Google Vertex AI offer the same frontier models as SaaS APIs under enterprise agreements with explicit data processing commitments: data residency options, private endpoint configurations, no use of customer data for model training, and the security certifications enterprises need for regulated industries (FedRAMP, SOC 2, HIPAA BAA, ISO 27001). The premium paid for these commitments is meaningful and often justified: the compliance and legal overhead of evaluating SaaS API terms is replaced by an enterprise agreement with a trusted cloud vendor.
The failure modes: pricing is higher than SaaS APIs in most configurations (the enterprise agreement premium adds cost). Model availability lags SaaS release timelines — when OpenAI releases a new model, Azure OpenAI typically makes it available weeks to months later, as Microsoft runs its own safety review process. Model selection per platform is narrower than the SaaS tier; Azure OpenAI serves only OpenAI models, while AWS Bedrock and Google Vertex AI have multi-model selection but do not include every available model.
Self-Hosted Open Source: Full Control, Full Responsibility
Self-hosting an open-source model (Llama 3 70B, Mistral Large, DeepSeek R1, Qwen 2.5) on owned or dedicated cloud infrastructure offers full data control, no per-token cost (only compute cost), unlimited customization (fine-tuning, RLHF, model merging), and no vendor dependency for model serving. The cost economics are compelling at high volume: a 70B model served on 2x MI300X GPUs using vLLM with continuous batching can generate 800–1,200 tokens per second at approximately $4–6/hr in compute cost. At sustained full utilization, that works out to roughly $1 to $2 per million tokens, which undercuts frontier API list pricing as long as the hardware is kept busy.
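The per-token cost implied by those throughput and compute figures can be checked directly. The following is a minimal sketch: the function name and the utilization parameter are assumptions introduced for illustration, and the 30% utilization case is a hypothetical figure, not a measurement.

```python
def cost_per_million_tokens(hourly_compute_usd, tokens_per_second, utilization=1.0):
    """Compute cost in USD per one million generated tokens.

    utilization is the fraction of each hour the GPUs spend serving real
    traffic; idle time still costs money, so lower utilization raises the
    effective per-token cost.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_compute_usd / tokens_per_hour * 1_000_000


# Figures from the text: $4-6/hr and 800-1,200 tokens/s.
best = cost_per_million_tokens(4.0, 1200)    # cheapest corner of the range
worst = cost_per_million_tokens(6.0, 800)    # most expensive corner
# Assumed 30% utilization, a hypothetical figure for a lightly loaded cluster.
idle_heavy = cost_per_million_tokens(6.0, 800, utilization=0.30)
print(f"${best:.2f} to ${worst:.2f} per 1M tokens at full load; "
      f"${idle_heavy:.2f} at 30% utilization")
```

Utilization is included as a parameter because the advantage depends on it: the same hardware costs several times more per token when it sits idle for most of each hour.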
The failure modes are real. Model quality for self-hosted open-source models is below frontier for complex reasoning, long-context tasks, and instruction-following on ambiguous prompts. The gap has narrowed substantially — Llama 3 405B and DeepSeek R1 have demonstrated frontier-competitive performance on specific benchmarks — but for enterprise use cases where model quality is the primary constraint, frontier APIs currently maintain an advantage.
Operational burden is the second failure mode. Self-hosting requires GPU infrastructure, an MLOps team capable of operating serving infrastructure, security responsibility for the model endpoint, reliability engineering (SLA, redundancy, failover), and model update and security patch management. Organizations without this capability should not underestimate its cost.
Questions Your Buying Team Should Be Asking
1. Has legal and compliance reviewed the data processing terms for every external LLM endpoint our applications use in production?
This is a governance question, not an engineering question, and it should be asked before deployment rather than after. The review should cover: which data categories are sent to the endpoint (PII, PHI, financial data, proprietary business information), what the provider's data processing agreement covers (data residency, training data opt-out, subprocessor disclosures), and whether the terms align with the organization's compliance obligations. For SaaS APIs, terms change periodically and should be reviewed on a scheduled basis, not just at initial deployment.
2. Have we modeled our per-token costs at 6-month and 18-month projected volume — and identified the volume threshold at which self-hosting breaks even?
Take your current API cost per 1,000 tokens and your projected monthly token volume at 6 and 18 months, and calculate the monthly API cost at those volumes. Then model the self-hosting alternative: GPU infrastructure, serving software, and MLOps staffing for an open-source model of comparable quality. The break-even calculation is not complex. It is a spreadsheet exercise that most organizations should complete before committing to a SaaS API for production deployments with material scale projections.
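The same spreadsheet exercise can be sketched in a few lines of code. The rate below is the $0.0025 per 1,000 tokens quoted earlier in this brief. The volume projections and the monthly self-hosting budget are hypothetical placeholders to be replaced with your own figures.

```python
def monthly_api_cost(tokens_per_month, usd_per_1k_tokens):
    """Linear per-token pricing: cost grows in direct proportion to volume."""
    return tokens_per_month / 1_000 * usd_per_1k_tokens


def break_even_tokens_per_month(self_host_monthly_usd, usd_per_1k_tokens):
    """Monthly volume at which API spend equals a fixed self-hosting budget."""
    return self_host_monthly_usd / usd_per_1k_tokens * 1_000


RATE = 0.0025  # USD per 1,000 tokens, the rate quoted earlier in this brief
# Hypothetical projections and self-hosting budget (GPUs, software, staffing).
projections = {"month 6": 2_000_000_000, "month 18": 20_000_000_000}
SELF_HOST_MONTHLY = 40_000

for label, volume in projections.items():
    print(label, f"${monthly_api_cost(volume, RATE):,.0f} per month via API")
threshold = break_even_tokens_per_month(SELF_HOST_MONTHLY, RATE)
print(f"break-even at {threshold / 1e9:.0f}B tokens per month")
```

This version treats self-hosting as a flat monthly budget and uses a single blended token rate. A fuller model would price input and output tokens separately and add upfront investment.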
3. For applications involving sensitive data, have we evaluated managed cloud service vs. SaaS API — not assumed that SaaS API is acceptable?
SaaS API deployments are often appropriate for applications with non-sensitive data or where the provider's standard terms satisfy compliance requirements. They are not automatically appropriate for applications involving PII, PHI, financial data, or proprietary business information. The managed cloud service tier (Azure OpenAI, AWS Bedrock, Google Vertex AI) exists specifically to address this gap. If the deployment involves sensitive data, the evaluation should start at managed service and require explicit justification to move to SaaS API, not the reverse.
4. What is our model update and version management strategy — specifically, how will we handle unplanned model changes from API providers?
SaaS and managed service API providers update models on their own schedules. When OpenAI updates GPT-4o, all applications calling the gpt-4o endpoint receive the new model version, potentially with changed output behavior, without being asked. This has caused regression incidents in production applications that were tuned to specific model behaviors. Version-pinned API endpoints (e.g., gpt-4o-2024-11-20) allow applications to specify an exact model version — use them for production applications and implement a tested migration process before accepting new versions.
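One lightweight guard is a check in the deployment pipeline that flags unpinned model identifiers. The sketch below assumes snapshot identifiers carry a -YYYY-MM-DD date suffix, as in the gpt-4o-2024-11-20 example. Other providers use different naming conventions, and the configuration shown is hypothetical.

```python
import re

# Assumes snapshot identifiers end in a -YYYY-MM-DD date suffix, as in the
# gpt-4o-2024-11-20 example above. Other providers use other conventions.
PINNED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def unpinned_models(model_ids):
    """Return the model identifiers that lack a dated snapshot suffix."""
    return [m for m in model_ids if not PINNED_SUFFIX.search(m)]


# Hypothetical production config mapping services to model identifiers.
config = {
    "support-bot": "gpt-4o-2024-11-20",
    "doc-summarizer": "gpt-4o",
}
floating = unpinned_models(config.values())
if floating:
    print("Unpinned model aliases in production config:", floating)
```

A check like this only catches floating aliases. It does not replace regression testing against a new snapshot before the pin is moved.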
5. If we are evaluating self-hosting, have we scoped the full operational requirement — infrastructure, serving software, security, reliability, and MLOps staffing — not just GPU hardware cost?
Self-hosting is frequently evaluated by comparing GPU compute cost against API pricing, then declaring the GPU option cost-competitive. This calculation omits: MLOps engineering headcount (typically 2–4 engineers for a production self-hosted deployment), serving infrastructure software operations, security responsibilities (endpoint security, model access controls, output monitoring), reliability engineering (SLA, redundancy, failover), and model lifecycle management (updates, security patches, version management). A complete total cost of ownership model including these factors frequently shows that self-hosting breaks even later than the compute-only calculation suggests — often at 18–24 months rather than 6–9 months.
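The effect of the omitted cost lines on the break-even date can be illustrated with a simple payback calculation. Every dollar figure below is a hypothetical placeholder chosen for illustration, and the function is a sketch that ignores discounting, volume growth, and hardware depreciation.

```python
import math


def break_even_months(upfront_usd, self_host_monthly_usd, api_monthly_usd):
    """Months until cumulative API spend reaches cumulative self-hosting spend.

    Returns None when self-hosting costs at least as much per month as the
    API, in which case the upfront investment is never recovered.
    """
    monthly_saving = api_monthly_usd - self_host_monthly_usd
    if monthly_saving <= 0:
        return None
    return math.ceil(upfront_usd / monthly_saving)


# All figures below are hypothetical placeholders for illustration.
API_MONTHLY = 125_000      # current API bill
UPFRONT = 600_000          # hardware purchase and initial build-out
COMPUTE_MONTHLY = 30_000   # power, hosting, or reserved cloud GPUs
STAFF_MONTHLY = 60_000     # three MLOps engineers, fully loaded

compute_only = break_even_months(UPFRONT, COMPUTE_MONTHLY, API_MONTHLY)
full_tco = break_even_months(UPFRONT, COMPUTE_MONTHLY + STAFF_MONTHLY, API_MONTHLY)
print("compute-only:", compute_only, "months; full TCO:", full_tco, "months")
```

With these assumed inputs, adding staffing alone more than doubles the payback period, which is the pattern described above.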
The Stackcurve Take
The LLM deployment model decision is the highest-leverage infrastructure decision most enterprise AI programs make, and it is the one made with the least formal analysis. SaaS APIs are not universally wrong: they are the right starting point for many organizations. Managed cloud services are not unnecessarily expensive: for regulated industries, the compliance premium buys real legal protection. Self-hosting is not premature complexity: for high-volume, sensitive-data, or highly customized applications, it is frequently the correct long-term answer.
The problem is not that any of these models is wrong. The problem is when they are chosen by default rather than by analysis. Deployment model decisions made with clear data — utilization projections, compliance requirements, quality benchmarks, cost models, operational capability assessments — consistently produce better outcomes over the 2–3 year horizon than decisions made by IT philosophy or development team convenience.
The 2026 Stackcurve AI Infrastructure CURVE™ Report covers LLM deployment models, SaaS API and managed service vendor assessments, and a decision framework for selecting the right deployment model for your organization's specific requirements. Download it free →
Stackcurve Advisory Briefs are independent research. No vendor pays for placement, tier assignment, or editorial influence. The CURVE™ methodology is disclosed in full at stackcurve.net/research/methodology.