Data Residency and AI: When Your Model Training Violates Your Data Agreements

The Question

A German manufacturing conglomerate has a mature data governance program. EU customer personal data is stored in AWS eu-central-1 (Frankfurt). Data Processing Agreements with all major cloud providers are in place. Privacy impact assessments are conducted for new data processing activities. The DPO is experienced and well-resourced.

The conglomerate's AI team wants to fine-tune a large language model on internal engineering documentation, customer support transcripts, and product quality records to build an internal knowledge assistant. The ML engineers propose running the training on Amazon SageMaker in us-east-1, because SageMaker instances with the required GPU capacity are available there at lower cost and with shorter queue times than in Frankfurt. The training dataset will be assembled from the eu-central-1 data stores and transferred to us-east-1 for the training run.

The DPO is not consulted.

This scenario is not hypothetical. In its essential structure, it describes how many enterprise AI training projects are executed. The data residency compliance program that the conglomerate has invested in for years does not cover this activity, because the activity did not exist when the program was designed. Copying EU personal data to a US region for an ML training job is a cross-border data transfer under GDPR, no different in legal character from copying that data to a US file server, yet it does not look like a data transfer to the engineering team executing it.

Data residency violations in AI pipelines are rarely intentional. They arise when engineering teams build AI programs without legal and data governance review of the specific data flows, and the regulatory exposure is the same as for an intentional violation.

Why This Matters Now

In January 2025, the Irish Data Protection Commission (DPC) — the lead EU supervisory authority for most major US tech companies — announced an investigation into the cross-border data transfer practices of a major AI company's European operations, specifically examining whether personal data of EU data subjects used in model training was transferred to the United States without adequate safeguards. The investigation was triggered by a complaint from the privacy advocacy organization noyb, which documented specific data flows between the company's European entity and its US training infrastructure.

This investigation follows a pattern of enforcement action that has progressively tightened cross-border transfer requirements. The invalidation of the EU-US Privacy Shield (Schrems II, July 2020) required all US-bound EU personal data transfers to be covered by Standard Contractual Clauses (SCCs) or another valid transfer mechanism. The EU-US Data Privacy Framework (DPF), adopted in July 2023, provided a new adequacy decision. That decision has already been challenged before the EU courts (Latombe v Commission), noyb has said publicly that it intends to bring a challenge of its own, and many EU privacy practitioners consider a ruling invalidating the DPF a realistic outcome.

In practical terms: any enterprise that has EU personal data in its AI training pipeline and is conducting training on US infrastructure is carrying GDPR cross-border transfer risk. If the DPF is invalidated, organizations relying on it as their transfer mechanism will face the same disruption they experienced after Schrems II. The organizations that are building SCCs into their AI training data transfer architecture now — rather than relying on adequacy decisions — are managing this risk more conservatively.

The UK ICO's 2024 guidance on international transfers in the AI context added another layer of complexity. Post-Brexit, the UK's adequacy decision with the EU is separately negotiated and separately vulnerable. UK organizations with EU customer data in their AI training datasets face a two-leg transfer problem: EU data to the UK (requires adequate safeguards), and UK data to the US for training (requires a UK international transfer mechanism).

What the CURVE™ Data Shows

The 2026 Stackcurve Data Security for AI CURVE™ Report assessed vendor capability for data residency governance in AI contexts across three dimensions: data residency awareness (does the product understand where data is, including in AI pipelines?), transfer control (can the product enforce residency constraints on data movement?), and documentation (does the product generate the compliance documentation required by regulators?).

Securiti.ai has the most comprehensive data residency capability in the AI context. Its platform provides geographic data classification, cross-border transfer tracking, and automated DPA management across cloud environments, and its AI data module specifically addresses the training pipeline context.

BigID provides strong data discovery with geographic tagging, enabling organizations to identify which data assets contain data subject to jurisdictional constraints. This is a prerequisite for residency-aware training dataset assembly.

Immuta provides policy enforcement at the data layer, with the ability to create policies that restrict data access based on geographic attributes of the requesting compute environment. This is a technical mechanism for preventing cross-region training data transfers without policy approval.

OneTrust covers the legal documentation side (DPA management, SCCs, transfer impact assessments), with AI-specific modules added in 2025 to address training data transfers specifically.

The CURVE™ analysis found that most organizations deploying AI are using data residency controls designed for traditional data environments and have not extended those controls to cover AI training pipelines. The technical gap (no controls on training job data access by region) is paired with a governance gap (DPO and legal teams not involved in training infrastructure decisions).

The full vendor rankings are in the 2026 Stackcurve Data Security for AI CURVE™ Report — free to download.

The Gap Most Buyers Miss

Data residency compliance for AI is not a single problem — it is four distinct problems that must be addressed at different points in the AI program lifecycle and by different organizational functions.

The training region selection gap. ML engineers make infrastructure decisions based on cost, availability, and performance. GPU capacity availability in the US typically exceeds EU availability for large training runs, and cost differentials can be significant. Data residency requirements are not in the typical ML engineer's decision criteria. The fix requires a process intervention: before training infrastructure is provisioned, a data residency check must occur that identifies the jurisdictional requirements of the training data and ensures the training infrastructure is in a compliant region. This is a governance process, not a technical control — but it can be supported by technical controls that block cross-region data transfers without approved documentation.
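
The pre-provisioning check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a compliance tool: the ALLOWED_REGIONS mapping, the TrainingRequest fields, and the idea that a single approved_transfer_ref (for example an SCC or TIA record ID) clears a request are all hypothetical simplifications. Jurisdictions missing from the mapping are treated as violations, so the check fails closed.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from data jurisdiction to regions where training may run
# without a documented transfer mechanism. Illustrative only, not legal advice.
ALLOWED_REGIONS = {
    "EU": {"eu-central-1", "eu-west-1"},
    "US": {"us-east-1", "us-west-2"},
}


@dataclass(frozen=True)
class TrainingRequest:
    job_name: str
    data_jurisdictions: frozenset          # jurisdictions present in the training data
    training_region: str                   # region the engineers want to train in
    approved_transfer_ref: Optional[str] = None  # assumed ID of an approved SCC/TIA record


def residency_violations(req: TrainingRequest) -> list:
    """Return jurisdictions whose data would leave its allowed regions
    without an approved transfer reference (empty list means no issue found)."""
    if req.approved_transfer_ref:
        # Simplification: any approval reference clears the whole request.
        return []
    return sorted(
        j for j in req.data_jurisdictions
        # Unknown jurisdictions map to an empty set, so they are always flagged.
        if req.training_region not in ALLOWED_REGIONS.get(j, set())
    )


req = TrainingRequest("kb-assistant-ft", frozenset({"EU"}), "us-east-1")
print(residency_violations(req))  # ['EU']
```

In the opening scenario, a gate like this would have stopped the us-east-1 training run until the DPO had attached an approved transfer record.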

The third-party API processing gap. Enterprise use of external LLM services for fine-tuning (Amazon Bedrock, Azure OpenAI, Google Vertex AI) involves transferring training data to the provider's infrastructure. The data residency of that infrastructure must align with the jurisdictional requirements of the training data. AWS, Azure, and Google all offer regional API endpoints and data residency commitments, but these commitments must be actively selected in the service configuration and documented in the relevant DPAs. Default configurations should not be assumed to preserve residency.
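
One way to make "actively selected" enforceable is to reject any fine-tuning job configuration that does not pin an approved region. The sketch below assumes a hypothetical config dictionary with a "region" key; it is not any provider's actual schema, and the EU_REGIONS allowlist is illustrative.

```python
from typing import Optional

# Illustrative allowlist of regions approved for EU personal data.
EU_REGIONS = {"eu-central-1", "eu-west-1", "eu-north-1"}


def check_finetune_config(config: dict, allowed_regions: set) -> Optional[str]:
    """Return a description of the problem, or None if the config pins an allowed region.

    `config` is a hypothetical job definition; the "region" key is an assumption.
    """
    region = config.get("region")
    if not region:
        # A missing region means the provider's default applies, which may not be regional.
        return "no explicit region: provider default applies"
    if region not in allowed_regions:
        return f"region {region!r} is outside the allowed set"
    return None


print(check_finetune_config({"model": "base-llm"}, EU_REGIONS))
print(check_finetune_config({"model": "base-llm", "region": "eu-central-1"}, EU_REGIONS))
```

The important design choice is that an absent region is an error rather than a pass, so a default endpoint can never slip through unreviewed.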

The vector database location gap. A RAG system's vector database contains embeddings: mathematical representations derived from source documents. Published research on embedding inversion indicates that, in some cases, substantial parts of the source text can be reconstructed from its embedding. The data residency requirements that apply to the source documents should therefore be assumed to apply to their vector representations as well. An organization that correctly stores EU customer correspondence in Frankfurt but embeds and indexes that correspondence in a US-hosted vector database has a data residency violation in its RAG infrastructure.
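
A simple inventory check can surface this gap. The sketch below pairs each source data store with the vector index built from it and flags indexes hosted in a different jurisdiction. The Store type and the REGION_JURISDICTION map are hypothetical; the map is explicit rather than prefix-based because region names are not a reliable guide (in AWS naming, eu-west-2 is London, which is UK rather than EU). Regions missing from the map are flagged so that unknown locations are reviewed rather than ignored.

```python
from dataclasses import dataclass

# Explicit, illustrative region-to-jurisdiction map. Extend it for real use.
REGION_JURISDICTION = {
    "eu-central-1": "EU",
    "eu-west-1": "EU",
    "eu-west-2": "UK",
    "us-east-1": "US",
}


@dataclass(frozen=True)
class Store:
    name: str
    region: str


def misplaced_indexes(pairs):
    """Given (source store, vector index) pairs, return the names of indexes
    whose jurisdiction differs from the source's or cannot be determined."""
    flagged = []
    for source, index in pairs:
        src_j = REGION_JURISDICTION.get(source.region)
        idx_j = REGION_JURISDICTION.get(index.region)
        if src_j is None or idx_j is None or src_j != idx_j:
            flagged.append(index.name)
    return flagged


pairs = [
    (Store("support-tickets", "eu-central-1"), Store("support-tickets-vec", "us-east-1")),
    (Store("eng-docs", "eu-central-1"), Store("eng-docs-vec", "eu-west-1")),
]
print(misplaced_indexes(pairs))  # ['support-tickets-vec']
```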

The training dataset composition gap. Enterprise AI training datasets are typically assembled from multiple internal data sources — CRM records, support tickets, product documentation, financial records, HR data — each potentially subject to different jurisdictional requirements. A training dataset assembled without a data residency assessment may combine EU personal data (GDPR), US employee data (various state privacy laws), healthcare data (HIPAA), and financial data (state-level requirements) into a single dataset with conflicting cross-border restrictions. Training that dataset on a single infrastructure region cannot simultaneously comply with all applicable requirements. The solution — segmented training datasets with separate infrastructure for segments subject to different jurisdictional requirements — is architecturally more complex but the only compliant approach for multinational enterprises.
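
The segmentation approach can be illustrated with a short sketch. It assumes each record already carries a "regime" tag assigned by an upstream classification step (the tag name and values are hypothetical); records without a tag are routed to an UNCLASSIFIED segment so they are held for review instead of being silently mixed into a training set.

```python
from collections import defaultdict


def segment_by_regime(records):
    """Group records by their 'regime' tag so each segment can be trained
    on infrastructure that satisfies that regime's requirements."""
    segments = defaultdict(list)
    for rec in records:
        # Missing or empty tags fall into a holding segment for manual review.
        segments[rec.get("regime") or "UNCLASSIFIED"].append(rec)
    return dict(segments)


records = [
    {"id": 1, "source": "crm", "regime": "GDPR"},
    {"id": 2, "source": "clinic-notes", "regime": "HIPAA"},
    {"id": 3, "source": "support", "regime": "GDPR"},
    {"id": 4, "source": "wiki"},
]
segments = segment_by_regime(records)
print({regime: [r["id"] for r in recs] for regime, recs in segments.items()})
# {'GDPR': [1, 3], 'HIPAA': [2], 'UNCLASSIFIED': [4]}
```

Each resulting segment can then be matched to a training region through whatever residency check the organization uses.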

Questions Your Buying Team Should Be Asking

1. For your AI training projects, what is the process by which the DPO or legal team reviews training infrastructure decisions — specifically, who approves the selection of training regions for workloads that involve personal data from EU or other restricted jurisdictions?

The absence of a defined process is the most common finding. In most enterprises, ML infrastructure decisions are made by data science and engineering teams without a legal review gate. Establishing a lightweight review gate — a data residency checklist that must be completed before training is provisioned — can close this gap without creating unacceptable friction in the AI development process.

2. For every LLM API fine-tuning service your organization uses, have you reviewed the data residency commitments in the service terms — and are you using regional endpoints that comply with your data governance obligations?

AWS, Azure, and Google provide regional endpoints for their LLM fine-tuning services, but using them requires explicit configuration. Default API endpoints are typically US-based or globally load-balanced. Ask your cloud vendors to confirm in writing the data residency of the fine-tuning infrastructure for each specific service, and ensure that configuration is reflected in your DPAs.

3. Are your RAG vector databases hosted in the same jurisdictional environment as the source documents they index — and does your legal team have documentation of the data residency position for vector embeddings derived from personal data?

This question often produces uncertainty on both the technical and legal sides. ML engineers may not know where the vector database is hosted. Legal teams may not have considered whether vector embeddings require the same residency treatment as source documents. The EDPB's guidance on embeddings (published in draft form in 2025) suggests that embeddings derived from personal data should be treated as personal data for residency purposes, though final guidance is pending.

4. Has your AI program conducted a Transfer Impact Assessment (TIA) for training data transfers to third countries — and is that TIA documented and available for regulatory review?

Transfer Impact Assessments are a post-Schrems II requirement for SCC-based transfers, and they are inconsistently implemented for AI training pipelines. The TIA must assess the legal framework of the receiving country, the likelihood of government access to transferred data, and the adequacy of supplementary technical measures. For AI training data transfers to US infrastructure, the TIA must address Executive Order 14086 and the CLOUD Act as components of the US legal framework.

5. If the EU-US Data Privacy Framework is invalidated, what is your contingency plan for AI training data transfers that rely on the DPF as their transfer mechanism?

This is a scenario planning question, but it has operational urgency. Litigation challenging the DPF is ongoing before the EU courts. Organizations whose AI training data transfers rely on the DPF rather than SCCs are one court ruling away from a compliance emergency. SCCs are the more durable transfer mechanism. Organizations with material AI training data flows between the EU and US should maintain SCCs as a backup to the DPF regardless of the current adequacy decision.
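
Contingency planning starts with knowing which flows would be stranded. A minimal sketch, assuming the organization keeps a register of transfer flows with the mechanisms each relies on (the TransferFlow type and the mechanism labels are hypothetical), is to list every flow whose only mechanism is the DPF:

```python
from dataclasses import dataclass, field


@dataclass
class TransferFlow:
    name: str
    source: str
    destination: str
    mechanisms: set = field(default_factory=set)  # e.g. {"DPF", "SCC"}


def dpf_only(flows):
    """Return the names of flows that rely on the DPF with no other mechanism in place."""
    return [f.name for f in flows if f.mechanisms == {"DPF"}]


flows = [
    TransferFlow("support-transcripts-ft", "eu-central-1", "us-east-1", {"DPF"}),
    TransferFlow("eng-docs-ft", "eu-central-1", "us-east-1", {"DPF", "SCC"}),
]
print(dpf_only(flows))  # ['support-transcripts-ft']
```

Flows on that list are the ones that would need SCCs and a TIA put in place, or a move to in-region training, if the adequacy decision fell.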

The Stackcurve Take

Data residency compliance for AI is a problem of organizational process and governance more than a problem of technical capability. The technical controls — residency-aware data access policies, regional API endpoints, geographic data classification — exist and are available from mature vendors. The governance gap is the absence of processes that bring legal, privacy, and compliance functions into AI infrastructure decisions at the right moments.

The organizations that are managing this risk well share a common pattern: the DPO or privacy team has a defined touchpoint in the AI project lifecycle — typically at the training data assembly stage and at the infrastructure selection stage — where data residency requirements are assessed against the planned architecture. This does not require a full privacy review for every training run; it requires a lightweight checklist that surfaces residency issues before they become compliance problems.

The regulatory direction is unambiguous. EU supervisory authorities are actively investigating AI training data transfers. The UK ICO has AI as an examination priority. The FTC has engaged with AI data governance questions in enforcement actions. The question for enterprise AI programs is not whether regulatory scrutiny of AI training data residency will arrive — it is whether the organization's governance practices will be defensible when it does.

The 2026 Stackcurve Data Security for AI CURVE™ Report covers data residency governance for AI training and inference, with vendor assessments for Securiti.ai, BigID, Immuta, OneTrust, and additional platforms, including specific analysis of regional API capabilities from AWS, Azure, and Google Cloud.

Stackcurve Advisory Briefs are independent research. No vendor pays for placement, tier assignment, or editorial influence. The CURVE™ methodology is disclosed in full at stackcurve.net/research/methodology.