Inference-Time Data Exposure: What Happens to Your Data When the Model Runs

The Question

A law firm deploys an internal AI assistant for matter research. Associates submit natural language queries — "summarize the precedents for punitive damages in product liability cases in the Southern District of New York" — and the system retrieves relevant documents from the firm's internal knowledge base and generates a synthesized response. The system has been running in production for four months. The managing partner asks the CISO: what happens to client data when an associate submits a query?

The honest answer requires tracing a data flow that most organizations have not mapped. The query may contain client names, case details, or confidential legal strategy. It is submitted to a hosted inference endpoint — either an external API or an internal deployment. For the RAG component, the query is converted to an embedding and used to search a vector database that contains embedded representations of the firm's document corpus, including client-privileged materials. The retrieval system returns the most semantically similar document chunks, which are assembled into a context window. The LLM generates a response incorporating both the query and the retrieved context. The response is delivered to the associate and may be copied into emails, memos, or client communications.

At each step, data that is subject to attorney-client privilege, work product protection, and professional conduct rules is flowing through a system that was not designed with those data sensitivity requirements in mind.

Inference-time data exposure is not a theoretical risk — it is a data flow that occurs with every user query, and enterprises deploying AI systems without a data flow map of their inference pipeline are operating with unknown data exposure.
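The flow traced above can be written down as a small data flow map. The sketch below is illustrative only: the step names follow the description in this section, while the `FlowStep` type, the `PIPELINE` list, and the boundary flags are assumptions for a hypothetical deployment in which embedding and generation call an external API and the vector database is internal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowStep:
    name: str
    data_handled: tuple[str, ...]
    leaves_firm_boundary: bool  # assumption: set per deployment, not a fixed fact


# Hypothetical deployment: external embedding and generation APIs, internal vector store.
PIPELINE = [
    FlowStep("query submission", ("query text",), False),
    FlowStep("query embedding", ("query text",), True),
    FlowStep("vector search", ("query embedding", "document embeddings"), False),
    FlowStep("context assembly", ("query text", "retrieved chunks"), False),
    FlowStep("generation", ("query text", "retrieved chunks"), True),
    FlowStep("response delivery", ("model output",), False),
]


def external_exposures(pipeline):
    """Return (step name, data item) pairs that cross the firm boundary."""
    return [
        (step.name, item)
        for step in pipeline
        if step.leaves_firm_boundary
        for item in step.data_handled
    ]
```

Even a table this small makes the managing partner's question answerable: in this assumed deployment, both the query text and the retrieved privileged chunks leave the firm at the generation step.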

Why This Matters Now

In October 2024, a major financial services firm disclosed in an SEC filing that an internal AI assistant deployed for analyst research had surfaced material non-public information (MNPI) to analysts who did not have authorized access to that information. The system used a RAG architecture that retrieved from a document corpus including earnings preview materials that were subject to information barrier controls. The retrieval system did not enforce those information barriers — it retrieved based on semantic relevance, not authorization. The disclosure triggered an SEC inquiry into the firm's information barrier controls.

This incident is the first publicly documented case of an AI system's RAG retrieval mechanism causing an information barrier violation in a regulated financial institution. It will not be the last. The architectural pattern — a RAG system that retrieves from a document corpus with mixed sensitivity levels, without authorization-aware retrieval — is the standard pattern for enterprise RAG deployments.

The regulatory context amplifies the risk. The SEC's 2024 interpretive guidance on AI use in broker-dealers specified that information barrier obligations apply to AI systems in the same manner they apply to human analysts — an AI assistant that surfaces MNPI across an information barrier creates the same regulatory violation as an analyst doing the same thing. FINRA followed with examination priorities that specifically called out AI system information barrier controls as a 2025 examination focus area.

The EU AI Act's requirements for high-risk AI systems include logging of system inputs and outputs sufficient to enable post-hoc review of decisions, a requirement that presupposes inference-time logging infrastructure that most enterprise AI deployments do not have. The GDPR right to erasure creates additional inference-time complexity: if a data subject exercises erasure rights for data that has been embedded in a RAG vector database, the organization must be able to identify and remove those embeddings. That capability has to be designed into the architecture; it cannot be bolted on through retroactive remediation.
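One way to design for erasure is to record, at indexing time, which data subjects each embedded chunk relates to, so that the embeddings can be located later. The following is a minimal in-memory sketch under that assumption. `EmbeddingRecord` and `ErasableIndex` are hypothetical names, not the API of any vector database, and the sketch assumes subject identifiers were extracted when the documents were ingested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingRecord:
    chunk_id: str
    vector: tuple[float, ...]
    subject_ids: frozenset[str]  # data subjects whose personal data appears in the chunk


class ErasableIndex:
    """Toy vector index that keeps a reverse map from data subject to chunks."""

    def __init__(self):
        self._records = {}     # chunk_id -> EmbeddingRecord
        self._by_subject = {}  # subject_id -> set of chunk_ids

    def add(self, record):
        self._records[record.chunk_id] = record
        for subject_id in record.subject_ids:
            self._by_subject.setdefault(subject_id, set()).add(record.chunk_id)

    def erase_subject(self, subject_id):
        """Remove every chunk linked to the subject; return the removed chunk ids."""
        chunk_ids = self._by_subject.pop(subject_id, set())
        for chunk_id in chunk_ids:
            record = self._records.pop(chunk_id)
            # Keep the reverse map consistent for other subjects in the same chunk.
            for other in record.subject_ids:
                if other in self._by_subject:
                    self._by_subject[other].discard(chunk_id)
        return sorted(chunk_ids)

    def __len__(self):
        return len(self._records)
```

This sketch deletes the whole chunk even when it also mentions other people. Whether to delete, or to re-embed a redacted version instead, is a policy decision the sketch does not settle.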

What the CURVE™ Data Shows

The 2026 Stackcurve Data Security for AI CURVE™ Report evaluated inference-time security controls across prompt filtering, retrieval access control, cross-user isolation, third-party API governance, and output monitoring.

Nightfall AI leads in semantic prompt and output filtering, with real-time classification of LLM inputs and outputs using transformer-based models. Their API integrations cover OpenAI, Anthropic, Google Gemini, and the major LLM frameworks. Knostic is the specialist in authorization-aware retrieval — their product enforces access controls at the retrieval layer of RAG systems, ensuring that retrieved context respects the query user's document authorization level. This is the specific capability that would have prevented the financial services MNPI incident. Lasso Security provides LLM application firewall capabilities including prompt injection detection and session isolation monitoring. Securiti.ai's AI Data Command Center includes inference governance as a module, with data flow visibility across inference pipelines. Aporia provides ML monitoring with inference-time anomaly detection, flagging unusual query patterns that may indicate extraction attacks or boundary testing.

The CURVE™ assessment found that authorization-aware retrieval — the most critical inference-time control for enterprises with sensitive document corpora — is provided by the fewest vendors and understood by the fewest security buyers. Organizations deploying RAG systems on sensitive document corpora without evaluating Knostic or equivalent capability are carrying specific, documentable regulatory risk.

The full vendor rankings are in the 2026 Stackcurve Data Security for AI CURVE™ Report — free to download.

The Gap Most Buyers Miss

Inference-time security is the most actively discussed domain in AI security, but discussion has concentrated on prompt injection attacks — adversarial inputs designed to override system instructions — at the expense of the data exposure risks that occur through normal, non-adversarial usage.

The query data exposure gap. Every enterprise AI query is a data transmission event. Associates, analysts, engineers, and business users include real business context in their queries — customer names, financial figures, strategic plans, personnel details. If the inference endpoint is an external API, this data is transmitted to a third-party cloud environment. Even with enterprise data processing agreements in place, the organization has a compliance obligation to understand what personal data is being transmitted and on what legal basis. Most organizations have not conducted this analysis. The CURVE™ assessment found that fewer than 20% of enterprise AI programs include prompt DLP that classifies query content for sensitive data categories before transmission to inference endpoints.

The cross-user context contamination gap. LLM inference systems that serve multiple users from a shared application layer must implement session isolation: the context window for User A's session must not persist into User B's session. Failures in session isolation, which have been documented in production systems under load conditions, create a category of data exposure that has no analogue in traditional applications. An enterprise AI system that briefly serves User B a response containing context from User A's query has caused a data breach, potentially involving sensitive business information, through a failure mode that most security teams are not monitoring for.

The indirect prompt injection gap. In RAG systems, adversarial actors who can influence the content of documents in the retrieval corpus can inject instructions that the LLM will execute when those documents are retrieved. An attacker who can place a poisoned document in the internal knowledge base — through a phishing-delivered file that gets indexed, or through manipulation of a document that the system automatically ingests — can execute instructions in the LLM's context without direct user interaction. Indirect prompt injection is a documented attack vector for RAG systems and one that most enterprise AI deployments have not specifically addressed.
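One partial mitigation is to screen retrieved chunks for instruction-like phrasing before they are assembled into the context window. The sketch below is a keyword heuristic, not a complete defense: the patterns in `SUSPICIOUS_PATTERNS` are illustrative assumptions, a determined attacker can phrase around them, and `screen_chunks` is a hypothetical helper rather than part of any product named in this brief.

```python
import re

# Illustrative patterns only; real poisoned documents will vary widely.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the )?(system|previous) (prompt|instructions)", re.I),
    re.compile(r"you are now\b", re.I),
]


def screen_chunks(chunks):
    """Split retrieved chunks into (clean, quarantined) lists."""
    clean, quarantined = [], []
    for chunk in chunks:
        if any(pattern.search(chunk) for pattern in SUSPICIOUS_PATTERNS):
            quarantined.append(chunk)
        else:
            clean.append(chunk)
    return clean, quarantined
```

Quarantined chunks are worth logging and reviewing, since a hit may indicate a poisoned document in the corpus rather than a one-off false positive.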

The third-party API data processing gap. OpenAI, Anthropic, Google, and the major LLM API providers have enterprise data processing agreements that include commitments not to use API-submitted data for model training and to maintain data confidentiality. But the existence of these agreements does not mean organizations have signed them. The CURVE™ assessment found a significant proportion of enterprise AI deployments using third-party LLM APIs under developer terms of service rather than enterprise agreements, meaning the enterprise data processing protections do not apply. This gap can be closed in hours, yet it remains surprisingly common.

Questions Your Buying Team Should Be Asking

1. For every external LLM API in your AI stack, is there a signed enterprise data processing agreement in place — and has legal reviewed the DPA for compliance with your data governance obligations?

This is a governance question, not a technical one, but it has immediate technical security implications. The enterprise DPA governs what the API provider can do with data submitted in queries. Without it, you are operating under developer terms that provide minimal data protection. A complete inventory of LLM API integrations and their contract status should be standard practice for any enterprise AI program.

2. In your RAG deployments, does retrieval respect the query user's document access authorization — and what is the technical mechanism for enforcement?

The correct answer to this question describes a specific technical implementation: either an authorization-aware retrieval layer (Knostic or equivalent), a filtered retrieval approach where document metadata includes access control attributes that are evaluated at query time, or a segmented vector database architecture where users only query against corpora they are authorized to access. Any answer that does not include a specific technical mechanism is not a satisfactory answer.
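The filtered retrieval option can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Knostic or any vector database implements it: `Chunk`, `allowed_groups`, and `retrieve` are hypothetical names, and access control is modeled as simple group membership stored in chunk metadata.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    vector: tuple[float, ...]
    allowed_groups: frozenset[str]  # access control attribute stored with the chunk


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, user_groups, corpus, k=3):
    """Return the top-k chunks the user is authorized to see.

    The authorization filter runs before ranking, so an unauthorized chunk
    can never reach the context window, however relevant it is.
    """
    permitted = [c for c in corpus if c.allowed_groups & user_groups]
    ranked = sorted(permitted, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:k]
```

Filtering before ranking is the important design choice here. A system that ranks first and filters afterward still touches restricted content, and can leak its existence through a shortened result list.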

3. Is prompt content being classified for sensitive data categories before submission to inference endpoints — and if not, what is the mechanism for identifying when sensitive data is being submitted to external APIs?

Semantic prompt classification, applied before query submission to external inference endpoints, is the control that catches the Samsung-category disclosure, in which employees paste confidential material such as source code into an external AI service. It requires integration into the AI application layer, not the network layer, because it must understand query content, not just inspect traffic. Ask vendors for a demonstration on realistic enterprise query samples.
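To show where such a control sits in the application layer, the sketch below classifies and redacts a prompt before it would be sent to an external endpoint. It deliberately swaps in regular expressions for the semantic, model-based classifier described above, so it only catches fixed formats. The detector names and the `CM-` matter number format are assumptions made for illustration.

```python
import re

# Regex stand-ins for a semantic classifier; formats are illustrative.
DETECTORS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "client_matter_id": re.compile(r"\bCM-\d{6}\b"),  # hypothetical internal format
}


def classify_prompt(prompt):
    """Return the set of sensitive data categories detected in the prompt."""
    return {name for name, pattern in DETECTORS.items() if pattern.search(prompt)}


def redact_prompt(prompt):
    """Replace each detected value with a category placeholder."""
    for name, pattern in DETECTORS.items():
        prompt = pattern.sub(f"[{name.upper()}]", prompt)
    return prompt
```

An application could block the request, redact it, or route it to an internal model depending on which categories `classify_prompt` returns. That routing policy is outside the scope of this sketch.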

4. What session isolation mechanisms are in place for multi-user AI deployments — and have those mechanisms been load-tested to verify they do not fail under concurrent usage conditions?

Session isolation failures are typically not discoverable through code review — they surface under production load conditions. Load testing specifically designed to probe context leakage between concurrent user sessions should be part of the security validation for any enterprise AI deployment serving multiple users.
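A leakage probe of this kind can be sketched with canary tokens: each simulated user submits a unique secret, and any response containing another user's secret is a leak. The harness below is a simplified illustration. `find_leaks` and both handlers are hypothetical, and `leaky_handler` is a deliberately broken stand-in for an application that shares state across sessions.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor


def find_leaks(handler, n_users=20):
    """Call handler(session_id, prompt) once per simulated user, concurrently.

    Returns (victim_session, canary_owner) pairs for every response that
    contains a canary belonging to a different session.
    """
    canaries = {f"user-{i}": f"CANARY-{uuid.uuid4().hex}" for i in range(n_users)}

    def one_call(session_id):
        prompt = f"my secret is {canaries[session_id]}"
        return session_id, handler(session_id, prompt)

    with ThreadPoolExecutor(max_workers=n_users) as pool:
        results = list(pool.map(one_call, canaries))

    return [
        (session_id, owner)
        for session_id, response in results
        for owner, canary in canaries.items()
        if owner != session_id and canary in response
    ]


def isolated_handler(session_id, prompt):
    """Stateless handler: nothing survives between calls."""
    return f"echo: {prompt}"


_shared = {"last": ""}
_lock = threading.Lock()


def leaky_handler(session_id, prompt):
    """Broken on purpose: prepends the previous caller's prompt to the response."""
    with _lock:
        previous, _shared["last"] = _shared["last"], prompt
    return f"echo: {previous} {prompt}"
```

Against a real deployment the handler would wrap the application's HTTP endpoint, and the test would run at production-like concurrency for long enough to exercise caches and connection pools.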

5. What inference logging exists for your AI systems — specifically, are query inputs and model outputs logged with sufficient fidelity to support post-hoc audit of AI-assisted decisions?

This question addresses both security monitoring (detecting anomalous query patterns) and regulatory compliance (the EU AI Act's logging requirements for high-risk systems). The logging infrastructure should capture the full query, the retrieved context (for RAG systems), and the model output, with timestamps and user attribution. This is more logging than most enterprise AI deployments currently implement.
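A log record with those fields might look like the following sketch. The field names and the JSON Lines serialization are assumptions chosen for illustration; the point is that the query, the retrieved context, and the output are captured together under one event with user attribution.

```python
import json
import uuid
from datetime import datetime, timezone


def build_inference_record(user_id, query, retrieved_chunks, output):
    """Build one audit record; retrieved_chunks is a list of (chunk_id, text) pairs."""
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieved_context": [
            {"chunk_id": chunk_id, "text": text} for chunk_id, text in retrieved_chunks
        ],
        "output": output,
    }


def to_json_line(record):
    """Serialize a record as a single JSON line for an append-only log."""
    return json.dumps(record, ensure_ascii=False)
```

A log like this contains the same sensitive material as the queries and documents themselves, so it needs access controls and retention rules at least as strict as the source corpus.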

The Stackcurve Take

Inference-time data exposure is the most operationally immediate AI data security risk because it occurs with every production query. The risks are not speculative: the financial services MNPI incident, the documented session isolation failures, and the Samsung-category disclosures are all inference-time events that occurred in production systems operated by sophisticated organizations.

The remediation architecture for inference-time security has three components. First, authorization-aware retrieval that enforces document access controls at the retrieval layer. Second, semantic prompt and output classification that identifies sensitive data in queries and responses. Third, inference logging that captures the full data flow — query, context, output — with user attribution and timestamps.

None of these are experimental capabilities. Knostic provides the first. Nightfall AI provides the second. The logging infrastructure can be built on standard observability platforms with LLM-specific extensions. The gap is not the availability of the tools — it is the organizational prioritization required to deploy them before the incident that makes them obviously necessary in retrospect.

The organizations that get this right are not the ones with the largest AI security budgets. They are the ones that mapped their inference data flows before deploying production AI systems and made architectural decisions — authorization-aware retrieval, prompt classification, session isolation, inference logging — before those decisions became incident response actions.

The 2026 Stackcurve Data Security for AI CURVE™ Report covers inference-time security controls in depth, with specific vendor assessments for Nightfall AI, Knostic, Lasso Security, Securiti.ai, and Aporia across prompt filtering, retrieval access control, and output monitoring capabilities. Download it free.

Stackcurve Advisory Briefs are independent research. No vendor pays for placement, tier assignment, or editorial influence. The CURVE™ methodology is disclosed in full at stackcurve.net/research/methodology.