The Google Remy Leak: AI Agent Stack Risk in 2026

Why Remy Is Different From Every Other AI Agent Leak

Three stories in two weeks. The New Stack has been running Google "Remy" coverage on a near-daily cadence through May and into early June: the initial disclosure, the OAuth scope discovery, and the front-page piece on June 3 asking why enterprise architects are rethinking their AI stack in response. The leak surface is the Gemini Workspace agent stack — Gmail, Drive, Calendar, and the agent invocation paths that connect them.

If you missed the earlier coverage, here is the short version: an internal research alias ("Remy") surfaced a cluster of related issues in how the Gemini agent stack mediates access to user data inside Google Workspace. The issues are not one bug. They are a pattern — the same shape of weakness showing up across three different layers of the same agent stack.

That pattern is the story. A single bug gets a patch and a CVE number. A pattern forces you to rethink the architecture.

What the Remy cluster actually shows

The three reported issues share an underlying property: they all exploit the way the Gemini agent stack handles shared infrastructure. Specifically:

OAuth scope over-provisioning. When the agent is granted access to read a user's email, that OAuth scope is silently inherited by every downstream tool call the agent makes — including tools that have no business reason to read the user's email. A "summarize this PDF" tool ends up with read access to the inbox by default.
Calendar side-channel. A user creates a calendar event containing sensitive data (a draft M&A memo, an unreleased product name, a customer's SSN). The agent, with calendar read access, can read that data through a path that is not instrumented or audited, because the calendar API itself is not on the watch list.
Draft-state recovery. When the agent writes a draft email on the user's behalf and the user does not send it, the draft is persisted to the user's Drafts folder. Other tools that have read access to the Drafts folder — and the agent framework's own context-recovery mechanism for the next session — can read the draft content. Sensitive data that "never left the user's inbox" actually persists in recoverable form across sessions and tools.

None of these is novel in isolation. OAuth scope creep has been a known issue since 2014. Calendar side-channels have appeared in academic literature. Draft-state recovery was on the OWASP LLM Top 10 list in early 2025. What is new is that all three are present, simultaneously, in a single production agent stack shipped by the company with the largest security budget in the industry.

If Google cannot build an agent stack that does not leak through these vectors, the question every enterprise architect should be asking is: can we?

The Build-vs-Buy Calculus Just Changed

For most of 2024 and 2025, the build-vs-buy decision for enterprise AI agents came down to four questions:

Capability gap: Can an off-the-shelf agent do what we need, or do we need a custom tool integration?
Data residency: Can the vendor host our data in a region that meets our compliance requirements?
Cost predictability: Will the per-seat or per-call pricing model stay stable as we scale?
Vendor lock-in: How painful is it to switch if the vendor changes pricing, terms, or product direction?

The Remy cluster adds a fifth question, and it sits upstream of the others: Can the vendor's agent stack contain its own blast radius?

This is not a question about whether Gemini is good. It is a question about whether the architecture of the agent stack — any agent stack, including one you might build yourself — can prevent an issue in one tool from cascading into a leak of data the user never explicitly granted the failing tool access to.

Why "trust the vendor" is no longer a sufficient answer

The pre-Remy instinct for most enterprise architects was: "If we buy from a major vendor, we get enterprise-grade security." That instinct was already strained by incidents like the 2023 Okta support system compromise and the 2024 Snowflake customer data breaches. The Remy cluster breaks it for the agent category specifically, because the agent category is qualitatively different from SaaS.

SaaS applications have a defined attack surface: the API surface and the data store. You can audit them. You can run pen tests. You can sign DPAs that cover known data flows. Agentic AI does not have a defined attack surface in the same sense. The attack surface is the reasoning chain, which is dynamic, depends on inputs the user controls, and can call tools the user did not anticipate. You cannot pen-test a reasoning chain the way you can pen-test a REST API.

What you can do is architect the agent stack so that the blast radius of any single tool failure is contained. That is the architectural question Remy has moved to the front of the room.

What "Blast-Radius Reduction" Actually Means

The phrase "blast radius" comes from incident response. In the context of agentic AI, it refers to the maximum amount of damage a single tool call, prompt injection, or model failure can cause. A well-architected agent stack minimizes blast radius; a poorly architected one amplifies it.

Concretely, blast-radius reduction means three things: tool-broker isolation, per-tool audit logging, and credential scoping. Each is independent. None of them is sufficient alone. The Remy cluster shows what happens when all three are missing.

Tool-broker isolation

A tool broker is a middleware layer that sits between the agent's reasoning loop and the actual tool implementations. Instead of the agent calling a tool directly, it calls the broker; the broker authenticates, scopes, and forwards the call to the actual tool. The broker is also the layer that emits audit logs and enforces per-tool limits.

The point of broker isolation is that the agent's reasoning loop should never have the credentials or network access to call a tool directly. If the agent's context window is hijacked by a prompt injection attack, the attacker can ask the agent to call a tool, but the tool call still has to go through the broker. The broker can reject it, log it, scope it, or require human approval before forwarding.

The Gemini agent stack, as the Remy reports describe it, fails at this layer. The agent's reasoning loop has the OAuth scope to call any of the integrated tools, and the OAuth token is inherited by the entire agent framework, not delegated per-tool. A "summarize this PDF" call ends up with the same effective access as a "send this email" call.
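
To make the broker idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `ToolBroker` class, the tool names, and the session IDs are illustrations, not part of any vendor's framework. It shows only the control flow: the reasoning loop hands the broker a tool name and parameters, and the broker decides whether to deny, hold for approval, or forward.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


class BrokerDenied(Exception):
    """Raised when the broker refuses to forward a tool call."""


@dataclass
class ToolBroker:
    # Hypothetical registry: tool name -> callable that does the real work.
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    # Session ID -> tool names that session may invoke.
    allowed: Dict[str, Set[str]] = field(default_factory=dict)
    # Tools that must be approved by a human before forwarding.
    needs_approval: Set[str] = field(default_factory=set)
    # In-memory log for the sketch; a real broker would write elsewhere.
    log: List[dict] = field(default_factory=list)

    def call(self, session_id: str, tool: str, approved: bool = False, **params: Any) -> Any:
        entry = {"session": session_id, "tool": tool, "params": params}
        permitted = self.allowed.get(session_id, set())
        if tool not in self.tools or tool not in permitted:
            entry["result"] = "denied"
            self.log.append(entry)
            raise BrokerDenied(f"{tool} is not permitted for session {session_id}")
        if tool in self.needs_approval and not approved:
            entry["result"] = "pending_approval"
            self.log.append(entry)
            raise BrokerDenied(f"{tool} requires human approval")
        result = self.tools[tool](**params)  # only the broker touches the tool
        entry["result"] = "ok"
        self.log.append(entry)
        return result


# Stand-in tools; real ones would hold their own scoped credentials.
broker = ToolBroker(
    tools={
        "summarize_pdf": lambda text: text[:20],
        "send_email": lambda to, body: f"sent to {to}",
    },
    allowed={"s1": {"summarize_pdf", "send_email"}, "s2": {"summarize_pdf"}},
    needs_approval={"send_email"},
)
```

In this sketch a hijacked session `s2` can still ask for `send_email`, but the request is denied and logged rather than executed. The design choice worth copying is that denial is the default: a tool absent from a session's allow-list is never forwarded.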

Per-tool audit logging

Every tool invocation should produce an immutable audit log entry containing: the agent session ID, the user ID, the tool name, the full input parameters (or a redacted version if the parameters are sensitive), the timestamp, the result code, and the duration. The audit log should be written to a separate system from the agent itself — ideally a write-once store like S3 with object lock, or a dedicated SIEM.
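
As a rough illustration of that record, the sketch below models one audit entry with the fields listed above. The hash chaining is an added assumption, not something the Remy coverage describes: each entry stores the digest of the previous one, so an edit to an earlier entry becomes detectable. It does not replace a write-once store, and all names here (`AuditEntry`, `append_entry`, `verify_chain`) are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class AuditEntry:
    session_id: str
    user_id: str
    tool: str
    params: Dict[str, Any]  # redact sensitive values before logging
    timestamp: float
    result_code: str
    duration_ms: float
    prev_hash: str  # digest of the previous entry, "" for the first

    def digest(self) -> str:
        # Stable serialization so the same entry always hashes the same way.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_entry(chain: List[AuditEntry], **fields: Any) -> AuditEntry:
    """Append a new entry linked to the current tail of the chain."""
    prev = chain[-1].digest() if chain else ""
    entry = AuditEntry(prev_hash=prev, **fields)
    chain.append(entry)
    return entry


def verify_chain(chain: List[AuditEntry]) -> bool:
    """Check that every entry points at the digest of its predecessor."""
    prev = ""
    for entry in chain:
        if entry.prev_hash != prev:
            return False
        prev = entry.digest()
    return True
```

One limitation of this sketch: the chain alone cannot detect tampering with the most recent entry, because nothing points at it yet. Storing the latest digest in a separate system is one way to close that gap.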

Audit logs are not just for forensics. They are for runtime policy enforcement. If the broker sees that a "summarize PDF" tool is suddenly calling a "send email" tool with the same OAuth session, that is a signal — even if the call succeeded — that something is wrong. The broker should be able to halt the agent's session on that signal.
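
A sketch of what that runtime signal could look like, assuming the broker can see the ordered list of tool names called in a session. The `ALLOWED_NEXT` policy table and the tool names are invented for illustration; a real policy would be derived from your own tools and threat model.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical policy: for each tool, the tools allowed to follow it
# within the same session. Anything not listed is treated as a violation.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "summarize_pdf": {"summarize_pdf"},
    "read_email": {"read_email", "draft_email"},
    "draft_email": {"draft_email", "send_email"},
}


def first_violation(calls: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return the first (previous, current) pair the policy does not allow."""
    prev: Optional[str] = None
    for tool in calls:
        if prev is not None and tool not in ALLOWED_NEXT.get(prev, set()):
            return (prev, tool)
        prev = tool
    return None
```

A broker could run this check before forwarding each call and halt the session as soon as it returns a pair, for example when `summarize_pdf` is followed by `send_email`.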

Per-tool audit logging is also what makes the build-vs-buy calculus more favorable to building. When you build, you control the broker layer. When you buy, you have to trust that the vendor implemented the broker layer correctly, rather than letting the agent framework call tools directly with inherited credentials. Most vendors did not implement it, because the per-tool broker is extra engineering work that does not show up in the demo.

Credential scoping

The third layer is credential scoping. The principle is: a tool's credentials should grant access to only the data that tool needs, for only the duration of the tool call, and only from the network location the tool runs in. Modern cloud IAM (AWS IAM with session tags, GCP workload identity federation, Azure managed identities) makes this achievable. Legacy SaaS APIs with broad static OAuth tokens do not.

Remy's OAuth over-provisioning is the canonical example of what happens when credential scoping is not done. A single token grants access to everything in the user's Workspace, and the token is shared across every tool the agent invokes. There is no way for the agent framework to say "this particular tool call should only see the PDF, not the inbox."
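
For contrast, here is a minimal sketch of a per-tool scope policy. The scope URLs are Google's published OAuth scope strings; the tool names, the `TOOL_SCOPES` table, and the `scopes_for` helper are assumptions made for illustration. The sketch only decides which scopes a tool call should request. Actually minting a short-lived, narrowed token depends on the identity provider and is not shown.

```python
from typing import Dict, FrozenSet

GMAIL_READONLY = "https://www.googleapis.com/auth/gmail.readonly"
GMAIL_SEND = "https://www.googleapis.com/auth/gmail.send"
DRIVE_FILE = "https://www.googleapis.com/auth/drive.file"

# Hypothetical mapping from tool name to the only scopes that tool needs.
TOOL_SCOPES: Dict[str, FrozenSet[str]] = {
    "summarize_pdf": frozenset({DRIVE_FILE}),
    "read_inbox": frozenset({GMAIL_READONLY}),
    "send_email": frozenset({GMAIL_SEND}),
}


def scopes_for(tool: str, user_granted: FrozenSet[str]) -> FrozenSet[str]:
    """Scopes to request for one tool call: what the tool needs, never more."""
    needed = TOOL_SCOPES.get(tool)
    if needed is None:
        # No policy means no credentials, rather than falling back to a broad token.
        raise KeyError(f"no scope policy for tool {tool!r}")
    missing = needed - user_granted
    if missing:
        raise PermissionError(f"user has not granted {sorted(missing)}")
    return needed
```

Under this policy, a user who granted all three scopes still gets a `summarize_pdf` call that requests only the Drive file scope, so the inbox is out of reach for that call.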

For a build-vs-buy decision: if you are building, you can implement credential scoping yourself. If you are buying, you are depending on the vendor to do it — and the vendor has strong economic incentives not to. Fine-grained scoping is engineering work; broad-scope OAuth tokens are easy to ship.

The Hardening Checklist

Here is the concrete hardening checklist for an enterprise AI agent stack, derived from the failures the Remy cluster exposed. None of these are theoretical. All of them are things you can implement this quarter.

Run every tool call through a broker, never direct. The agent framework should not have direct access to tool implementations. Every call routes through a middleware layer that authenticates the agent, scopes the call, and emits an audit log. If you are using a framework that calls tools directly (LangChain, AutoGen, raw MCP servers without a wrapper), wrap them.
Scope OAuth tokens per-tool, not per-agent. When a tool needs to call the Gmail API, it should get a Gmail-specific OAuth token with only the scopes the Gmail tool needs. When the PDF tool needs read access to a Drive file, it should get a Drive token scoped to that file. The agent's overall reasoning loop should not hold any persistent tokens at all.
Audit-log every tool call, with full input/output payload, to an immutable store. "Immutable" here means write-once, not "we promise not to delete it." Use S3 object lock, or a SIEM with retention policy, or a dedicated audit-log database with append-only schema. The audit log is your last line of defense in a post-incident review.
Add a per-tool blast-radius cap. Each tool should have a hard limit on what it can do in a single session: maximum number of tool calls, maximum bytes of data read or written, maximum number of distinct external services contacted. When the cap is hit, the agent session is halted and requires human re-authorization to continue.
Instrument side-channels explicitly. Calendar, drafts folders, notes apps, task lists, search history: these are all places where users store sensitive data that the agent's reasoning loop can read through an indirect path. Either remove agent access to these surfaces entirely, or wrap them in tools that explicitly redact sensitive content types before returning them to the agent's context.
Test the reasoning chain, not just the API surface. Pen-testing the agent's REST API is necessary but not sufficient. You also need adversarial testing of the reasoning chain: crafted inputs designed to make the agent take tool-call actions the user did not intend. Red-team the reasoning chain the same way you red-team any other production system.
Decide your build-vs-buy answer per agent, not per vendor. The Remy cluster shows that a single vendor's agent stack can have both well-architected and poorly architected components. Do not buy or reject an entire vendor based on one capability. Evaluate the specific agent stack you would deploy, against the specific data it would access, with the specific blast radius of failure. A low-risk internal agent (summarizing internal docs) has different requirements than a high-risk agent (sending customer-facing emails).

These seven items are the floor. If your agent stack cannot pass all seven, you are not ready to put it in production against sensitive data — regardless of whether you built it or bought it.
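
The blast-radius cap from the checklist is the easiest item to prototype. The sketch below is one possible shape, with hypothetical names and limits: a per-session counter that the broker charges before forwarding each call, and that halts the session once any limit would be exceeded.

```python
from dataclasses import dataclass, field
from typing import Set


class CapExceeded(Exception):
    """Raised when a session hits its cap or is already halted."""


@dataclass
class BlastRadiusCap:
    max_calls: int
    max_bytes: int
    max_services: int
    calls: int = 0
    bytes_moved: int = 0
    services: Set[str] = field(default_factory=set)
    halted: bool = False

    def charge(self, nbytes: int, service: str) -> None:
        """Account for one tool call; halt the session if a limit is crossed."""
        if self.halted:
            raise CapExceeded("session halted; human re-authorization required")
        calls = self.calls + 1
        total = self.bytes_moved + nbytes
        services = self.services | {service}
        if (calls > self.max_calls or total > self.max_bytes
                or len(services) > self.max_services):
            self.halted = True
            raise CapExceeded("blast-radius cap reached")
        self.calls, self.bytes_moved, self.services = calls, total, services

    def reauthorize(self) -> None:
        """Called only after a human has reviewed and approved continuing."""
        self.halted = False
        self.calls = 0
        self.bytes_moved = 0
        self.services = set()
```

The counters are updated only when a call is within limits, so a rejected call never consumes budget, and a halted session stays halted until `reauthorize` is called explicitly.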

What the Affiliate Stack Looks Like

Tooling exists for every layer of the hardening checklist. As of mid-2026, four platforms cover the most surface area in the broker/audit/observability space: Sentry for error tracking and tool-call exception capture, Datadog for full-stack observability including OTel-exported agent traces, Portkey for the LLM gateway layer with first-class per-call audit logging, and Arize Phoenix for agent-specific observability including reasoning step reconstruction.

Sentry: tool-call exception capture

Sentry has been quietly adding LLM-specific instrumentation throughout 2025 and 2026. As of mid-2026, the Python and Node SDKs capture token usage, model name, and prompt/completion pairs as breadcrumbs on every LLM call. The relevant feature for hardening: tool-call breadcrumb capture. Every time an agent invokes a tool through the broker, Sentry captures the call's parameters, the result, and the duration as a breadcrumb attached to the agent session. When a tool call fails or behaves unexpectedly, the breadcrumb chain tells you exactly what the agent was doing in the steps leading up to the failure.

Sentry does not provide the broker layer itself — you still need to write that middleware. What Sentry provides is the post-hoc reconstruction. For an enterprise team running agents in production, the reconstruction layer is what makes incident response tractable.

Datadog: full-stack observability for agent traces

Datadog's LLM Observability product, GA since late 2025, ingests OpenTelemetry-exported agent traces and gives you a flame-graph view of every reasoning step, tool call, and model invocation. The relevant feature for hardening: tool-call chain attribution. Datadog groups tool calls by the agent session that initiated them, and lets you query for "all sessions where the calendar tool was called after the email tool" or "all sessions where a single tool received more than 100 calls." Those queries are how you catch the kind of side-channel exploitation Remy demonstrated.
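
The logic behind those two queries is simple enough to sketch in plain Python. To be clear, this is not Datadog's query language or API. It assumes you have exported tool-call records as `(session_id, start_time, tool)` tuples, a format invented here for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical exported record: (session_id, start_time, tool_name)
Span = Tuple[str, float, str]


def sessions_with_order(spans: Sequence[Span], first: str, then: str) -> Set[str]:
    """Sessions in which `then` was called at some point after `first`."""
    by_session: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for session_id, start, tool in spans:
        by_session[session_id].append((start, tool))
    hits: Set[str] = set()
    for session_id, calls in by_session.items():
        seen_first = False
        for _, tool in sorted(calls):  # order calls by start time
            if tool == first:
                seen_first = True
            elif tool == then and seen_first:
                hits.add(session_id)
                break
    return hits


def sessions_over_limit(spans: Sequence[Span], limit: int) -> Set[str]:
    """Sessions in which any single tool received more than `limit` calls."""
    counts = Counter((session_id, tool) for session_id, _, tool in spans)
    return {session_id for (session_id, _), n in counts.items() if n > limit}
```

Calling `sessions_with_order(spans, "email", "calendar")` mirrors the first query in the text, and `sessions_over_limit(spans, 100)` mirrors the second.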

Datadog is the expensive option. For teams already paying for Datadog infrastructure monitoring, the marginal cost of adding LLM Observability is reasonable. For teams that are not, the cost may push you toward self-hosted OpenTelemetry + Grafana Tempo, which we cover in other articles.

Portkey: LLM gateway with per-call audit logging

Portkey is the closest thing to a turnkey broker layer for LLM calls. It sits between your agent framework and the LLM provider (OpenAI, Anthropic, Gemini, etc.) and gives you per-call audit logging, cost attribution, retry/fallback logic, and rate limiting. The relevant feature for hardening: per-call audit log with prompt/response payload capture, stored in Portkey's own backend or exported to your SIEM.

Portkey is most useful for the gateway layer. It does not, as of mid-2026, wrap arbitrary tools (only LLM API calls). For the tools that hit external SaaS APIs (Gmail, Calendar, Salesforce), you still need to write your own broker. But for the LLM calls themselves, Portkey gives you 80% of the broker-layer benefits with 20% of the engineering work.

Arize Phoenix: agent-specific observability

Arize Phoenix is the platform most directly relevant to the Remy hardening discussion. Phoenix's open-source tracing library captures not just tool calls but the agent's reasoning chain — every prompt sent to the model, every completion received, every tool invocation and result, in a single trace tree. The relevant feature: reasoning step reconstruction, which lets you replay an agent's decision-making after the fact to understand why it took the actions it did.

Of the four, Phoenix most directly addresses the "reasoning chain" attack surface. The others (Sentry, Datadog, Portkey) are primarily oriented around capturing the I/O of each call. Phoenix captures the I/O and the chain of intermediate steps the agent took to produce it. For an enterprise trying to answer the question "did the agent act as the user intended?", which is the question Remy forces every enterprise architect to ask, Phoenix is the most directly applicable tool.

What "Build" Actually Buys You

The case for building your own agent stack — at least the broker and observability layers — gets stronger after Remy. A built stack lets you enforce every item on the hardening checklist. A bought stack, even from a major vendor, gives you the vendor's interpretation of those items, which may or may not match your threat model.

The case for buying still exists. Building the agent framework itself (the reasoning loop, the planning logic, the tool selection) is a multi-quarter project for a team of senior engineers. Most enterprises should not build that themselves. The pragmatic answer: build the broker and audit layers yourself, and buy the reasoning loop from a vendor. The broker and audit layers are a 6-8 week project for a small platform team, not a 6-month project.

The OpenClaw and MCP ecosystems, both of which we cover in detail elsewhere, are moving toward this split. The vendor provides the reasoning loop and the tool-calling abstractions; the enterprise provides the broker layer, the audit logging, and the credential scoping. If you are starting an enterprise agent project in 2026, this is the architecture to plan around.

The Question Every Architect Should Be Asking

Remy is not a story about Gemini. It is a story about agent stacks — all of them, including the one you might be about to build. The leaks it surfaces (OAuth over-provisioning, calendar side-channels, draft-state recovery) are not Google-specific failures. They are architectural failures that show up in any agent stack that does not explicitly defend against them.

Before you deploy an agent into production against sensitive data, ask the vendor or your internal platform team these seven questions:

Does every tool call go through a broker, or can the agent call tools directly?
Are OAuth tokens scoped per-tool, or does the agent hold a single broad-scope token?
Is every tool call audit-logged, with full input/output payload, to an immutable store?
Does each tool have a hard blast-radius cap (call count, byte count, service count)?
Which side-channels (calendar, drafts, notes) does the agent have access to, and how is sensitive content redacted before it reaches the agent's context?
How is the reasoning chain adversarially tested before each release?
Per-agent, what is the worst-case data exposure if a single tool is fully compromised?

If the answer to any of those is "I don't know" or "we have not implemented that," you do not have a production-ready agent stack. You have a prototype. The difference between the two is the hardening checklist, applied rigorously, before any sensitive data is in scope.
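
The redaction question is the one teams most often answer with "we have not implemented that," so here is a minimal sketch of what an answer could look like. It wraps a stand-in calendar tool so that SSN-shaped strings are masked before anything reaches the agent's context. The single regex, the `redacting` decorator, and the `read_calendar_events` function are all hypothetical. Real redaction needs far broader detection than one pattern.

```python
import re
from typing import Callable, Dict, List

# Illustrative pattern only: matches strings shaped like a US SSN.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of every pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text


def redacting(tool: Callable[..., List[str]]) -> Callable[..., List[str]]:
    """Wrap a tool so its results are redacted before the agent sees them."""
    def wrapped(*args, **kwargs) -> List[str]:
        return [redact(item) for item in tool(*args, **kwargs)]
    return wrapped


@redacting
def read_calendar_events(day: str) -> List[str]:
    # Stand-in for a real calendar API call; returns event descriptions.
    return [
        f"{day}: onboarding call, customer SSN 123-45-6789",
        f"{day}: lunch",
    ]
```

The point of the wrapper is placement: redaction happens inside the tool boundary, so the unredacted text never enters the reasoning loop's context at all.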

The Remy cluster is the first time the industry has been forced to confront this in public. It will not be the last. The enterprises that get ahead of it now — by demanding the architectural answers from their vendors, or by building the broker layer themselves — are the ones whose agent deployments will still be trusted in 2027.
