From "AI Tools" to "Agent Sprawl": Why the Platform Is the Hard Problem
Two years ago, an enterprise "AI strategy" meant a procurement decision. Pick a model vendor, point a handful of teams at the API, write a usage policy in a shared doc, and move on. The blast radius of a bad prompt was small — a chatbot said something dumb, you reset the conversation.
That era is over. In 2026 the typical Fortune 500 enterprise is running far more than two or three internal agents. The New Stack's June 2026 editorial put a number on what platform teams are seeing in their queue:
Enterprise architects are asking "how do I host 200 internal agents from 12 teams on shared infrastructure safely?" — and the people asking are not the AI team, they are the platform team.
That reframe matters. The question is no longer "how does an agent work?" — that problem is essentially solved by frameworks like LangGraph, AutoGen, and CrewAI. The question is now a platform question: how do you let a dozen teams ship agents independently, share the same model gateway, share the same tool execution substrate, and still answer the auditor's question "where do your AI agents run and who has access?"
At the same time, SAP's Sapphire 2026 launch of an "AI Agent Hub" put a name on a problem most CIOs had been ignoring: AI sprawl. Agents are being deployed outside IT oversight the same way SaaS apps and shadow-IT projects were deployed in the 2010s. Marketing spins up a campaign agent. Customer success ships a ticket triage agent. Finance builds a reconciliation agent. None of them are on the same infrastructure, none of them share a credential model, and none of them produce the kind of audit trail a SOC 2 or ISO 27001 auditor will accept in 2027.
This article is a reference architecture for the people who have to clean this up: the platform engineers, SREs, and security architects who are being asked to build the agentic ops platform — the shared substrate that every team in the company can safely ship agents on top of.
The Five Platform Requirements
Before drawing boxes on an architecture diagram, you need a checklist of what the platform must do. Five requirements recur across every enterprise we have talked to about this in 2026. They are not nice-to-haves. Skip one and you will be re-architecting in twelve months.
1. Per-Agent Identity and RBAC
A "user logs in, gets a session" model is the wrong primitive for agents. Agents are long-lived, often run unattended, are spawned by services, and frequently act on behalf of other services rather than human users. The platform must treat every agent as a first-class identity with its own service account, its own credential scope, and its own role bindings.
In practice this means:
- Every agent gets a SPIFFE / SVID-style workload identity at spawn time, signed by the platform's identity control plane.
- RBAC is expressed per-agent, not per-user: a "refund-bot" agent has the billing:read and billing:refund:create permissions, not the human permissions of whoever deployed it.
- Agent identities are short-lived. A long-running agent gets fresh credentials on a renewal interval (e.g. 1 hour), not a static API key that lives in a config file for the lifetime of the deployment.
- Cross-team agent invocations are brokered through the platform. Agent A cannot call Agent B directly; it must do so through the gateway, which checks A's identity, B's accepted callers, and writes an audit record.
The key insight: agent identity is the foundation of every other requirement on this list. Without it, you cannot do per-agent rate limits, you cannot do per-agent cost attribution, you cannot do per-agent audit logging, and you cannot enforce blast-radius reduction. Get the identity model right and the rest of the platform gets dramatically simpler.
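The identity rules above can be sketched in a few lines of Python. This is a minimal illustration, not a SPIFFE implementation: the `AgentCredential` and `Gateway` classes, the `example.org` trust domain, and the in-memory audit list are all hypothetical names introduced here. The one-hour lifetime mirrors the example renewal interval mentioned above.

```python
from dataclasses import dataclass, field

CREDENTIAL_TTL_SECONDS = 3600  # the example 1-hour renewal interval


@dataclass(frozen=True)
class AgentCredential:
    spiffe_id: str          # hypothetical ID, e.g. spiffe://example.org/agent/refund-bot
    permissions: frozenset  # bound to the agent, not to whoever deployed it
    issued_at: float        # seconds since epoch

    def is_valid(self, now: float) -> bool:
        # Short-lived: the credential expires unless the platform renews it.
        return now - self.issued_at < CREDENTIAL_TTL_SECONDS


@dataclass
class Gateway:
    # accepted_callers[target_id] is the set of agent IDs allowed to invoke it.
    accepted_callers: dict
    audit: list = field(default_factory=list)

    def authorize(self, cred: AgentCredential, permission: str, now: float) -> bool:
        allowed = cred.is_valid(now) and permission in cred.permissions
        self.audit.append(("permission", cred.spiffe_id, permission, allowed))
        return allowed

    def broker_call(self, caller: AgentCredential, target_id: str, now: float) -> bool:
        # Agent A never calls agent B directly; the gateway checks A's identity
        # against B's accepted callers and records the decision either way.
        callers = self.accepted_callers.get(target_id, set())
        allowed = caller.is_valid(now) and caller.spiffe_id in callers
        self.audit.append(("invoke", caller.spiffe_id, target_id, allowed))
        return allowed
```

Every decision, allowed or denied, lands in the audit list, which is what lets the later requirements build on the same identity.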
2. Per-Agent Rate Limits and Budgets
An agent is a cost center that thinks. Left to its own devices, a misconfigured agent can burn through your monthly OpenAI bill in an afternoon. We have seen production incidents where a re-planning loop in a customer-facing agent racked up five-figure API costs before the on-call engineer even got paged.
The platform needs to enforce three kinds of limits on every agent:
- Request rate: a hard ceiling on requests per second per agent. If a single agent is suddenly making 10x its normal request volume, the gateway should rate-limit it before it hits the upstream model provider.
- Token budget: a per-agent (and per-team, per-environment) cap on token spend over a rolling window. When an agent exceeds its budget, the gateway should fail its requests with a 429 and emit a metric, not silently pass them through to the model provider.
- Step / iteration count: a cap on the number of LLM turns, tool calls, or sub-agent invocations a single agent run can make. This is the "infinite loop" circuit breaker. Without it, a buggy tool result that causes the agent to re-plan forever will quietly exhaust every other limit you have set.
These limits should be configurable per-agent via a declarative manifest — a CRD if you are on Kubernetes, a YAML config if you are on a lighter substrate — and the gateway should hot-reload them without requiring an agent restart.
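As a rough sketch of how the three limits compose at the gateway, the Python below combines a token bucket for request rate, a rolling-window sum for token budget, and a step cap. The `AgentLimits` and `LimitEnforcer` names and the field layout are assumptions made for illustration. A real gateway would load the values from the agent's manifest and keep state in a shared store rather than in process memory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentLimits:
    max_requests_per_second: float
    token_budget: int        # tokens allowed per rolling window
    window_seconds: float
    max_steps_per_run: int   # the "infinite loop" circuit breaker


class LimitEnforcer:
    def __init__(self, limits: AgentLimits):
        self.limits = limits
        self._bucket = limits.max_requests_per_second  # token bucket starts full
        self._last_refill = 0.0
        self._spend = deque()  # (timestamp, tokens) pairs inside the window

    def check(self, now: float, tokens: int, step: int) -> int:
        """Return 200 to forward the request or 429 to reject it."""
        if step > self.limits.max_steps_per_run:
            return 429

        # Request rate: refill the bucket for the elapsed time, capped at capacity.
        rate = self.limits.max_requests_per_second
        self._bucket = min(rate, self._bucket + (now - self._last_refill) * rate)
        self._last_refill = now
        if self._bucket < 1.0:
            return 429

        # Token budget: drop spend that has aged out of the rolling window.
        cutoff = now - self.limits.window_seconds
        while self._spend and self._spend[0][0] <= cutoff:
            self._spend.popleft()
        if sum(t for _, t in self._spend) + tokens > self.limits.token_budget:
            return 429

        self._bucket -= 1.0
        self._spend.append((now, tokens))
        return 200
```

Passing `now` in explicitly keeps the sketch deterministic; a production enforcer would read a clock and emit a metric on each 429.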
3. Audit Logging at the Gateway
Every model call, every tool call, and every cross-agent invocation must produce an immutable, structured log entry. This is the requirement that makes the platform defensible to a SOC 2 or ISO 27001 auditor, and it is the one that most "agent platforms" in 2026 handle worst.
The minimum audit record per call is:
- Agent identity (SPIFFE ID or equivalent)
- Human owner / on-call contact (resolved at agent-spawn time, not the deployer's personal email)
- Timestamp, plus a monotonic sequence number for ordering (wall-clock time alone does not order events reliably across hosts)
- Model used and provider
- Prompt hash (not the prompt itself — store the hash, the prompt body lives in a separate, more protected log)
- Completion hash and token counts (input, output, cached)
- Tool calls made: name, parameters, return value hash
- Cross-agent calls made: target agent, payload hash
The audit log is not the same as your observability traces. The trace is for debugging ("why did this run take 14 seconds?"); the audit log is for compliance ("show me every time agent X accessed customer PII in the last 90 days"). They have different retention policies, different access controls, and different downstream consumers.
Append-only storage is the right model. The audit log should be written to a write-once bucket or an append-only log (S3 with object lock, CloudWatch Logs with an immutable log group, or an OLTP store with a no-update / no-delete policy). It must be impossible for the platform team itself to retroactively alter the record.
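A minimal sketch of such a record in Python follows. It stores hashes rather than prompt bodies, and it chains each record to the previous one so that a retroactive edit is detectable. The hash chain is an assumption added here for illustration; the storage options above (object lock, append-only stores) get immutability from the backend instead. The `AuditRecord` and `AuditLog` names are hypothetical, and tool-call fields are omitted for brevity.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AuditRecord:
    agent_id: str
    owner_contact: str
    timestamp: float
    sequence: int
    model: str
    provider: str
    prompt_hash: str        # the prompt body lives in a separate, more protected log
    completion_hash: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    prev_record_hash: str   # links this record to the one before it


def _digest(record: AuditRecord) -> str:
    return sha256_hex(json.dumps(asdict(record), sort_keys=True))


class AuditLog:
    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []  # (record, digest) pairs; append-only by convention

    def append(self, *, agent_id, owner_contact, timestamp, model, provider,
               prompt, completion, input_tokens, output_tokens, cached_tokens=0):
        prev = self._entries[-1][1] if self._entries else self.GENESIS
        record = AuditRecord(
            agent_id=agent_id, owner_contact=owner_contact, timestamp=timestamp,
            sequence=len(self._entries), model=model, provider=provider,
            prompt_hash=sha256_hex(prompt), completion_hash=sha256_hex(completion),
            input_tokens=input_tokens, output_tokens=output_tokens,
            cached_tokens=cached_tokens, prev_record_hash=prev,
        )
        self._entries.append((record, _digest(record)))
        return record

    def verify(self) -> bool:
        """Recompute the chain; any altered or reordered record breaks it."""
        prev = self.GENESIS
        for record, digest in self._entries:
            if record.prev_record_hash != prev or _digest(record) != digest:
                return False
            prev = digest
        return True
```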
4. Prompt-Injection Defense at the Platform Layer
Prompt injection is one of the few AI security threats that an enterprise platform team can defend against at the infrastructure layer. Application-level mitigations (output filtering, system prompt hardening) matter, but they are each team's responsibility and they will be inconsistent. The platform can and should enforce baseline defenses that every agent inherits.
The defense has three layers:
- Input sanitization at the gateway: the gateway inspects inbound user input and tool results for known injection patterns (instruction-override phrases, base64-encoded payloads, indirect-injection markers from retrieved documents) and either rejects, redacts, or quarantines the input before it reaches the agent's context. Different actions for different risk levels: a low-confidence match might add a warning to the agent's context, a high-confidence match should reject the request outright.
- Output filtering at the gateway: the gateway also inspects the agent's response for signs of an injection that succeeded — leaked system prompts, unexpected tool calls, output that looks like an attempt to influence downstream agents. This is your last line of defense before the response leaves the platform.
- Tool-broker isolation: agents do not call tools directly. They request tool invocations through the gateway, which validates that the requested tool call is within the agent's declared policy (the CRD/YAML manifest), that the parameters are within expected types and ranges, and that the call does not cross a trust boundary the agent should not cross. A customer-support agent cannot invoke a "delete S3 bucket" tool because that tool is not in its manifest. Even if prompt injection tricks the agent into trying, the gateway stops the call.
This is also where the platform's per-agent RBAC pays a second dividend: the same identity that controls rate limits and budgets controls which tools the agent is allowed to call. One mechanism, two enforcement points.
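To make the tiered input-sanitization idea concrete, here is a small Python sketch. The two regular expressions, the base64 heuristic, and the three-way `Action` enum are illustrative assumptions. Real deployments would typically use a much larger pattern set or a trained classifier, and pattern matching alone will miss many injections.

```python
import base64
import re
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"      # low confidence: annotate the agent's context
    REJECT = "reject"  # high confidence: refuse the request outright


# Hypothetical instruction-override phrases, for illustration only.
OVERRIDE_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard (the )?system prompt", re.IGNORECASE),
]
# A long run of base64-alphabet characters, optionally padded.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def _decodes_to_override(text: str) -> bool:
    for candidate in BASE64_RUN.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except ValueError:  # covers bad base64 and non-UTF-8 payloads
            continue
        if any(p.search(decoded) for p in OVERRIDE_PATTERNS):
            return True
    return False


def screen_input(text: str) -> Action:
    if any(p.search(text) for p in OVERRIDE_PATTERNS):
        return Action.REJECT
    if _decodes_to_override(text):
        return Action.REJECT
    if BASE64_RUN.search(text):
        return Action.WARN  # opaque blob: suspicious but not conclusive
    return Action.ALLOW
```

The same function can be applied to tool results and retrieved documents, since indirect injection arrives through those channels rather than through the user's message.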
5. Blast-Radius Reduction via Tool-Broker Isolation
The last requirement is the one most teams do not think about until they have already had an incident. When an agent misbehaves — whether from a bug, a model regression, or a successful prompt injection — the platform should make the blast radius as small as possible, automatically.
Concretely:
- Network segmentation: every agent runs in its own network namespace, with default-deny egress. The platform opens only the egress paths the agent has declared in its manifest. If the agent tries to talk to an undeclared host, the connection is dropped and the platform emits a high-priority alert.
- Resource ceilings: per-agent CPU, memory, disk, and GPU caps. A runaway agent cannot OOM the host. A model that suddenly starts producing 10x the normal token output cannot exhaust the model server's KV cache.
- Sandboxed tool execution: any tool that can execute code, write files, or call external services runs inside a sandbox. The sandbox's blast radius is bounded by the sandbox itself, not by the agent. If a code-execution tool gets prompt-injected into writing a destructive shell command, the worst that happens is one sandboxed container is destroyed.
- Automatic revocation: when the platform detects anomalous behavior (budget exhaustion, repeated tool-call failures, suspicious tool call patterns), it should automatically revoke the agent's credentials and stop further work, then page the human on-call. The default should be fail-closed, not fail-open.
The pattern that ties these together is what we call tool-broker isolation: the platform inserts a broker between the agent and every external dependency — the model, the tools, the data stores, the other agents — and the broker enforces the agent's policy at every call. The agent thinks it is talking to the world; it is actually talking to a controlled gateway that decides, on every call, whether the world is allowed to talk back.
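The broker pattern can be sketched as a single choke point that validates each tool call against the agent's manifest and fails closed after repeated denials. Everything named here (`ToolPolicy`, `ToolBroker`, the `max_denials` threshold, the `(type, min, max)` parameter schema) is a hypothetical shape chosen for the example, not a reference to any particular product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolPolicy:
    # parameter name -> (type, min, max); min/max are None for unbounded values
    params: dict


@dataclass
class ToolBroker:
    manifest: dict                 # tool name -> ToolPolicy, from the agent's manifest
    max_denials: int = 3           # hypothetical anomaly threshold
    denials: int = 0
    revoked: bool = False
    audit: list = field(default_factory=list)

    def invoke(self, tool: str, args: dict, execute):
        """Run `execute(**args)` only if the call passes the agent's policy."""
        reason = self._deny_reason(tool, args)
        self.audit.append((tool, reason or "allowed"))
        if reason:
            self.denials += 1
            if self.denials >= self.max_denials:
                self.revoked = True  # fail closed; a real platform would also page on-call
            raise PermissionError(reason)
        return execute(**args)

    def _deny_reason(self, tool: str, args: dict):
        if self.revoked:
            return "agent credentials revoked"
        policy = self.manifest.get(tool)
        if policy is None:
            return f"tool {tool!r} is not in the agent manifest"
        if set(args) != set(policy.params):
            return "unexpected or missing parameters"
        for name, (kind, low, high) in policy.params.items():
            value = args[name]
            if not isinstance(value, kind):
                return f"parameter {name!r} has the wrong type"
            if low is not None and not (low <= value <= high):
                return f"parameter {name!r} is out of range"
        return None
```

Because the agent never holds a direct handle to the tool, a prompt-injected request for an undeclared tool is rejected before anything executes.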
Reference Architecture: The Kubernetes Operator Pattern
The substrate that most enterprise platform teams reach for in 2026 is Kubernetes, because Kubernetes already provides a vocabulary for "long-running workload with a declared spec and a controller that reconciles actual state to desired state." Agent orchestration maps onto that vocabulary almost perfectly. An Agent is a Custom Resource; an AgentOperator is a controller that watches Agent CRs and reconciles the actual running pods, secrets, network policies, and gateway config to match the declared spec.
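The reconcile idea is easier to see stripped of Kubernetes machinery. The Python below is a toy model, assuming desired and actual state are plain dictionaries keyed by agent name. A real operator would use a Kubernetes client library and watch events, none of which is shown here, and the function names are hypothetical.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions that would move `actual` to match `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_actions(actions: list, desired: dict, actual: dict) -> None:
    """Apply the planned actions to the in-memory `actual` state."""
    for verb, name in actions:
        if verb == "delete":
            del actual[name]
        else:  # create and update both converge on the declared spec
            actual[name] = dict(desired[name])
```

The property that matters is idempotence: once the actions are applied, running `reconcile` again yields an empty list, so the controller can loop safely.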
The reference architecture has five planes:
Control Plane
The control plane is the API surface platform engineers interact with. It is exposed as a Kubernetes API (Custom Resource Definitions) and as a higher-level control plane UI for non-Kubernetes teams.
- Agent CRD: declares the agent's identity, model, tool list, budget, rate limits, and egress policy. The platform team owns the CRD schema; the agent developer fills in the spec.
- AgentOperator: a controller that watches Agent CRs and reconciles running workloads to match. When an Agent CR is created, the operator generates a SPIFFE identity, creates a NetworkPolicy, configures a ServiceAccount with the right RBAC, deploys the agent runtime, and registers the agent with the gateway. When the CR is deleted, the operator tears down all of the above.
- Policy Engine: validates every Agent CR against the platform's policies before admitting it. A marketing agent that requests 10x its team's token budget gets rejected at admission. An agent that declares a tool not in the company's approved-tool catalog gets rejected at admission.
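A sketch of what the spec and the admission check might look like, written as Python dataclasses rather than an actual CRD schema. The field names, the `PlatformPolicy` shape, and the rule that an agent may not exceed its team's budget are assumptions for illustration. An actual deployment would express the schema as a CRD and run the check in an admission webhook or policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    name: str
    team: str
    model: str
    tools: tuple
    token_budget: int          # tokens per budget window
    requests_per_second: float
    egress_hosts: tuple


@dataclass(frozen=True)
class PlatformPolicy:
    approved_tools: frozenset  # the company's approved-tool catalog
    team_token_budgets: dict   # team name -> maximum tokens per window


def admit(spec: AgentSpec, policy: PlatformPolicy) -> list:
    """Return a list of violations; an empty list means the spec is admitted."""
    violations = []
    team_budget = policy.team_token_budgets.get(spec.team)
    if team_budget is None:
        violations.append(f"unknown team {spec.team!r}")
    elif spec.token_budget > team_budget:
        violations.append("token budget exceeds the team's allocation")
    for tool in spec.tools:
        if tool not in policy.approved_tools:
            violations.append(f"tool {tool!r} is not in the approved catalog")
    return violations
```

Returning every violation at once, rather than failing on the first, gives the agent developer a complete list to fix in one pass.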
Gateway Plane
The gateway plane is the runtime enforcement layer. It sits between every agent and every external dependency. In 2026 the most common implementations are:
- LLM gateway: Portkey, Helicone, LiteLLM, or the open-source AgentGateway project. Handles model routing, fallback, retry, caching, per-agent budget enforcement, and emits the audit log for every model call.
- Tool broker: a thin proxy (often built on Envoy or a custom gRPC service) that brokers every tool call. Validates the call against the agent's declared tool list, sandboxes the execution environment, and writes the audit record.
- Agent-to-agent bus: a message broker (NATS, Kafka, or a managed equivalent) that is the only sanctioned way for one agent to invoke another. The bus enforces authentication, rate limits, and audit on every cross-agent message.
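The routing, retry, and fallback behavior of an LLM gateway reduces to a small loop. The sketch below assumes each provider is a plain callable that raises `ConnectionError` on a transient failure. That interface and the `call_with_fallback` name are hypothetical and do not correspond to the API of any gateway listed above.

```python
def call_with_fallback(providers, prompt, max_attempts_per_provider=2):
    """Try each (name, send) provider in order, retrying transient failures.

    Returns (provider_name, response) from the first provider that succeeds.
    """
    errors = []
    for name, send in providers:
        for attempt in range(1, max_attempts_per_provider + 1):
            try:
                return name, send(prompt)
            except ConnectionError as exc:
                # Record the failure, then retry or move on to the next provider.
                errors.append((name, attempt, str(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

A production gateway would add backoff between attempts and write an audit record for each upstream call, including the failed ones.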
Execution Plane
The execution plane is where the agent code actually runs. Two patterns dominate:
- Durable execution framework: Temporal or Inngest as the orchestrator. The agent's plan is expressed as a Temporal workflow or Inngest function; each LLM call or tool call is an activity (a step, in Inngest's terms). The framework handles retries, timeouts, durability across restarts, and human-in-the-loop pauses. This is the right pattern for agents that take minutes or hours to complete a single user request.
- Stateless worker pool: for short-lived agent calls (single-turn or few-turn), a Kubernetes Deployment with HPA scaling on gateway queue depth. The worker is ephemeral; all state lives in the gateway and the durable store.
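The core idea behind durable execution, journaling each completed step so a restarted run replays results instead of redoing work, can be shown without any framework. The toy below is not Temporal's or Inngest's API. The `Journal` class and `run_workflow` function are hypothetical, and the journal is serialized to a JSON string where a real system would use a durable store.

```python
import json


class Journal:
    """Ordered record of completed activity results."""

    def __init__(self, saved: str = "[]"):
        self.results = json.loads(saved)

    def dump(self) -> str:
        return json.dumps(self.results)


def run_workflow(journal: Journal, activities: list) -> list:
    """Run activities in order, replaying any result already in the journal."""
    outputs = []
    for index, activity in enumerate(activities):
        if index < len(journal.results):
            outputs.append(journal.results[index])  # replay, do not re-execute
            continue
        result = activity()             # may raise; the journal keeps earlier steps
        journal.results.append(result)  # persist before moving to the next step
        outputs.append(result)
    return outputs
```

If the second activity crashes, rerunning with the saved journal skips the first activity entirely. That is why an expensive LLM call is not repeated after a worker restart.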
Sandbox Plane
The sandbox plane isolates the side effects of tool execution. The dominant options in 2026 are:
- E2B: a managed sandboxed code execution service. Spin up a Firecracker microVM per tool call, run untrusted code, return the result, destroy the VM. Cold-start latency is in the 100-200ms range, which is acceptable for most agent tool calls.
- Modal: a serverless platform with first-class support for sandboxed function execution. Useful when the tool execution is GPU-heavy (e.g. an agent that runs an image-generation tool).
- Fly Machines: lightweight, fast-booting VMs that can be used as one-shot sandboxes. More general-purpose than E2B, but you wire up the security boundaries yourself.
For most enterprise deployments, E2B is the right default for arbitrary code execution and Modal is the right default for GPU work. Fly Machines are a good fallback when you need more control over the sandbox environment than E2B provides.
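The one-shot lifecycle these services share (create, run, collect the result, destroy) can be illustrated locally with a child process. To be clear about the assumption: a subprocess is not a security boundary and is no substitute for a microVM. The hypothetical `run_one_shot` helper below only demonstrates the lifecycle and the timeout, not the isolation.

```python
import subprocess
import sys
import tempfile


def run_one_shot(code: str, timeout_seconds: float = 5.0) -> dict:
    """Run `code` in a fresh interpreter and scratch directory, then discard both."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,  # the child is killed if it overruns
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "error": "timeout"}
    # The scratch directory is gone by this point, whatever the code wrote into it.
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "error": proc.stderr}
```

With a managed sandbox, the same shape applies, but the process and filesystem live inside a VM that is destroyed after the call.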
Observability Plane
The observability plane is the substrate that makes the rest of the platform debuggable. Three layers, each with a distinct consumer:
- Distributed traces: OpenTelemetry spans from the agent runtime, the gateway, and the sandbox plane. Spans are stitched together by trace ID, so a single user request produces a single trace that crosses the agent, the LLM gateway, the tool broker, and the sandbox. Backend: Tempo, Jaeger, or a managed equivalent. Consumer: the on-call engineer debugging a slow run.
- Audit log: the append-only record described in requirement #3. Backend: S3 with object lock, or an append-only log store. Consumer: the auditor, the security team, the legal team.
- Operational metrics: token spend per agent, tool-call latency distribution, error rate by tool, sandbox failure rate, agent queue depth. Backend: Prometheus + Grafana, or a managed equivalent. Consumer: the platform team operating the system, the FinOps team tracking cost.
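As a small illustration of what trace stitching buys the on-call engineer, the Python below groups spans by trace ID and totals the time spent in each component. The span dictionaries and their field names (`trace_id`, `component`, `start_ms`, `end_ms`) are a simplified, hypothetical shape, not the OpenTelemetry data model.

```python
from collections import defaultdict


def time_by_component(spans: list, trace_id: str) -> dict:
    """Total span duration in milliseconds per component for one trace."""
    totals = defaultdict(int)
    for span in spans:
        if span["trace_id"] == trace_id:
            totals[span["component"]] += span["end_ms"] - span["start_ms"]
    return dict(totals)


def slowest_component(spans: list, trace_id: str) -> str:
    """Name the component that accounts for the most time in the trace."""
    totals = time_by_component(spans, trace_id)
    return max(totals, key=totals.get)
```

In practice the tracing backend answers this question through its UI; the point is that a shared trace ID across agent, gateway, broker, and sandbox is what makes the question answerable at all.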
The Compliance Angle: What Auditors Will Ask in 2027
SOC 2 and ISO 27001 auditors are catching up to AI agents faster than most enterprises are. The questions they are starting to ask on 2026 audits are not "do you have an AI policy?" — every company can produce a policy document. The questions are operational:
- "Where do your AI agents run?" Not "what cloud" — "which Kubernetes cluster, which namespace, which node pool, and who has admin access to that node pool."
- "Show me every access to customer PII by an AI agent in the last 90 days." Not "every access by a user" — by an agent. If you cannot produce this report, you cannot answer the auditor's question.
- "How do you revoke an agent's access in an emergency?" Not "how do you reset a user's password" — how do you kill an agent mid-run, across all its sessions, and revoke every credential it holds.
- "What is the change-control process for an agent's tool list?" When the marketing team adds a new tool to their agent's manifest, is that change reviewed, approved, and version-controlled the same way a production code change is?
- "How do you detect a compromised agent?" If an attacker takes over an agent's credentials, what alerts fire, and what is the automated response?
Most enterprises we have talked to cannot answer all five of these today. The ones who can built the agentic ops platform rather than letting agents proliferate on whatever infrastructure each team happened to have available. The platform is, in the most literal sense, the audit-defense layer.
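The second question on that list is, mechanically, a filter over the audit log. The sketch below assumes each audit record carries an agent ID, a timezone-aware timestamp, and a `touches_pii` flag. That flag is a hypothetical field introduced here and is not part of the minimum record described earlier. A platform would need some way of tagging PII-bearing tools or data stores for the query to work.

```python
from datetime import datetime, timedelta, timezone


def pii_accesses(records: list, agent_id: str, now: datetime, days: int = 90) -> list:
    """Return the agent's PII-touching audit records from the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        record for record in records
        if record["agent_id"] == agent_id
        and record["touches_pii"]
        and record["timestamp"] >= cutoff
    ]
```

If producing this list requires joining several teams' ad hoc logs, the platform is not yet the system of record for agent activity.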
The AI Sprawl Problem: A Modern Shadow IT
Shadow IT was the defining infrastructure problem of the 2010s. Marketing signed up for a SaaS tool without telling IT. The data lived outside the corporate compliance perimeter. The audit failed. The CIO got fired, or at least had a very bad quarter.
AI sprawl is the same problem in a new shape. SAP's framing at Sapphire 2026 was direct: enterprise AI is following the same adoption curve as enterprise SaaS, with the same governance lag. The teams that should be governing agent deployment — IT, security, platform — are not the teams that are deploying agents. The teams that are deploying agents — product, marketing, customer success, data science — are doing so on whatever cloud account or SaaS subscription is most convenient.
Every unmanaged agent is a compliance liability, a cost center no one is watching, and a potential prompt-injection target with no sandbox around it. The agentic ops platform is the answer to AI sprawl the same way the central IT portal was the answer to shadow IT in the 2010s: by making the sanctioned path easier than the unsanctioned one.
The mechanism is the same too. When the platform team ships a gateway that gives every team in the company per-agent RBAC, audit logs, and rate limits out of the box, the cost of going around the platform goes up. The cost of going through it goes down. Teams self-select to the platform not because they have to, but because the platform is the path of least resistance.
That is the goal. Not "we made it policy" — that never worked for shadow IT, and it will not work for AI sprawl. "We made the platform the obvious choice."
The Tool Stack: A Concrete 2026 Shopping List
Given the architecture above, the concrete tools an enterprise platform team is likely to pick in 2026:
Gateway Layer
- AgentGateway (open source, hosted by the Linux Foundation): a vendor-neutral LLM gateway. Good baseline if you want to avoid per-vendor lock-in.
- Portkey (managed): per-agent identity, budget enforcement, 200+ model providers, audit log out of the box. The right choice when you want to ship the gateway plane in a week rather than a quarter.
- Helicone (managed): strongest on per-agent cost attribution and budget alerting. Pairs well with Portkey or as a standalone observability layer.
- LiteLLM (open source): the most widely deployed self-hosted LLM proxy. Good fallback if you have a hard "no managed LLM gateway" constraint.
Durable Execution
- Temporal: the strongest production track record for long-running, stateful workflows. The default choice for agents that span minutes or hours.
- Inngest: a more developer-experience-focused alternative. Strong if your agent developers are coming from a serverless background and Temporal's worker model feels heavy.
Sandbox
- E2B: managed Firecracker-based code execution sandboxes. Default for arbitrary code-execution tools.
- Modal: serverless GPU execution. Default when the agent's tools include ML inference or image generation.
- Fly Machines: lightweight VMs, more control than E2B, more operational burden than Modal. Use when neither managed option fits.
Observability
- Grafana Cloud: the default observability backend for most platform teams in 2026. Loki for logs, Tempo for traces, Mimir for metrics, all behind one Grafana UI. The audit log can live in Loki, provided retention and delete permissions are locked down so that stored records cannot be altered.
- OpenTelemetry: the instrumentation layer, not the backend. The agent runtime, the gateway, and the sandbox plane all emit OTel spans. The OTel collector ships them to whichever backend you pick.
Adoption Path: Three Milestones, Not a Big Bang
The teams that ship agentic ops platforms fastest do not try to build the whole architecture on day one. They ship in three milestones.
Milestone 1: The Gateway (Weeks 1-4)
Ship the LLM gateway and the audit log. Every team that wants to deploy an agent must route through the gateway. The gateway enforces per-team budgets and emits the audit log. This single milestone answers the auditor's "where do your agents run" and "show me agent access to PII" questions, and it gives the FinOps team a single dashboard to track agent cost.
The platform team owns the gateway. The agent teams self-onboard. This is the milestone where AI sprawl starts to consolidate.
Milestone 2: Identity and Tool Broker (Weeks 5-10)
Add per-agent identity (SPIFFE / SVID) and the tool broker. The Agent CRD lands. Teams declare their agent's tools, budgets, and egress policy in a manifest. The platform validates the manifest against the policy engine, then the AgentOperator reconciles the running workload.
This is the milestone where prompt-injection defense becomes real. Before milestone 2, an injected agent could call any tool it could find; after milestone 2, the tool broker rejects any tool call not in the manifest. The blast radius of a successful injection shrinks from every tool the agent can reach to the tools declared in its manifest.
Milestone 3: Sandboxes and Durable Execution (Weeks 11-20)
Add the sandbox plane and wire agents into Temporal (or Inngest) for durable execution. This is the milestone where long-running, multi-step agents become safe to put in front of customers. A half-finished agent run can survive a platform restart. A misbehaving agent is contained in a sandbox that gets destroyed without affecting the host.
By the end of milestone 3, the platform team has answered every question on the auditor's list. AI sprawl is no longer a problem because every agent in the company is on the platform. The agent teams have a faster path to production than they had when they were rolling their own infrastructure.
Conclusion: The Platform Is the Product
The thesis of this article is straightforward: in 2026, the most important layer in the enterprise AI stack is not the model, not the agent framework, and not the data — it is the platform that hosts the agents. The companies that recognize this and invest in the platform will be the ones running AI infrastructure that the rest of the company can safely build on. The companies that do not will be the ones explaining to their auditor in 2027 why a marketing intern's agent had read-write access to the production customer database.
The good news is that the platform itself is composed of well-understood pieces. Per-agent identity is SPIFFE. Per-agent rate limits and budgets are gateway features. Audit logging at the gateway is an append-only log store. Prompt-injection defense is input/output filtering plus a tool broker. Blast-radius reduction is sandboxing plus network segmentation. None of these are unsolved problems.
The work is in the integration: shipping a platform where all five requirements are enforced uniformly across every agent in the company, where the platform team can answer the auditor's questions in seconds, and where the agent teams can ship a new agent in an afternoon without talking to the platform team. That is what the agentic ops platform is. That is what 2026 is asking platform engineers to build.