Site Reliability Engineering was born at Google to bridge the gap between software engineering and operations. The core insight — treat reliability as a feature, measure it rigorously, and build systems that get safer over time — applies as much to AI systems as it does to web servers. But AI workloads break in ways traditional SRE was never designed to handle.
LLMs hallucinate. Retrieval systems drift. Model outputs vary even when inputs are identical. A database query returns a row; a semantic search returns a probability distribution. The operational assumptions that underpin classical SRE — determinism, reproducible failures, clear root causes — dissolve when you're serving probabilistic models.
This playbook is for SREs and platform engineers who are now responsible for AI infrastructure. It covers the specific failure modes of AI systems, how to define and measure reliability for them, and the operational patterns that keep AI services healthy in production.
How AI Systems Break Differently
Before you can apply SRE to AI, you need to understand what you're dealing with.
Traditional services fail loudly. A web server crashes and returns a 500. A database times out and the error is explicit. Failures are binary and observable.
AI systems fail silently and probabilistically. A retrieval-augmented generation system can return confident, well-formulated answers that are completely wrong. A model can degrade gradually as data drift pulls its embeddings off-distribution. An LLM can begin refusing certain types of requests after a fine-tuning run — not because of a code change, but because of a subtle shift in the model's behavior distribution.
This has three consequences for SRE:
- You need behavioral monitoring, not just infrastructure monitoring. CPU and memory are necessary but insufficient. You need to track output distributions, response quality signals, and retrieval precision over time.
- You need automated evaluation infrastructure. Human spot-checking doesn't scale. You need pipelines that continuously score model outputs against ground truth or reference answers.
- Your incident response playbooks need to account for non-deterministic failures. The same inputs that triggered a hallucination at 2 PM might produce correct output at 2:05. Reproducing AI failures requires capturing not just stack traces but input prompts and context.
Defining SLOs for AI Systems
Service Level Objectives for AI must capture both infrastructure reliability and model behavior reliability.
Infrastructure SLOs
These are familiar territory for any SRE:
- Availability: The service responds within the SLO threshold (e.g., 99.9% of requests return a response within 2 seconds)
- Latency: p50, p95, and p99 response times for inference requests
- Throughput: Successful requests per minute, tokens per second
These are necessary but not sufficient. An LLM can be up and responding within latency thresholds while producing nonsense on 30% of queries.
Model Behavior SLOs
This is where AI SRE diverges from classical SRE.
Task Completion Rate (TCR): The percentage of requests where the model produces a functionally correct, non-harmful response. This requires an automated eval pipeline — a golden dataset of input-output pairs, scored by a reference model or a set of programmatic validators.
Hallucination Rate: The percentage of responses flagged by your detection pipeline as hallucinated or factually unsupported by the retrieved context. This maps to the "correctness" dimension of your model's output quality.
Refusal Rate: The percentage of valid requests the model refuses. A sudden spike in refusals often signals a model behavior change — a fine-tuning run, a prompt drift, or a context overflow issue.
Retrieval Precision: For RAG systems, the percentage of retrieved chunks that are relevant to the query. This is a leading indicator — if retrieval precision drops, your downstream TCR will follow.
Context Utilization: How much of the retrieved context the model actually uses in its answer. Low utilization suggests the model is ignoring retrieved context and relying on parametric knowledge, increasing hallucination risk.
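The behavioral SLOs above can all be derived from a batch of scored eval results. A minimal sketch in Python — the field names are illustrative, and the scoring itself (what counts as correct, hallucinated, or refused) is assumed to come from your eval pipeline:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    # One scored request from the eval pipeline; field names are illustrative.
    correct: bool          # passed functional-correctness validators
    hallucinated: bool     # flagged as unsupported by retrieved context
    refused: bool          # model refused a valid request
    relevant_chunks: int   # retrieved chunks judged relevant to the query
    total_chunks: int      # retrieved chunks overall

def behavioral_slos(results: list[EvalResult]) -> dict[str, float]:
    """Aggregate a batch of scored results into the behavioral SLO metrics."""
    n = len(results)
    chunks = sum(r.total_chunks for r in results)
    return {
        "task_completion_rate": sum(r.correct for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "refusal_rate": sum(r.refused for r in results) / n,
        "retrieval_precision": (
            sum(r.relevant_chunks for r in results) / chunks if chunks else 1.0
        ),
    }
```

Computing all four from the same scored batch keeps the metrics consistent with each other — a TCR drop and a retrieval-precision drop measured over different request samples are hard to correlate during an incident.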
Composite AI Reliability Score
A useful single number for executive reporting and SLO tracking is a weighted composite:
AI Reliability Score = (TCR × 0.4) + (Availability × 0.3) + ((1 − Hallucination Rate) × 0.3)
This gives you a 0-1 score that captures both infrastructure and model quality. Track it over time in your Grafana dashboard alongside your standard SLO burn rates.
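The composite is simple enough to express directly in code, which also makes the weighting explicit for anyone reading the dashboard definition:

```python
def ai_reliability_score(tcr: float, availability: float,
                         hallucination_rate: float) -> float:
    """Weighted composite: TCR 0.4, availability 0.3,
    (1 - hallucination rate) 0.3. All inputs are fractions in [0, 1]."""
    return 0.4 * tcr + 0.3 * availability + 0.3 * (1.0 - hallucination_rate)
```

For example, a service with 95% TCR, 99.9% availability, and a 2% hallucination rate scores roughly 0.97.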
Error Budgets for AI
Error budgets are the central mechanism by which SREs make risk trade-offs. The concept translates directly to AI systems, but the measurement is different.
Setting AI Error Budgets
For a production AI service targeting 99.9% overall AI reliability:
- Monthly AI error budget: 0.1% of requests can produce substandard outputs (as defined by your TCR and hallucination thresholds)
- If you're running at 99.5% AI reliability: a single week at that level burns through more than your entire monthly error budget (0.5% bad requests is 5× the budgeted rate) — investigate immediately
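The budget arithmetic is worth spelling out, since it's easy to misjudge how fast a reliability gap burns through a monthly budget. A sketch:

```python
def budget_consumed(target: float, observed: float,
                    elapsed_fraction: float) -> float:
    """Fraction of the monthly error budget consumed so far.

    target:           SLO target, e.g. 0.999
    observed:         observed reliability over the elapsed window, e.g. 0.995
    elapsed_fraction: fraction of the month elapsed, e.g. 0.25 for one week
    """
    budget = 1.0 - target                         # e.g. 0.001 of the month's requests
    burned = (1.0 - observed) * elapsed_fraction  # bad requests, month-normalized
    return burned / budget
```

Plugging in the numbers above: at 99.5% reliability for one week (a quarter of the month), `budget_consumed(0.999, 0.995, 0.25)` returns 1.25 — the budget is already exhausted.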
Using Error Budgets to Drive Development
Error budget policy should govern how you roll out changes:
- Below 50% error budget consumed: Ship new features, run experiments, fine-tune aggressively
- 50-90% consumed: Slow down, focus on reliability work, pause non-critical experiments
- Above 90% consumed: Feature freeze, all hands on reliability until metrics recover
This is identical to classical SRE error budget policy — the only difference is that you're measuring AI behavior quality alongside infrastructure metrics.
For AI systems, a useful additional policy: gate every model change on a baseline evaluation run. If a new model version causes a statistically significant drop in TCR or spike in hallucination rate, the deployment should be rolled back automatically.
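One way to implement that gate is a one-sided two-proportion z-test on eval pass counts — a sketch, with an illustrative significance threshold (z ≈ 2.33, roughly p < 0.01):

```python
import math

def tcr_regression(baseline_pass: int, baseline_n: int,
                   candidate_pass: int, candidate_n: int,
                   z_threshold: float = 2.33) -> bool:
    """Return True if the candidate model's TCR is significantly lower
    than baseline (one-sided two-proportion z-test)."""
    p1 = baseline_pass / baseline_n
    p2 = candidate_pass / candidate_n
    pooled = (baseline_pass + candidate_pass) / (baseline_n + candidate_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / candidate_n))
    if se == 0:
        return False  # both runs identical and degenerate: nothing to flag
    z = (p1 - p2) / se  # positive when the candidate is worse
    return z > z_threshold
```

Wire this into the deploy pipeline: if it returns True for the candidate's eval run against the current production baseline, block the rollout (or trigger the automatic rollback).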
AI Incident Response Playbook
AI incidents require a specific playbook structure that accounts for probabilistic failures.
When to Declare an AI Incident
Not every bad model output is an incident. Use these triggers:
- Automated eval score drops by more than 5 percentage points from baseline (TCR, hallucination rate, or refusal rate)
- p95 inference latency increases by more than 2× for 10+ minutes
- Retrieval precision drops below your SLO threshold for more than 5 minutes
- Error rate on structured output parsing exceeds 1% (model returning malformed JSON, unexpected formats)
- User-reported hallucination rate spikes (if you have user feedback channels)
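The triggers above (minus the user-feedback one) can be encoded as a single check an alerting job runs against current metrics. A sketch — the metrics source is assumed, and the duration conditions (10+ minutes, 5+ minutes) are left to the alerting layer's `for:` clauses:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    # Current values pulled from the monitoring stack (source assumed).
    tcr: float                     # task completion rate, 0..1
    baseline_tcr: float            # rolling baseline
    p95_latency_s: float
    baseline_p95_latency_s: float
    retrieval_precision: float     # 0..1
    parse_error_rate: float        # structured-output parse failures, 0..1

def incident_triggers(m: Metrics, precision_slo: float = 0.8) -> list[str]:
    """Return the list of tripped AI-incident triggers."""
    tripped = []
    if m.baseline_tcr - m.tcr > 0.05:
        tripped.append("eval score drop > 5 points from baseline")
    if m.p95_latency_s > 2 * m.baseline_p95_latency_s:
        tripped.append("p95 latency > 2x baseline")
    if m.retrieval_precision < precision_slo:
        tripped.append("retrieval precision below SLO")
    if m.parse_error_rate > 0.01:
        tripped.append("structured output error rate > 1%")
    return tripped
```

Returning the full list rather than a boolean matters: a single incident often trips several triggers at once, and the combination is the first scoping signal.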
Incident Response Steps
Phase 1 — Triage (0-5 minutes)
- Check infrastructure metrics first: GPU utilization, memory, inference server health, KV store latency. You want to rule out infrastructure before assuming it's a model behavior issue.
- Pull recent request logs. Identify if there's a pattern: specific time, specific input type, specific model version, specific user segment.
- Run a quick eval on the last 50 requests against your golden dataset. Is the TCR drop isolated to specific query types, or is it broad?
Phase 2 — Scope (5-15 minutes)
- Determine if this is a model behavior regression or an infrastructure degradation. If inference latency is fine but eval scores are down, it's model behavior. If latency is spiking, it's infrastructure.
- For model behavior regressions: check what changed. Model version? Prompt template? Retrieval index update? Fine-tuning run? RAG context window size?
- Check for data distribution shift: are inputs this week different from last week? Check embedding distribution statistics if available.
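A first-pass drift check can be as simple as comparing the input-embedding centroids of the two windows. A sketch — the 0.1 cosine-distance threshold is illustrative and should be calibrated per service:

```python
import math

def centroid(embeddings: list[list[float]]) -> list[float]:
    """Element-wise mean of a batch of embedding vectors."""
    n = len(embeddings)
    return [sum(v[i] for v in embeddings) / n for i in range(len(embeddings[0]))]

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def drift_detected(last_week: list[list[float]],
                   this_week: list[list[float]],
                   threshold: float = 0.1) -> bool:
    """Flag drift when the centroids of the two windows have moved
    apart by more than the (illustrative) cosine-distance threshold."""
    return cosine_distance(centroid(last_week), centroid(this_week)) > threshold
```

A centroid comparison is crude — it misses variance changes and multimodal shifts — but it is cheap enough to run during triage and catches the gross "this week's traffic looks nothing like last week's" case.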
Phase 3 — Mitigate (15-45 minutes)
- If infrastructure: Standard SRE playbook — roll back recent changes, scale up resources, isolate failing components.
- If model behavior:
- Roll back to previous model version if available (maintain versioned model registry)
- Fall back to a smaller, more conservative model if available
- Reduce temperature settings (lower temperature = more deterministic outputs)
- For RAG systems: disable retrieval and fall back to parametric memory only (higher hallucination risk, but more consistent behavior while you debug the retrieval path)
- Increase strict output validation (enforce structured output schemas more aggressively)
- Enable heightened monitoring: increase eval frequency from hourly to per-15-minutes during the incident window.
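The fallback and strict-validation steps combine naturally into a degraded-mode serving path. A sketch — the model clients and the required output keys are assumptions:

```python
import json

def degraded_serve(prompt: str, primary, fallback,
                   required_keys=("answer", "sources")):
    """Incident-mode serving: call the primary model with conservative
    sampling, validate the structured output strictly, and fall back to
    the smaller model on any failure. `primary`/`fallback` are assumed
    to be callables (prompt, temperature) -> str returning JSON text."""
    for model in (primary, fallback):
        raw = model(prompt, temperature=0.0)  # lower temperature: more deterministic
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: try the next model
        if all(k in parsed for k in required_keys):
            return parsed
    raise RuntimeError("all models failed strict output validation")
```

Keeping this path behind a feature flag means mitigation is a config flip rather than a deploy — valuable at minute 20 of an incident.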
Phase 4 — Post-Incident (1-24 hours after resolution)
- Run full eval suite against the incident time window's inputs. Build a regression test case for any failure mode discovered.
- Write the postmortem. For AI incidents, the postmortem should include:
- Model behavior timeline: when did eval scores start degrading, was it gradual or sudden?
- Prompt/context analysis: what was the distribution of inputs during the incident?
- Contributing factors: infrastructure, data drift, model version, prompt changes, fine-tuning
- Detection time: how long between the first eval score drop and the incident declaration?
- Root cause: infrastructure failure, data pipeline issue, model regression, upstream dependency?
- Action items: specific, assignable, time-bound
On-Call Runbook: AI Degradation
Create a runbook document for each distinct AI service. Here's a template for an LLM inference service:
## LLM Inference On-Call Runbook
### Symptom: Elevated Hallucination Rate
1. Check Grafana dashboard — AI Reliability Score, TCR, Hallucination Rate (last 1h)
2. Check model version in production (model registry)
3. Check recent deployments (was a new model version pushed?)
4. Check retrieval system health:
- Embedding service latency
- Vector DB query latency
- Index freshness (last update timestamp)
5. Check input distribution: average token count, language distribution, query type distribution
6. If data drift detected: pause retraining pipeline, revert to last known good index
7. If model regression suspected: initiate model rollback procedure
8. Escalate to ML Platform team if not resolved within 30 minutes
### Symptom: Elevated Latency
1. Check GPU utilization and VRAM usage (dcgm-exporter metrics)
2. Check inference batch size and queue depth
3. Check for noisy neighbor processes on GPU nodes
4. If throughput saturated: scale inference replicas (horizontal)
5. If VRAM pressure: reduce batch size, enable KV cache compression
6. If model size issue: consider model quantization or switching to smaller model variant
7. Escalate if latency exceeds SLO by more than 5× for more than 15 minutes

AI Chaos Engineering
Just as chaos engineering tests infrastructure resilience, you need chaos engineering for AI resilience.
AI Chaos Experiments
Prompt injection simulation: Send adversarial prompts designed to bypass system instructions. Monitor whether your guardrails catch them and whether model outputs leak sensitive information.
Context overflow testing: Send inputs at the maximum context window length. Verify that the system handles it gracefully — truncates, returns a clear error, or degrades predictably rather than returning garbage.
Retrieval failure testing: For RAG systems, simulate vector DB failures or index corruption. Verify that the system falls back to parametric memory (with appropriate confidence warnings) or returns a graceful error rather than hallucinating.
Model version rollback testing: Practice rolling back to a previous model version in staging. Verify that your evaluation pipeline catches behavioral regressions before they hit production.
Latency injection: Artificially increase inference latency to simulate GPU contention. Verify that your timeout and retry logic works correctly and that upstream services handle degraded latency gracefully.
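Latency injection can be as small as a wrapper around the inference client — a sketch, with delay and jitter values that are assumptions you'd tune per experiment:

```python
import random
import time

def with_injected_latency(infer, base_delay_s=0.5, jitter_s=0.5, rng=None):
    """Wrap an inference callable so every call is delayed by
    base_delay_s plus uniform jitter, simulating GPU contention."""
    rng = rng or random.Random()
    def wrapped(*args, **kwargs):
        time.sleep(base_delay_s + rng.uniform(0, jitter_s))
        return infer(*args, **kwargs)
    return wrapped
```

Because the wrapper is transparent to callers, you can enable it per-replica or per-request-percentage and watch whether upstream timeouts, retries, and circuit breakers behave as designed.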
Running AI Chaos Experiments
- Run experiments in staging first, never in production without a rollback plan
- Establish a "game days" cadence — quarterly at minimum, monthly for high-stakes AI services
- Track experiment results: what broke, how long detection took, how long recovery took
- Automate the experiments that are safe to automate (prompt injection, latency injection) into your CI/CD pipeline
Monitoring Stack for AI SRE
To operationalize everything in this playbook, you need the right monitoring infrastructure:
Prometheus + Grafana for infrastructure metrics: GPU utilization, VRAM, inference latency percentiles, request throughput, error rates. The open-source LLM monitoring stack from our previous article provides a solid foundation.
OpenTelemetry for distributed tracing: instrument your inference requests end-to-end. Capture prompt, model version, retrieval context, and response metadata as span attributes. This is essential for incident investigation — you need to replay exact inputs.
Evaluation pipeline for behavioral metrics: run a golden dataset eval on a schedule (hourly in production, continuous in staging). Alert on TCR drops and hallucination rate spikes. Tools like Arize Phoenix provide built-in eval pipelines, or build your own with open-source evaluation frameworks.
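The scheduled eval loop itself is straightforward. A sketch — `run_eval` and `alert` are assumed hooks into your eval pipeline and paging system, and the thresholds are illustrative:

```python
import time

def eval_loop(run_eval, alert, tcr_floor=0.9, halluc_ceiling=0.05,
              interval_s=3600, cycles=None):
    """Run the golden-dataset eval on a fixed schedule and page when
    behavioral SLOs are breached. `run_eval` is assumed to return
    (tcr, hallucination_rate); `alert` sends the page."""
    n = 0
    while cycles is None or n < cycles:
        tcr, halluc = run_eval()
        if tcr < tcr_floor:
            alert(f"TCR {tcr:.3f} below floor {tcr_floor}")
        if halluc > halluc_ceiling:
            alert(f"hallucination rate {halluc:.3f} above ceiling {halluc_ceiling}")
        n += 1
        if cycles is None or n < cycles:
            time.sleep(interval_s)
```

In practice you'd run this as a cron job or Kubernetes CronJob rather than a long-lived loop, but the alert conditions are the same either way.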
For teams running structured eval pipelines at scale, Weights & Biases Weave provides automatic trace instrumentation, dataset versioning, and eval result tracking — with direct integration into CI/CD for regression detection before deployments ship.
Structured logging for AI requests: log not just request/response metadata but also:
- Token counts (input and output)
- Model version and temperature
- Top-level sampling probabilities (logprobs)
- Whether retrieval was used and how many chunks were retrieved
- Output schema validation results (for structured outputs)
This data is invaluable in postmortems. Store it in object storage (S3-compatible) with a 30-day hot retention period for fast querying.
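A single request's log record, covering the fields listed above, might be built like this — a sketch using stdlib JSON, with illustrative field names:

```python
import json
import time

def ai_request_log(prompt_tokens, output_tokens, model_version, temperature,
                   retrieval_used, chunks_retrieved, schema_valid, logprobs=None):
    """Build one structured log line for an AI request. Emit via your
    logger or ship straight to object storage."""
    record = {
        "ts": time.time(),
        "prompt_tokens": prompt_tokens,
        "output_tokens": output_tokens,
        "model_version": model_version,
        "temperature": temperature,
        "retrieval_used": retrieval_used,
        "chunks_retrieved": chunks_retrieved,
        "schema_valid": schema_valid,
        "logprobs": logprobs,  # top-level sampling probabilities, if captured
    }
    return json.dumps(record)
```

One JSON object per request, one line per object: that shape queries cleanly from S3 with Athena-style engines and loads directly into most log pipelines.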
Building AI Reliability Culture
The tooling and processes above only work if your organization treats AI reliability as a shared responsibility — not just the ML team's problem.
SREs should understand AI failure modes. You don't need to train as an ML researcher, but you need to understand why hallucination happens, what affects retrieval precision, and how model temperature settings change output distribution.
ML engineers should understand SLOs. Model development teams need to internalize that a model isn't production-ready until it's been evaluated against an SLO. Define behavioral SLOs before the model ships, not after.
Run joint incident reviews. When an AI incident occurs, include both SREs and ML engineers in the postmortem. The SRE brings infrastructure analysis; the ML engineer brings model behavior analysis. Together they build a complete picture.
Invest in observability before scaling. Every AI service should have behavioral monitoring in place before it goes to production. It's much harder to add eval infrastructure after users are affected.
The Bottom Line
SRE for AI systems is harder than classical SRE. The systems are probabilistic, the failure modes are less predictable, and the operational intuition that SREs have built over decades doesn't fully transfer. But the core principles are timeless: define what reliable means, measure it rigorously, build systems that degrade gracefully, and treat every incident as an opportunity to get safer.
The AI systems you're running today will be more reliable tomorrow if you apply these practices systematically. Start with SLOs — define what good looks like for your specific use case. Build your eval pipeline. Write your runbooks. Run your chaos experiments. And when the incident comes, investigate the model behavior as carefully as you'd investigate the infrastructure.
That's the practical playbook for SRE in the AI era.