Running an LLM-powered application in production without observability is like flying a commercial aircraft without instruments. You might get where you are going, but when something goes wrong - and in AI systems, something will go wrong - you will have no way to diagnose it, no data to fix it, and no warning before it happens again.

The stakes are concrete. A RAG chatbot that starts returning irrelevant context will silently degrade user trust. A model endpoint that begins hallucinating facts will generate incorrect code, bad medical advice, or fabricated legal citations - damage that compounds before anyone notices. A prompt injection attack may be executing right now inside your inference pipeline, and without trace-level visibility, you would not know until damage is done.

This guide is the comprehensive, implementation-focused resource for building LLM observability into your production AI stack. It covers what to measure, how to instrument it with OpenTelemetry, which signals matter at each layer of your architecture, and how to assemble those signals into actionable dashboards and alerts. By the end, you will have a complete blueprint for a monitoring stack that works for a solo deployment or a high-throughput multi-tenant AI product.

What Is LLM Observability?

Observability is the property of a system that lets you understand its internal state from its external outputs. In traditional software, that means metrics (counters, gauges, histograms), traces (causally linked records of a request as it moves through components), and logs (timestamped event records). For LLM applications, the definition extends: you also need visibility into what is being sent to the model, what comes back, and whether that output is correct.

LLM observability is harder than traditional API observability for three reasons. First, language model calls are non-deterministic - the same prompt can produce different outputs depending on context window state, temperature settings, and model version. Second, outputs are high-cardinality and semantically rich - a 500-token response is harder to summarize than a numeric status code. Third, failures in LLM systems do not look like failures in traditional software. There is no HTTP 500 for a hallucination. The model will respond confidently with wrong information, and only deliberate instrumentation can distinguish that from correct output.

The goal of an LLM observability implementation is to capture enough signal at each stage of your inference pipeline that you can answer four questions without digging through raw logs: Is the model working? Is it working correctly? Is it working efficiently? And is it working safely?

The 8 Critical Signals for LLM Observability

Effective LLM observability is built on eight distinct signal categories. Each maps to a specific failure mode or performance dimension. A complete implementation covers all eight, though the depth of instrumentation you apply to each depends on your use case and risk tolerance.

Signal 1: Request Volume and Throughput

The most basic signal: how many requests are hitting your inference endpoint per minute, hour, or day. Tracking request volume tells you whether traffic is behaving normally, whether a traffic spike is causing latency degradation, and whether usage patterns have changed in ways that might affect cost. Break this down by model, by endpoint, and by user or API key if you are running a multi-tenant system.

Signal 2: Latency at Every Stage

LLM latency is not a single number. A complete latency picture requires measuring three components separately. Time to First Token (TTFT) measures the delay between request submission and the first generated token - dominated by prompt prefill. Time Per Output Token (TPOT) measures the average latency between successive tokens during generation. Total End-to-End Latency is the complete request duration from client submission to final token delivery.

Each component points to a different class of problems. A TTFT spike suggests a prompt preprocessing bottleneck or GPU availability issues. A climbing TPOT indicates compute pressure or KV cache misses. For a complete treatment of LLM latency monitoring, see our LLM latency monitoring guide, which covers TTFT, TPOT, and the Grafana dashboards that surface them.

Signal 3: Token Consumption and Cost

Every LLM API call consumes a billable quantity of input tokens and output tokens. Tracking token consumption at request-level granularity lets you understand cost per feature, cost per user, and cost per model. This is essential for FinOps-aware routing decisions and for identifying abnormal consumption patterns that might indicate prompt abuse or an infinite loop in your application logic.

At minimum, emit a metric per request capturing input token count, output token count, and estimated cost. Aggregate by model version, by user cohort, and by time window to build a cost attribution model that your finance and engineering teams can act on.
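As a sketch, per-request cost accounting can be as small as the following. The price table is an illustrative assumption, not official provider rates, and the in-memory aggregation stands in for the metric counter you would emit in production:

```python
# Minimal per-request cost attribution sketch.
# PRICES is an ILLUSTRATIVE assumption - substitute your provider's current rates.
from collections import defaultdict

PRICES = {"gpt-4o": {"input": 0.0025, "output": 0.01}}  # $/1K tokens, assumed

cost_by_model = defaultdict(float)  # in production, emit an OTel counter instead

def record_usage(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute estimated cost for one request and accumulate it by model."""
    p = PRICES[model]
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1000
    cost_by_model[model] += cost
    return cost
```

Aggregating the same numbers by user cohort or time window is a matter of adding those labels to the accumulation key.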

Signal 4: Error Rates and Failure Modes

LLM errors are not just API failures. A model can return a malformed response, generate a refusal that breaks your application logic, or produce output that fails your structured output schema. Instrument error rates at multiple levels: API-level errors (rate limit 429s, auth failures, upstream timeouts), application-level errors (Pydantic validation failures on structured output, JSON decode errors), and quality-level failures (outputs that are too short, too long, or flagged by a lightweight quality check).

Signal 5: Output Quality and Hallucination Detection

This is the hardest signal to instrument well. Hallucinations - outputs that are confident, coherent, and wrong - are the defining reliability challenge of production AI. They do not produce error codes. They require active detection mechanisms.

The practical approach to hallucination detection combines three techniques. Semantic similarity scoring measures how closely the generated response aligns with retrieved context in a RAG pipeline - low similarity is a leading indicator of hallucination. Ground-truth comparison uses a reference answer (where available) to flag responses that diverge significantly from expected output. Structured output validation applies a schema to model responses and flags any output that fails to conform.
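A minimal sketch of the similarity check follows, with a bag-of-words vector standing in for a real embedding model so the logic is self-contained. In production you would embed with your RAG pipeline's embedding model, and the 0.6 threshold is an illustrative assumption to tune against your own data:

```python
# Semantic-similarity flagging sketch. embed() is a STAND-IN for a real
# embedding model; only the scoring/thresholding logic carries over.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # placeholder for a real encoder

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_possible_hallucination(response: str, retrieved_context: str,
                                threshold: float = 0.6) -> bool:
    """Low similarity to retrieved context is a leading hallucination signal."""
    return cosine(embed(response), embed(retrieved_context)) < threshold
```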

For a complete treatment of hallucination monitoring in production, see our hallucination monitoring guide which covers evaluation frameworks, semantic embedding pipelines, and the alerting thresholds that catch regressions before they reach users.

Signal 6: Prompt and Response Tracking

Every LLM call carries a prompt - often containing user-provided input, retrieved context, or system instructions - and produces a response. Storing these pairs is essential for debugging quality regressions, reproducing user-reported issues, and building evaluation datasets when you discover failures after the fact.

The practical implementation is a sampled trace store: capture a representative subset of prompt/response pairs (for example, 1% of traffic plus all flagged errors) in object storage with a reference ID that links back to your trace spans. This gives you the debugging data you need without the cost of storing every single call.
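The sampling decision itself can be made deterministic so that retries and replays of the same request agree. A sketch, with the object-storage write left out as it depends on your stack:

```python
# Deterministic "1% plus all flagged errors" sampling decision sketch.
import hashlib

SAMPLE_RATE = 0.01  # keep 1% of successful traffic

def should_store(request_id: str, flagged_error: bool) -> bool:
    if flagged_error:
        return True  # always keep flagged failures
    # Hash-based bucketing: the same request ID always lands in the
    # same bucket, so the decision is reproducible.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < SAMPLE_RATE * 10_000
```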

Signal 7: Retrieval Quality in RAG Systems

If your application uses retrieval-augmented generation (RAG), the quality of your retrieval directly determines the quality of your outputs. Monitoring retrieval quality means measuring two things: the relevance score of returned chunks and the coverage of the context window relative to what the user asked.

Low relevance scores - chunks returned by the vector store that do not directly address the query - are a leading indicator of RAG pipeline degradation. This can happen when your embedding model falls out of sync with your chunking strategy, when your vector database index is stale, or when user queries drift from the distribution your embedding model was trained on.

Our RAG observability guide covers retrieval quality metrics in depth, including the signal schema for chunk relevance tracking and the dashboard patterns that surface retrieval regressions early.

Signal 8: Security and Abuse Signals

LLM applications are targets for prompt injection, data exfiltration, and abuse. Security observability for AI systems means monitoring for anomalous patterns that suggest an attack: unusually long prompts (potential injection payload), repeated requests for the same content (scraping), prompts that contain known malicious instruction patterns, and responses that include PII the model should not have access to.

These signals overlap significantly with the hallucination detection stack - a response that diverges sharply from retrieved context might be a hallucination, or it might be the result of an instruction buried in a prompt injection attack. See our LLM security hardening guide for the monitoring patterns that catch both categories of risk.


OpenTelemetry: The Instrumentation Foundation

OpenTelemetry (OTel) is the open standard that ties a production observability stack together. It provides vendor-neutral APIs, SDKs, and collectors for collecting traces, metrics, and logs - the three pillars of observable systems. For LLM applications, OpenTelemetry is the right choice because it decouples your instrumentation from your observability backend, meaning you can start with a local development setup and migrate to a production backend (Grafana, Datadog, Honeycomb, or any other OTel-compatible platform) without changing your application code.

LLM Instrumentation with OpenTelemetry Spans

The core unit of OTel tracing is the span - a record of an operation with a start time, end time, attributes, and optional events. An LLM request maps naturally to a span hierarchy:

  • Root span: The entire request lifecycle from your API endpoint receiving the call to the final response being returned
  • Preprocessing span: Embedding generation, retrieval, context assembly
  • Inference span: The actual LLM API call
  • Postprocessing span: Response parsing, structured output validation, formatting

Each span should carry attributes that make it queryable: model name and version, input token count, output token count, temperature setting, latency measurements (TTFT, TPOT if available), and any custom metadata that identifies the user, session, or feature flag context.

Here is the minimal span attribute schema for a production LLM inference span:

span.set_attribute("llm.model", "gpt-4o")
span.set_attribute("llm.version", "2026-03-15")
span.set_attribute("llm.input_tokens", 487)
span.set_attribute("llm.output_tokens", 312)
span.set_attribute("llm.temperature", 0.7)
span.set_attribute("llm.latency.ttft_ms", 420)
span.set_attribute("llm.latency.tpot_ms", 28)
span.set_attribute("llm.cost.estimated", 0.0234)
span.set_attribute("user.id", "usr_abc123")
span.set_attribute("request.id", "req_xyz789")

Metrics with the OTel SDK

Spans give you per-request detail. Metrics give you aggregated visibility. The OTel SDK lets you emit three metric types that map directly to the eight signals above:

Counters for additive values: request count, token count, error count. A counter for llm.requests.total with labels for model and status code gives you error rate. A counter for llm.tokens.total with labels for input/output and model gives you cost attribution.

Histograms for distributions: TTFT, TPOT, end-to-end latency. Use histograms with appropriate bucket boundaries (for TTFT: 100ms, 250ms, 500ms, 1s, 2s, 5s; for TPOT: 10ms, 25ms, 50ms, 100ms, 200ms) so your p50, p95, and p99 percentiles are meaningful.

Gauges for point-in-time values: current queue depth, GPU memory utilization, active request count. Gauge values only persist until the next scrape, so use them for values that change rapidly and need to be sampled rather than accumulated.

The Collector Pipeline

The OTel Collector is a middleware process that receives telemetry data from your application and exports it to your observability backend. The collector architecture is three-part: receivers (how data enters), processors (how data is transformed or filtered), and exporters (where data goes). For an LLM observability stack, the collector pipeline typically looks like this:

Your application SDK sends traces and metrics to the collector via OTLP (the OTel native transport protocol) over gRPC or HTTP. The collector batches and serializes this data and exports it to your backend. For a Grafana-based stack, the exporter is OTLP over gRPC to Grafana Tempo (traces) and Prometheus (metrics). For a cloud-native setup, you might export to AWS CloudWatch, Google Cloud Monitoring, or Azure Monitor.

The collector is also where you apply sampling. For high-throughput LLM applications, capturing every span at full fidelity is cost-prohibitive. Apply tail-based sampling at the collector: capture 100% of errors and flagged quality failures, 1-5% of successful requests. This gives you enough data to debug issues without overwhelming your storage or incurring massive observability costs.
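As a sketch, that policy can be expressed with the collector's tail_sampling processor along these lines. Policy names and the decision wait are illustrative; verify the fields against the processor documentation for your collector version:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding on the trace
    policies:
      - name: keep-errors         # 100% of errored traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: baseline            # ~5% of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```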


Building the Monitoring Stack

Instrumentation produces data. A monitoring stack organizes that data into signals that your team can act on. The recommended stack for LLM observability combines open source tools that are well-integrated and avoid per-token pricing: Prometheus for metrics collection, Grafana for visualization and alerting, Loki for log aggregation, and Tempo for distributed tracing.

Prometheus: Metrics Collection

If you are self-hosting inference engines like vLLM, OpenLLM, or TGI, those servers expose Prometheus metrics endpoints natively. You do not need an SDK to get GPU utilization, queue depth, TTFT histograms, and token throughput - the server emits them and Prometheus scrapes them on a defined interval.
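A minimal scrape job for a self-hosted engine might look like the following. The target address assumes vLLM's default server port, so adjust for your deployment:

```yaml
scrape_configs:
  - job_name: vllm
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]   # vLLM's OpenAI-compatible server port (assumed default)
```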

For model API calls made to external providers (OpenAI, Anthropic, Azure OpenAI), instrument those calls with the OTel SDK and route the resulting metrics to Prometheus via the OTLP receiver. This gives you a unified metrics namespace where infrastructure signals (GPU utilization) and application signals (token consumption) live in the same queries.

The vLLM production monitoring guide has the complete Prometheus scrape configuration and the specific metric names vLLM exposes for TTFT, TPOT, queue depth, and cache hit rates.

Grafana: Dashboards and Alerting

Grafana connects to your Prometheus, Loki, and Tempo instances and provides the visualization layer. A production LLM observability dashboard should have four panels:

The Request Overview panel shows request volume, error rate, and latency percentiles (p50, p95, p99) over time. Anomalies in this panel trigger investigation into the other three panels.

The Latency Breakdown panel decomposes request latency into TTFT, inference time, and postprocessing time using the histogram buckets you defined in your instrumentation. This tells you which phase of the pipeline is the bottleneck.

The Cost Attribution panel shows cumulative cost by model, by user cohort, and by day. This is your FinOps signal - unexpected cost spikes trigger investigation into token consumption anomalies.

The Quality Signals panel shows hallucination detection rates, structured output validation failure rates, and retrieval quality scores (if you are running RAG). These metrics trend more slowly than latency or cost but are the most important for catching production incidents that do not show up in traditional error rate metrics.
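Assuming the OTel metric names described earlier land in Prometheus in their usual underscore-translated form (an assumption worth verifying against your exporter's naming), the Request Overview panel might be driven by queries like:

```promql
# 5-minute error rate, by model
sum by (llm_model) (rate(llm_requests_total{status!="ok"}[5m]))
  / sum by (llm_model) (rate(llm_requests_total[5m]))

# p99 time-to-first-token from the histogram buckets
histogram_quantile(0.99, sum by (le) (rate(llm_latency_ttft_ms_bucket[5m])))
```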

The open-source LLM monitoring stack guide covers the complete Grafana dashboard configuration including panel definitions, PromQL queries, and alerting rules.

Tempo: Distributed Tracing

Prometheus metrics tell you that something is wrong. Traces tell you where. For LLM applications with complex request pipelines - multiple retrieval steps, multi-model routing, chain-of-thought generation - Tempo traces let you navigate from an alerting symptom (e.g., p99 latency spike) to the specific span that caused it.

Configure your OTel SDK to export traces to Tempo via OTLP. In Tempo, you can search spans by any attribute (model name, user ID, request ID) and navigate the full trace tree for a problematic request. Tempo's trace view shows you the latency waterfall and lets you identify which step in a multi-stage pipeline is the bottleneck.

Commercial Platforms vs. Open Source: Making the Choice

The instrumentation described above - OpenTelemetry SDK, OTel Collector, Prometheus, Grafana, Tempo - is the open source path. It gives you complete control over your data, no per-token pricing, and full customization. The cost is setup time: expect 4-8 hours to get a production-ready deployment.

Commercial LLM observability platforms - Helicone, Portkey, and LangSmith - offer a faster path by handling the collection, storage, and visualization layer as a managed service. You instrument with their SDK or route through their proxy, and the platform handles the rest.

For a detailed comparison of these platforms including their instrumentation models, pricing, and tradeoffs, see our LLM observability tools comparison.

The decision framework is straightforward. If data privacy is a hard requirement (your prompts contain PII or proprietary code), if you are processing tens of billions of tokens per month (commercial per-token fees become prohibitive), or if you need deep customization of your observability signals - start with the open source stack. If you need to get observability running in hours rather than days and you are comfortable with a managed service, the commercial platforms are the right call.

Hybrid: Open Source Instrumentation, Commercial Storage

A practical middle ground used by many production teams: instrument with OpenTelemetry (giving you vendor-neutral telemetry) and export to a commercial platform that accepts OTLP. Portkey and Helicone both support OTLP ingestion, meaning you can run the open source OTel SDK in your application and route traces and metrics to a commercial backend without changing your instrumentation code.


Alerting for LLM Systems

Dashboards show you the present and the past. Alerts wake you up when something needs attention right now. LLM alerting requires a different philosophy than traditional API alerting because the failure modes are different. Here are the alert rules that matter for production LLM systems:

  • Error rate spike: Alert when the 5-minute error rate exceeds 1% (or your service's baseline). Include a breakdown by error type in the alert payload so the on-call engineer knows whether this is an auth issue, a rate limit, or a model-side failure.
  • Latency regression: Alert when p99 latency exceeds 2x your baseline for the current time window. LLM latency is workload-dependent, so compare to the same hour of the previous week rather than a static threshold.
  • Cost anomaly: Alert when hourly token spend exceeds 3x the rolling average. A sudden cost spike often indicates an application bug (stuck loop, missing stop condition) or an abuse pattern.
  • Quality regression: Alert when your hallucination detection system flags a response with confidence above a threshold (e.g., 0.9) that diverges from retrieved context below a similarity threshold (e.g., 0.6). This catches the silent failures that do not show up in traditional error metrics.
  • Retrieval quality drop: Alert when average chunk relevance score in your RAG pipeline drops below your established baseline. Retrieval quality degradation is a leading indicator of output quality degradation that precedes user complaints by hours or days.
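As a sketch, the error-rate and cost rules translate into Prometheus alerting rules along these lines. The metric names and thresholds are assumptions to adapt to your own namespace:

```yaml
groups:
  - name: llm-alerts
    rules:
      - alert: LLMErrorRateSpike
        expr: |
          sum(rate(llm_requests_total{status!="ok"}[5m]))
            / sum(rate(llm_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
      - alert: LLMCostAnomaly
        # hourly spend vs. 3x the trailing 24h hourly average
        expr: |
          sum(increase(llm_cost_total[1h]))
            > 3 * (sum(increase(llm_cost_total[24h])) / 24)
        for: 15m
        labels:
          severity: warn
```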

Continuous Evaluation in Production

Static alerting catches acute incidents. Continuous evaluation catches gradual regressions that slip under alert thresholds but accumulate into user-visible degradation. The practice is to run a sample of production traffic through your evaluation pipeline alongside live requests - a shadow mode where you capture both the live response and the evaluation result without affecting the live response.

Build an evaluation dataset from production by sampling requests with diverse inputs, store the prompt and response pairs, and periodically re-run your evaluation framework against those stored pairs. Compare evaluation scores week-over-week: if your BLEU score, RAGAS metric, or custom quality signal trends downward, you have a regression that needs investigation before it affects a larger fraction of users.
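The week-over-week comparison reduces to a small check over stored scores. A sketch, where the scores come from whatever evaluation framework you run and the tolerance is an illustrative assumption:

```python
# Week-over-week quality regression check sketch. Scores are whatever your
# evaluation framework emits (RAGAS, BLEU, a custom judge) in [0, 1].
from statistics import mean

def quality_regressed(last_week: list[float], this_week: list[float],
                      tolerance: float = 0.05) -> bool:
    """Flag when the mean evaluation score drops by more than `tolerance`."""
    return mean(this_week) < mean(last_week) - tolerance
```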

LangSmith, Arize, and WhyLabs all provide production evaluation pipelines as managed services. For teams building on the open source stack, the evaluation framework from the open-source LLM monitoring stack can be extended with a scheduled job that re-evaluates stored production traces against your quality signals.

What to Instrument First: A Prioritized Approach

If you are building this from scratch and you need to prioritize, here is the order that maximizes early signal for the least instrumentation effort:

Week 1: Latency and Error Rates. Instrument your inference calls with spans and emit latency histograms and error counters. Point your inference server's Prometheus endpoint at Grafana. You now have the baseline for every other measurement - you know what normal looks like.

Week 2: Token Tracking and Cost. Add input and output token counts to your spans and emit counters by model. Build the cost attribution dashboard. You now know what you are spending and on what.

Week 3: Prompt and Response Sampling. Add a sampled trace store for prompt/response pairs. Even 1% sampling gives you a debugging dataset. Link traces to errors so an on-call engineer can navigate from an error alert to the exact prompt that caused it.

Week 4: Quality Signals. Implement semantic similarity scoring for RAG pipelines, structured output validation for formatted responses, and hallucination detection for factual outputs. Connect these to alerting so quality regressions wake someone up.

By the end of month one, you have complete coverage of the eight signals, a working dashboard, and alerting rules. You can iterate on depth from there.

The Monitoring Stack Across Team Sizes

The stack described in this guide scales from a solo developer to a large engineering team. The components are the same; the operational overhead is what changes.

A solo developer or small team (2-5 engineers) should use the managed commercial platforms for the fastest path. Helicone for request-level visibility, Portkey for multi-model routing with observability, and a lightweight Grafana dashboard for the signals you want to own. The goal is to get observability running before your first production incident, not to build the perfect infrastructure.

A mid-sized team (5-20 engineers) running production AI should own their OpenTelemetry instrumentation and run the open source stack. The setup cost (4-8 hours) pays back in months of operational clarity. Use a managed Grafana Cloud or AWS Managed Grafana to avoid operating the Grafana server itself.

A large team (20+ engineers) with a dedicated platform or ML infrastructure team should build a centralized observability platform that teams can self-serve. Instrument with OTel, route all traces and metrics to a central OTel Collector cluster, and expose shared Grafana dashboards and an alert manager. Enforce a consistent span attribute schema across teams so that any engineer can query any team's LLM traffic by model, user cohort, or feature flag.

Conclusion

LLM observability is not optional for production AI systems. The failure modes - hallucinations, latency spikes, cost anomalies, prompt injection - are real, they happen, and without observable instrumentation you will discover them the hard way: through user complaints, through billing surprises, or through incidents that compound before anyone has context to debug them.

The eight signals in this guide - request volume, latency at every stage, token consumption, error rates, output quality, prompt and response tracking, retrieval quality, and security signals - cover the full surface area of a production LLM application. OpenTelemetry provides the vendor-neutral instrumentation layer that lets you implement them without committing to a single observability backend.

Start with latency and error rates. Add cost tracking. Then layer in quality signals as your system matures. The investment compounds: every signal you add makes the next layer more actionable, and the debugging data you collect during normal operation becomes the foundation for continuous evaluation, regression detection, and the trust that your AI system is doing what you designed it to do.

For deeper dives into specific dimensions of LLM observability, continue with our guides on LLM latency monitoring, hallucination detection, RAG observability, LLM security hardening, and the open-source LLM monitoring stack.