Conventional software monitoring tells you when a service is down or slow. AI/ML pipelines are different: the service can be up, responding quickly, and still producing wrong answers. The failure mode is qualitative, not quantitative - and that requires a fundamentally different approach to observability.

The RAG Pipeline Failure Modes

Retrieval-Augmented Generation pipelines have four distinct failure points, each requiring different instrumentation:

Embedding Generation: If your embedding model produces degraded embeddings - due to model staleness, quantization artifacts, or a quietly changed preprocessing pipeline - retrieval will return worse context with no error raised. Monitor the distribution of embedding vector norms: a sudden drop in the average norm is a strong degradation signal. Note this only works for models that do not L2-normalize their outputs; normalized embeddings all have norm 1, so for those you must monitor the raw component distributions instead.
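A minimal sketch of that norm check, with plain Python standing in for a real metrics pipeline; the function names and the 15% threshold are illustrative, not prescriptive:

```python
from statistics import mean

def norm(vec):
    """Euclidean norm of an embedding vector."""
    return sum(x * x for x in vec) ** 0.5

def norm_drop_alert(baseline_norms, current_norms, threshold=0.15):
    """Alert when the mean embedding norm drops more than `threshold`
    (as a fraction) below the baseline window. Only meaningful for
    models that do not L2-normalize their outputs."""
    baseline = mean(baseline_norms)
    current = mean(current_norms)
    return (baseline - current) / baseline > threshold
```

In production the two windows would come from a metrics store (e.g. yesterday's norms vs. the last hour's), not from in-memory lists.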

Vector Store Retrieval: Your vector database (Pinecone, Qdrant, Weaviate) can have silent data corruption, index staleness after updates, or query planning regressions. Track query latency p99, retrieval count per query, and the cosine similarity distribution of returned results. If your average top-1 similarity drops by more than 10% week-over-week on your evaluation set, something has changed in your retrieval pipeline.
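The week-over-week check above can be sketched as follows; the cosine helper and the 10% threshold mirror the rule in the text, and the score lists stand in for wherever you log evaluation-set similarities:

```python
def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieval_regressed(prev_top1_sims, curr_top1_sims, max_drop=0.10):
    """Flag a more-than-10% week-over-week drop in average top-1
    similarity on the fixed evaluation set."""
    prev = sum(prev_top1_sims) / len(prev_top1_sims)
    curr = sum(curr_top1_sims) / len(curr_top1_sims)
    return (prev - curr) / prev > max_drop
```

The key design choice is comparing on a fixed evaluation set: comparing raw production similarities week-over-week confounds retrieval regressions with shifts in the query mix.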

Context Assembly: When you assemble retrieved chunks into a prompt, the order and overlap matter significantly. Monitor the token count distribution of assembled contexts - if you are regularly hitting context window limits, you are silently dropping lower-ranked chunks that may be relevant.
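A sketch of that assembly step, instrumented to count silently dropped chunks; the whitespace token counter is a crude stand-in for a real tokenizer such as tiktoken:

```python
def assemble_context(chunks, token_limit, count_tokens=lambda s: len(s.split())):
    """Greedily pack chunks (ordered best-first) into the context
    window, reporting how many lower-ranked chunks were silently
    dropped - the number worth alerting on."""
    kept, used, dropped = [], 0, 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n <= token_limit:
            kept.append(chunk)
            used += n
        else:
            dropped += 1
    return kept, used, dropped
```

Emitting `used` and `dropped` as metrics per request gives you exactly the token-count distribution and drop rate the text recommends watching.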

Generation Quality: This is the hardest to monitor. Automated evaluation with LLM-as-a-judge (using a second model to score the first model's outputs) is the practical approach. Route 5% of production traffic through your evaluation pipeline and track rolling quality scores. Alert on a 5-point drop in average quality score over a 24-hour window.
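A sketch of that sampling-plus-rolling-score loop; `judge` is a hypothetical callable wrapping the LLM-as-a-judge call and returning a 0-100 score:

```python
import random
from collections import deque

class QualityMonitor:
    """Route a fraction of traffic through an automated judge and
    alert on a drop in the rolling mean quality score."""

    def __init__(self, judge, sample_rate=0.05, window=200):
        self.judge = judge            # hypothetical: (question, answer) -> 0-100
        self.sample_rate = sample_rate
        self.scores = deque(maxlen=window)

    def observe(self, question, answer):
        """Sample ~5% of production requests for evaluation."""
        if random.random() < self.sample_rate:
            self.scores.append(self.judge(question, answer))

    def should_alert(self, baseline, drop=5.0):
        """Alert when the rolling mean falls `drop` points below baseline."""
        if not self.scores:
            return False
        return baseline - sum(self.scores) / len(self.scores) > drop
```

In a real deployment `observe` would enqueue the pair for asynchronous scoring rather than calling the judge inline on the request path.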

Vector Database Metrics You Must Track

For any vector database deployment, these metrics are non-negotiable: query latency (p50, p95, p99), index memory usage vs. allocated quota, number of collections and their record counts, connection pool utilization, and disk I/O wait times. For Qdrant specifically, HNSW recall is invisible in standard metrics: periodically re-run a sample of production queries with exact (brute-force) search enabled and compare the result sets against the approximate ones - a widening gap means your HNSW index parameters need retuning.
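The latency percentiles above can be computed with a simple nearest-rank sketch; production systems typically use Prometheus histogram buckets instead, and the sample values here are illustrative:

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in 0-100) over raw latency samples."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

# One scrape window of query latencies in milliseconds (illustrative).
latencies_ms = [12, 15, 11, 200, 14, 13, 16, 18, 15, 950]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Note how two outliers dominate p95 and p99 while p50 stays flat - which is exactly why you track all three rather than an average.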

If you are running Pinecone, their managed service exposes metrics for pod-based indexes in Prometheus format. Index fullness is the primary signal for capacity planning - if fullness is consistently above 80% during peak hours, scale up with more or larger pods. Note that replicas add query throughput, not storage capacity, so a fullness problem is not solved by increasing the replica count.
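As a sketch, the capacity check reduces to a threshold on a fullness ratio; the dict here mimics the shape of a Pinecone describe_index_stats response (where index_fullness is a 0-1 ratio), kept as plain Python so the example is self-contained:

```python
def needs_scale_up(index_stats, threshold=0.8):
    """Return True when the index is over the capacity-planning
    threshold. `index_stats` stands in for a stats response whose
    index_fullness field is a 0-1 ratio."""
    return index_stats.get("index_fullness", 0.0) > threshold
```

Run this against a stats snapshot on a schedule and page when it stays true across peak-hour samples, not on a single spike.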

LLM API Observability

When you are using external LLM APIs (OpenAI, Anthropic, Google), you lose visibility into the inference layer but gain reliability guarantees. The observability focus shifts to: API latency distribution (separate p50, p95, p99 for first token vs. full completion), token usage tracking for cost attribution, error rate by error code (rate limits require different handling than API errors), and fallback chain behavior when your primary model is unavailable.
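The fallback-chain behavior can be sketched like this - `call`, `RateLimitError`, and `ApiError` are hypothetical stand-ins for your client wrapper and its error types (real SDKs expose their own exception classes):

```python
import time

class RateLimitError(Exception):
    """Stand-in for an SDK's rate-limit exception."""

class ApiError(Exception):
    """Stand-in for an SDK's server-side error."""

def complete_with_fallback(prompt, models, call, max_retries=3):
    """Try each model in order: back off and retry on rate limits,
    fail over immediately on other API errors."""
    for model in models:
        for attempt in range(max_retries):
            try:
                return model, call(model, prompt)
            except RateLimitError:
                time.sleep(2 ** attempt)  # exponential backoff, same model
            except ApiError:
                break  # fail over to the next model in the chain
    raise RuntimeError("all models in the fallback chain failed")
```

The point the code makes concrete: rate limits mean the model is healthy but you are sending too much (retry in place), while other API errors mean the model may be down (move on).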

Build a token usage dashboard by team, service, and model version. This is critical for FinOps - without per-team attribution, you cannot have meaningful cost conversations with product teams.
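A minimal roll-up by team and model might look like this; the per-1K-token prices are illustrative only and change frequently, so in practice they belong in config, not code:

```python
from collections import defaultdict

# Illustrative (input, output) prices per 1K tokens - verify against
# your provider's current price sheet.
PRICES = {"gpt-4o": (0.0025, 0.010), "gpt-4o-mini": (0.00015, 0.0006)}

def attribute_costs(usage_events):
    """Roll up token spend by (team, model). Each event is a dict:
    {"team", "model", "prompt_tokens", "completion_tokens"}."""
    totals = defaultdict(float)
    for e in usage_events:
        p_in, p_out = PRICES[e["model"]]
        cost = (e["prompt_tokens"] / 1000) * p_in \
             + (e["completion_tokens"] / 1000) * p_out
        totals[(e["team"], e["model"])] += cost
    return dict(totals)
```

Keying by (team, model) rather than team alone lets you see when a team's spend jumps because it switched models, not because its volume grew.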

Evaluation as Monitoring

The most sophisticated teams have continuous evaluation running in production: every N requests, the system generates a response, runs it through an automated evaluator, and logs the quality score. This gives you a real-time quality signal rather than waiting for user reports.
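An every-Nth-request sampler is a deterministic alternative to percentage sampling and is trivial to implement; `evaluator` is again a hypothetical scoring hook:

```python
import itertools

class ContinuousEvaluator:
    """Score every Nth production request with an automated evaluator,
    giving a steady real-time quality signal."""

    def __init__(self, evaluator, every_n=20):
        self.evaluator = evaluator   # hypothetical: (question, answer) -> score
        self.every_n = every_n
        self.counter = itertools.count(1)
        self.scores = []

    def log_request(self, question, answer):
        if next(self.counter) % self.every_n == 0:
            self.scores.append(self.evaluator(question, answer))
```

Deterministic sampling guarantees a predictable evaluation volume (and cost) per unit of traffic, which random sampling only gives you in expectation.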

The evaluator does not need to be expensive. GPT-4o-mini can reliably evaluate GPT-4o responses at 1/10th the cost. Use a golden dataset of 500-1000 question-answer pairs that span your most critical use cases, and evaluate against this dataset weekly to catch regressions before they reach users.
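The weekly golden-set run reduces to scoring and comparing against last week's mean; `answer_fn` and `judge_fn` are hypothetical hooks into your RAG system and the evaluator, and the 2-point tolerance is an assumed setting:

```python
def regression_report(golden_set, answer_fn, judge_fn, baseline_mean,
                      tolerance=2.0):
    """Score the current system on (question, reference) pairs and
    compare the mean judge score to last week's baseline."""
    scores = [judge_fn(q, answer_fn(q), ref) for q, ref in golden_set]
    current_mean = sum(scores) / len(scores)
    return {"mean": current_mean,
            "regressed": baseline_mean - current_mean > tolerance}
```

Gating deployments on `regressed` turns the golden set into a CI check, catching regressions before they reach users rather than after.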