The hardest part of running LLMs in production is not getting them to work. It is knowing when they have stopped working.
Demonstrations are misleading. Feed an LLM a curated prompt in a controlled environment and it will produce polished, confident outputs that look like intelligence. Push that same system into production -- noisy user queries, stale retrieval contexts, distribution shifts from the training data, adversarial inputs -- and the same model will surface hallucinated citations, emotionally tone-deaf responses, and factual claims that range from incorrect to dangerous.
The traditional solution is human evaluation. Have an expert review outputs before they reach users. This works for demos. It does not scale. At any meaningful volume, human evaluation becomes the bottleneck that kills deployment velocity or the blind spot that lets bad outputs through.
Automated LLM evaluation frameworks solve this. They provide consistent, programmatic assessment of LLM output quality across dimensions that matter: answer correctness, retrieval precision, factual attribution, safety, and response coherence. When integrated into a CI/CD pipeline, they can gate deployments, trigger alerts when behavior drifts, and give teams objective signal on whether a model change is an improvement or a regression.
This guide covers the production evaluation stack: the frameworks that have proven reliable at scale, the metrics that translate into actionable signal, and the architectural patterns that make evaluation a first-class part of the LLM lifecycle rather than an afterthought.
Why Automated Evaluation Is Hard
Before diving into specific frameworks, it is worth understanding why LLM evaluation is genuinely difficult -- and why most naive approaches fail.
LLM outputs are not software outputs. A software function is correct or incorrect. An LLM output is correct relative to some implicit standard that is expensive to define and harder to automate against. Different evaluators, different contexts, different use cases -- the same output can be genuinely good, contextually adequate, or dangerously wrong depending on what the user needs.
This manifests in three specific challenges:
Ground truth is scarce. For most real-world LLM applications -- legal contract analysis, medical literature summarization, customer support response generation -- there is no ground truth dataset. The domain is too complex, too context-dependent, or too proprietary to have a clean correct answer to evaluate against. Building one requires domain experts who are expensive and scarce.
Quality is multidimensional. A response that scores high on fluency may score low on factual accuracy. A response that correctly cites a source may do so in a misleading way. Evaluation frameworks must weight multiple dimensions -- and those weights are use-case specific.
Distribution shifts over time. The queries your system receives today are not the queries it received three months ago. User behavior drifts, product context changes, retrieval indexes become stale. An evaluation framework that worked last quarter may be measuring against an outdated distribution.
Frameworks like RAGAS and TruLens are designed to address these challenges in specific ways. They do not eliminate them -- no framework can fully automate the judgment call of whether an output is "good enough" for a given context -- but they provide structured, repeatable signals that are far more reliable than spot-checking or gut feeling.
RAGAS: Retrieval-Augmented Generation Assessment
RAGAS (Retrieval-Augmented Generation Assessment) is an open-source evaluation framework purpose-built for RAG pipelines. It evaluates RAG systems on two primary dimensions: the quality of the retrieval step and the quality of the generation step given that retrieved context.
What makes RAGAS distinctive is that it does not require ground truth answers. Instead, it evaluates components of the RAG pipeline using LLMs as judges -- a technique called LLM-based evaluation or model-based evaluation. The framework decomposes the pipeline into discrete steps and evaluates each step independently.
Core RAGAS metrics:
Context Precision measures whether the retrieved context actually contains the information needed to answer the query. It penalizes retrieved chunks that are irrelevant or only tangentially related to what the user asked. This is computed as a ranking-based score: the higher the relevant chunks appear in the retrieval results, the better.
Context Recall measures whether all the information needed to produce a correct answer is present in the retrieved context. If the answer requires three facts and the retrieval only captures two, context recall is low. This requires small hand-crafted datasets of question-answer-context triplets, but the requirements are minimal compared to full ground truth generation.
Answer Faithfulness evaluates whether the generated answer is actually supported by the retrieved context -- whether the LLM is generating facts that the context does not contain. This is the metric most directly connected to hallucination detection: a faithful answer will not make claims that cannot be traced back to the retrieved documents.
Answer Relevancy measures whether the generated answer actually addresses what the user asked. A high-faithfulness answer that goes off on a tangent or addresses the wrong framing gets penalized here. This is computed by comparing the embedding similarity between the generated answer and a re-generated version of the same answer after prompting the LLM to rephrase it.
RAGAS works by calling an LLM -- typically GPT-4 or a comparable model -- to evaluate each metric. The framework constructs specific prompts for each assessment dimension, passes the relevant pipeline artifacts (query, retrieved context, generated answer) as context, and parses the numeric score from the LLM response. This makes it relatively straightforward to integrate into existing pipelines: RAGAS is a Python library with a clean API.
The main limitation is cost and latency. RAGAS makes LLM calls per evaluation per data point. Evaluating 1,000 query-response pairs across four metrics means 4,000 LLM calls. At GPT-4 pricing, this adds up fast. Teams typically use RAGAS for batch evaluation on a sample of production traffic rather than continuous evaluation on every request.
Weave an experiment tracking tool like W&B Weave into your RAGAS evaluation pipeline to log scores, compare model versions, and visualize drifts over time.
TruLens: Causal Attribution for LLM Systems
Where RAGAS focuses on evaluating the outputs of a RAG pipeline, TruLens focuses on explaining and attributing behavior within LLM applications. It is particularly valuable when you need to understand why a system produced a given output -- which components contributed to the final result and how changes to those components affect behavior.
TruLens is built around the concept of attribution. In a TruLens-instrumented application, every component -- the retrieval step, the prompt construction, the model call, the response parsing -- is tracked with a unique identifier. When an output is generated, you can query the history of which components were involved and how they contributed to the final result.
TruLens core capabilities:
Groundedness Feedback is TruLens's answer to answer faithfulness. It evaluates whether each statement in a generated response can be traced back to a specific source in the retrieval context. Unlike simple citation matching, groundedness feedback accounts for paraphrasing -- the LLM may restate a fact in different words, and TruLens evaluates whether the underlying claim is still supported by the source material. This is directly applicable to hallucination detection: outputs with low groundedness scores are flagging hallucinated or fabricated claims.
Answer Relevance Feedback evaluates whether the response actually addresses the input query. TruLens generates alternative questions that the response would answer and compares their semantic similarity to the original question. Low relevance scores indicate responses that are off-topic, tangential, or otherwise failing to engage with what the user asked.
Feedback Modes support both automated and human-in-the-loop evaluation. In automated mode, LLM-based feedback provides fast, consistent scoring. In human-in-the-loop mode, TruLens surfaces specific outputs for human review -- particularly useful for flagging high-stakes cases where automated scoring alone is insufficient.
TruLens integrates with major LLM frameworks including LangChain, LlamaIndex, and direct API access. The instrumentation approach is lightweight: wrap your existing application code with TruLens decorators or context managers, and the framework handles the tracking. The resulting attribution graph gives you explainability that is otherwise missing from black-box LLM applications.
Building a Production Evaluation Pipeline
RAGAS and TruLens are evaluation frameworks -- they provide the scoring logic. What turns them into a production evaluation pipeline is the surrounding infrastructure: data collection, scoring automation, threshold setting, and alerting.
A production evaluation pipeline has four components:
1. Evaluation Dataset Management
The pipeline needs representative data to evaluate against. This means building an evaluation dataset -- a collection of query-response pairs that span the distribution of inputs your system encounters in production. For most teams, this means:
- Seed set: Curated examples from domain experts that cover critical cases -- high-stakes queries, common edge cases, known failure modes from past incidents
- Production sample: Periodically sampled queries from production traffic, human-labeled by a small team for correctness or groundedness
- Adversarial set: Synthetic queries generated by prompting an LLM to produce edge cases -- vague phrasings, ambiguous references, contradictory constraints
The evaluation dataset is never static. As your application evolves, new failure modes emerge. The dataset needs a refresh cadence -- at minimum quarterly, monthly for high-change systems.
2. Scoring Automation
With evaluation data in hand, the scoring layer runs RAGAS or TruLens against the dataset on a regular schedule. Most teams run evaluation on a nightly or weekly batch job, computing aggregate scores across all evaluation examples and breaking down by metric, query category, or model version.
Scoring automation must be instrumented for cost tracking. RAGAS evaluation on a 1,000-example dataset can generate 4,000+ LLM calls. Running this daily with GPT-4-class models costs real money. Teams often use a tiered approach: lightweight metrics (latency, token count, retrieval precision) on every run, full LLM-judged metrics on a weekly sample.
3. Threshold Configuration
Scores without thresholds are useless. A context precision score of 0.73 is meaningless without a definition of "good enough" for your application. Threshold setting is inherently use-case specific and should be set by the team that owns the application with input from stakeholders who understand the cost of errors.
Some threshold-setting approaches that work in practice:
- Historical baseline: Set thresholds relative to the first evaluation run or the last stable deployment. If your faithfulness score was 0.81 before the model update and drops to 0.74 after, that is a measurable regression regardless of absolute value.
- Acceptance criteria: Define what "acceptable" means for the application. A medical RAG system needs higher faithfulness thresholds than a marketing copy generator. The application owner sets these in consultation with domain experts.
- Competitive benchmarking: Compare scores against competing systems or industry standards. If your RAG system scores 0.68 on answer faithfulness and the best-in-class alternative scores 0.85, you have a measurable gap.
4. Alerting and Deployment Gating
The evaluation pipeline closes with action. When scores drop below thresholds -- either on the aggregate or on a specific slice -- the pipeline should trigger one of:
- Alert: Notify the on-call team that evaluation scores have degraded. This is appropriate for gradual drift where no immediate deployment decision is needed.
- Deployment gate: Block a model update or configuration change from reaching production until scores recover. This is appropriate for major regressions that could affect users immediately.
- Incident creation: Open a tracking ticket for the regression with evaluation evidence attached. This is appropriate when the regression is significant but not catastrophic.
Use Weave -- W&B's lightweight evaluation tracking -- to log evaluation scores, set threshold alerts, and compare model performance across versions automatically.
Comparison: RAGAS vs TruLens
RAGAS and TruLens address overlapping but distinct aspects of LLM evaluation. The choice depends on what you need to measure.
| Dimension | RAGAS | TruLens |
|---|---|---|
| Primary focus | RAG pipeline quality (retrieval + generation) | Attribution and explainability across all LLM app components |
| Ground truth required | No (LLM-as-judge) | No (LLM-as-judge) |
| Metric depth | 4 core metrics (context precision/recall, faithfulness, relevancy) | Modular feedback functions; groundedness, relevance, and custom |
| Framework integrations | API-first, framework-agnostic | LangChain, LlamaIndex, direct API |
| Explainability | Metric scores per example | Component-level attribution for every output |
| Cost profile | One LLM call per metric per example | One LLM call per feedback function per example |
| Best for | Measuring RAG system health in aggregate; comparing retrieval strategies | Debugging specific outputs; understanding component contributions; detailed audit trails |
The two frameworks are not mutually exclusive. Many production systems run both: RAGAS for aggregate pipeline health metrics and trend tracking, TruLens for per-example debugging and deep investigation of flagged outputs. The operational overhead is manageable when evaluation is batched and sampling is used to control costs.
Beyond RAG: Evaluating General LLM Outputs
RAGAS is purpose-built for retrieval-augmented pipelines. For general LLM applications -- completion tasks, chat interfaces, code generation -- evaluation requires a different set of approaches.
LLM-as-judge is the most widely applicable pattern. The idea is simple: use a capable model (GPT-4, Claude 3.5 Sonnet) to evaluate the outputs of a target model. The judge model receives the input, the target model's output, and an evaluation prompt that defines the scoring rubric. It outputs a score with justification. This works for dimensions like response quality, coherence, helpfulness, and safety -- anything that can be described in an evaluation prompt.
Reference-free benchmarks like SWE-Agent for code and HELM for general language understanding provide standardized evaluation against curated datasets. These are useful for tracking model capability over time and comparing models, but they do not map directly to your application's specific distribution.
Behavioral testing evaluates LLM outputs against a suite of test cases -- similar to traditional unit testing but with outputs that require judgment rather than exact matching. This is particularly effective for safety-critical applications: test cases that represent known harmful outputs, sensitive topic handling, and jailbreak attempts. The framework runs inputs through the system and checks whether outputs violate defined constraints.
For teams running multiple models -- routing requests between GPT-4, Claude, and open-source models -- evaluation also serves as a model selection signal. The same prompt may perform differently across models, and evaluation data collected across production traffic can inform routing decisions or multi-model fallback strategies.
Evaluation in CI/CD: Making It Actually Happen
The most common failure mode for LLM evaluation programs is treating evaluation as a research project rather than an operational system. Evaluation runs happen sporadically, results are reviewed manually, and the insights never make it back into deployment decisions.
Making evaluation stick requires embedding it into deployment workflows:
Pre-deployment evaluation gates. Before any model update or significant configuration change, run the evaluation pipeline against the full evaluation dataset. Block the deployment if scores drop below configured thresholds. This is the single most effective pattern for maintaining output quality over time -- it prevents regressions from reaching production rather than reacting to them after the fact.
Regression testing on model updates. When a model provider (OpenAI, Anthropic, Google) releases a new model version, run the evaluation pipeline against the new version before routing production traffic to it. Model updates can introduce subtle behavioral changes that are not catches by provider release notes.
Scheduled drift detection. Run evaluation on a sample of production traffic on a fixed schedule (weekly at minimum). Track scores over time. When a sustained downward trend appears -- even if no single run crosses a threshold -- investigate and document whether the change is acceptable or requires intervention.
Evaluation scoreboards. Make evaluation scores visible. Dashboards showing evaluation score trends over time, broken down by metric and application, create accountability. When scores are visible, teams are more likely to address regressions proactively.
What to Evaluate When You Are Just Starting
If you are building your first evaluation program, do not try to evaluate everything at once. Start with the metric that is most directly connected to your application failing in a way that is visible to users.
For RAG systems, that is almost always faithfulness or answer groundedness. Hallucinated citations and fabricated claims are the most visible failure mode for retrieval-augmented systems -- they are the outputs most likely to reach users and cause problems. RAGAS faithfulness scoring or TruLens groundedness feedback gives you a direct signal on this.
For chat interfaces and completion systems, start with a general quality metric: something that captures whether outputs are helpful, relevant, and free of obvious errors. LLM-as-judge with a simple rubric ("rate this response 1-5 on helpfulness and factual accuracy") gives you a quick proxy without requiring complex framework setup.
As the evaluation practice matures, expand to additional dimensions: retrieval precision, latency, safety, coherence. Build the evaluation dataset incrementally. The goal is a feedback loop that gets tighter over time -- faster identification of regressions, more precise localization of failure modes, better coverage of edge cases.
Tools Summary
| Framework | Type | Best For | Cost |
|---|---|---|---|
| RAGAS | Retrieval + generation evaluation | RAG pipeline health tracking, retrieval strategy comparison | Free (open-source); LLM costs for scoring |
| TruLens | Attribution + feedback | Debugging specific outputs, component-level attribution, audit trails | Free (open-source); LLM costs for scoring |
| Weights & Biases Weave | Experiment tracking + evaluation | End-to-end evaluation pipelines, score logging, regression tracking | Free tier; paid plans for teams |
| Arize Phoenix | Observability + evaluation | Production LLM tracing with integrated evaluation metrics | Free tier; paid for production scale |
Conclusion
Evaluation is the discipline that separates LLM experiments from LLM products. Without systematic evaluation, you are flying blind -- relying on spot checks and user reports to identify when your system stops working correctly. With evaluation frameworks like RAGAS and TruLens embedded into your deployment workflows, you have objective, continuous signal on whether your system is maintaining the quality your users expect.
Start with one metric. Pick the failure mode that matters most to your application -- for most RAG systems, that is faithfulness. Set up a basic evaluation run this week. Get the score baseline. Configure a simple alert that fires when the score drops below that baseline. Then expand from there -- more metrics, tighter thresholds, pre-deployment gates, evaluation scoreboards.
The goal is not a perfect evaluation system. The goal is a feedback loop that is good enough to catch regressions before they reach users and to give your team the signal they need to improve the system over time. RAGAS and TruLens make that achievable without requiring a dedicated evaluation team.