Building a Production LLM Evaluation Pipeline: How We Caught 12 Regressions Before They Reached Users
A three-phase evaluation framework - 200-item golden dataset, automated pipeline with LLM-as-Judge, and daily production monitoring - reduced mean time to detect quality issues from 14 days to under 6 hours.
Arko IT Services
Before the evaluation pipeline existed
For the first year of running production LLM systems at an AI-powered loyalty and gifting platform, the deployment process felt fine: review the code, test the integration, deploy to staging, confirm the basic workflow, deploy to production.
The trouble was hiding in the word "basic." The system worked. Nobody was measuring whether it worked well.
The production system was a RAG-based AI agent doing financial document extraction, loyalty recommendation, and AI-assisted forecasting. It changed constantly: prompt tweaks, model version bumps, retrieval config changes, new document types dropped into the knowledge base.
Over a six-month period, before the evaluation framework existed, roughly 14 days of degraded output reached users. In a financial context, 14 days of degraded outputs feed into reports and decisions in ways that are hard to quantify and genuinely uncomfortable to think about.
Building the evaluation framework: the three-phase approach
Phase 1: the golden dataset
The first thing built was not a pipeline. It was a dataset.
We set aside two weeks of engineering time to build a curated evaluation dataset of 200 query-response pairs. A pipeline without a calibrated dataset is not an evaluation system. It is a measuring rig with no idea what its ruler reads.
The 200 items broke down as follows:
- Routine cases, 120 items: the most common query patterns from production logs, across all three use cases (financial extraction, loyalty recommendation, forecasting).
- Edge cases, 50 items: queries from production logs that had previously produced wrong or low-confidence outputs, unusual document formats, and queries sitting right at the edge of what the system claims to handle.
- Adversarial cases, 30 items: queries built to provoke specific failure modes: leading questions that invite hallucination, deliberately ambiguous intent, and context that contains conflicting information.
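Concretely, each item pairs a query (plus any context) with a human-curated reference answer and a category tag. Here is a minimal sketch of what the item schema and loader might look like, assuming JSONL storage; the field names and `load_golden_dataset` helper are illustrative, not the production schema:

```python
# Hypothetical golden-dataset schema; field names are illustrative,
# not taken from the original system.
from dataclasses import dataclass
from enum import Enum
import json

class Category(str, Enum):
    ROUTINE = "routine"          # 120 items: common production query patterns
    EDGE = "edge"                # 50 items: previously wrong or low-confidence outputs
    ADVERSARIAL = "adversarial"  # 30 items: queries built to provoke failures

@dataclass(frozen=True)
class GoldenItem:
    item_id: str
    use_case: str          # e.g. "financial_extraction", "loyalty_recommendation", "forecasting"
    category: Category
    query: str
    context: str           # retrieved or supplied documents, if any
    reference_answer: str  # human-curated expected response

def load_golden_dataset(path: str) -> list[GoldenItem]:
    """Load one JSON object per line into GoldenItem records."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            raw = json.loads(line)
            raw["category"] = Category(raw["category"])
            items.append(GoldenItem(**raw))
    return items
```

Keeping the category on every item matters later: it lets the pipeline report per-category deltas, so a regression that only hits the 30 adversarial cases is not averaged away by 120 healthy routine ones.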
Phase 2: the automated evaluation pipeline
```mermaid
graph TD
    subgraph TRIGGER["Evaluation Triggers"]
        PR_MERGE[PR Merged to Main]
        MODEL_UPDATE[Model Version Update]
        PROMPT_CHANGE[Prompt Configuration Change]
        DAILY_CRON[Daily Production Sample - 5% of volume]
    end
    subgraph EVAL_RUNNER["Evaluation Runner"]
        TRIGGER_HANDLER[Trigger Handler]
        DATASET_LOADER[Golden Dataset Loader]
        BATCH_INF[Batch Inference - production system]
        JUDGE[LLM-as-Judge - GPT-4 with eval rubric]
        SCORER[Multi-Dimension Scorer]
    end
    subgraph SCORING["Scoring Dimensions"]
        CORR_S[Correctness Score - 0 to 1]
        FAITH_S[Faithfulness Score - 0 to 1]
        REL_S[Relevance Score - 0 to 1]
        LAT_S[Latency Percentiles - P50/P90/P99]
    end
    subgraph OUTPUT["Evaluation Output"]
        WEIGHTED[Weighted Composite Score]
        COMPARE2[Delta vs. Previous Baseline]
        REGRESS{Regression Check}
        REPORT[Eval Report - Slack + Dashboard]
    end
    PR_MERGE --> TRIGGER_HANDLER
    MODEL_UPDATE --> TRIGGER_HANDLER
    PROMPT_CHANGE --> TRIGGER_HANDLER
    DAILY_CRON --> TRIGGER_HANDLER
    TRIGGER_HANDLER --> DATASET_LOADER
    DATASET_LOADER --> BATCH_INF
    BATCH_INF --> JUDGE
    JUDGE --> SCORER
    SCORER --> CORR_S
    SCORER --> FAITH_S
    SCORER --> REL_S
    SCORER --> LAT_S
    CORR_S --> WEIGHTED
    FAITH_S --> WEIGHTED
    REL_S --> WEIGHTED
    LAT_S --> WEIGHTED
    WEIGHTED --> COMPARE2
    COMPARE2 --> REGRESS
    REGRESS -->|"3%+ decline"| REPORT
    REGRESS -->|"Within bounds"| REPORT
```
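The JUDGE node is the least mechanical part of the diagram. Here is a minimal sketch of what an LLM-as-Judge call can look like, assuming the OpenAI Python client; the rubric wording, the JSON output contract, and the `judge` function are our illustration, not the production rubric:

```python
# Illustrative LLM-as-Judge call; the rubric text and score parsing
# are assumptions, not the production implementation.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_RUBRIC = """You are an evaluation judge. Score the RESPONSE against
the REFERENCE for the given QUERY on three dimensions, each 0.0 to 1.0:
- correctness: factual agreement with the reference answer
- faithfulness: every claim is supported by the provided CONTEXT
- relevance: the response addresses the query that was actually asked
Return only JSON: {"correctness": x, "faithfulness": x, "relevance": x}"""

def judge(query: str, context: str, response: str, reference: str) -> dict:
    """Score one system output on correctness, faithfulness, relevance."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep judge scores as repeatable as possible
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content":
                f"QUERY:\n{query}\n\nCONTEXT:\n{context}\n\n"
                f"RESPONSE:\n{response}\n\nREFERENCE:\n{reference}"},
        ],
    )
    return json.loads(completion.choices[0].message.content)
```

Pinning temperature to 0 and demanding a strict JSON shape is what keeps judge scores comparable from run to run, which in turn is what makes deltas against a baseline meaningful.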
The composite score weighted dimensions by business risk: correctness (40%), faithfulness (35%), relevance (20%), latency (5%). Any change that dropped the score 3% or more against the prior baseline was flagged as a regression and had to be reviewed before going to production.
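In code, the gate reduces to a few lines. A sketch, under one assumption we are making explicit: "3%" is read here as a relative decline against the baseline composite (an absolute-points reading would tighten the gate slightly):

```python
# Composite scoring and regression gate; the weights mirror the ones
# in the text, the function names are ours.
WEIGHTS = {
    "correctness": 0.40,
    "faithfulness": 0.35,
    "relevance": 0.20,
    "latency": 0.05,  # latency normalized to a 0-1 score beforehand
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum over the four scoring dimensions."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def is_regression(current: float, baseline: float, threshold: float = 0.03) -> bool:
    """Flag any relative decline of 3% or more against the prior baseline."""
    return (baseline - current) / baseline >= threshold

# Example: a change drops the composite from 0.86 to 0.82, a 4.7%
# relative decline, so it is flagged and held for review.
assert is_regression(current=0.82, baseline=0.86)
```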
Phase 3: the production monitoring loop
On-change evaluation runs on every PR merge, model update, or prompt config change, with results back within 20 minutes of the trigger.
The daily production-sample evaluation takes 5% of production traffic, anonymizes it, and scores it against the same rubric. This is what caught quality drift from input distribution changes: cases where production queries were drifting away from the golden dataset in ways that hurt quality.
In practice, the daily sample evaluation was the more valuable of the two.
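Here is a sketch of that daily job, with the caveat that the record fields, the `anonymize` step, and the injected `score` callable are stand-ins for the platform's own plumbing:

```python
# Daily production-sample job: take ~5% of a day's traffic, scrub it,
# and score it with the same judge used for golden-dataset runs.
import random
from typing import Callable

SAMPLE_RATE = 0.05

def anonymize(record: dict) -> dict:
    """Stand-in scrub step; production would strip names, accounts, etc."""
    scrubbed = dict(record)
    scrubbed.pop("user_id", None)
    return scrubbed

def run_daily_eval(
    traffic: list[dict],
    score: Callable[[dict], dict],  # e.g. a wrapper around the judge above
    rate: float = SAMPLE_RATE,
) -> list[dict]:
    """Score a uniform random sample of production query/response records."""
    if not traffic:
        return []
    sample = random.sample(traffic, max(1, int(len(traffic) * rate)))
    return [{"id": r["id"], **score(anonymize(r))} for r in sample]
```

A uniform sample is the simplest choice; stratifying by use case would be the obvious refinement if one traffic type dominated the volume.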
The 12 regressions we caught
| # | Trigger | Root Cause | Severity |
|---|---|---|---|
| 1 | Model version update | Output format changed - JSON schema non-compliance increased 18% | High |
| 2 | Prompt change - financial extraction | Chain-of-thought instruction removed - faithfulness declined 8% | High |
| 3 | RAG config change - chunk size increased | Longer chunks reduced retrieval precision - relevance declined 11% | High |
| 4 | New document type added | Embedding distribution shift - retrieval for existing doc types degraded | Medium |
| 5-7 | Three prompt variations tested A/B | Each scored below baseline on at least one dimension | Medium |
| 8 | Dependency library update | Tokenization change altered context assembly - subtle correctness decline | Low |
| 9 | Distribution drift (daily sample) | A new partner's transaction data format changed - extraction accuracy fell 12% | High |
| 10 | Distribution drift (daily sample) | Seasonal vocabulary shift in loyalty queries | Low |
| 11 | Prompt change - recommendation engine | Inadvertently reduced diversity of recommendations | Medium |
| 12 | Temperature config change | Higher temperature increased output variance - faithfulness SD doubled | Medium |
Regressions 1 through 3 were the worst of the set. A GPT model version update that silently breaks output format compliance is exactly the kind of change that quietly produces financial reporting errors for weeks if no evaluation gate is watching. We caught it in 20 minutes.
The organizational impact
Before the pipeline, every prompt change or model update felt like a gamble. Engineers moved cautiously because they could not predict the quality impact. Iteration was slow.
After the pipeline, a change could be proposed, evaluated, and accepted or rejected in under an hour. Iteration sped up because the uncertainty had been swapped for data.
Mean time to detect a quality issue fell from 14-plus days, when it depended on someone noticing, to under 6 hours for on-change regressions and under 24 hours for distribution drift.
What it cost to build this
- Initial dataset construction: about 80 person-hours over two weeks
- Evaluation pipeline development: about 3 engineering weeks
- Ongoing operational cost: about 2 hours a week of dashboard review
The math works out. If those 12 regressions had reached production, each would have taken roughly 3 days to diagnose and fix after the fact. That is about 36 engineering days of prevented cleanup against roughly five engineering weeks of build cost (dataset plus pipeline).