Building a Production LLM Evaluation Pipeline: How We Caught 12 Regressions Before They Reached Users
A three-phase evaluation framework - 200-item golden dataset, automated pipeline with LLM-as-Judge, and daily production monitoring - reduced mean time to detect quality issues from 14 days to under 6 hours.
Arko IT Services
Before the evaluation pipeline existed
For the first year of running production LLM systems at an AI-powered loyalty and gifting platform, the deployment process felt fine: review the code, test the integration, deploy to staging, confirm the basic workflow, deploy to production.
The trouble was hiding in the word "basic." The system worked. Nobody was measuring whether it worked well.
The production system was a RAG-based AI agent doing financial document extraction, loyalty recommendation, and AI-assisted forecasting. It changed constantly: prompt tweaks, model version bumps, retrieval config changes, new document types dropped into the knowledge base.
Over a six-month period, before the evaluation framework existed, roughly 14 days of degraded output reached users. In a financial context, 14 days of degraded outputs feed into reports and decisions in ways that are hard to quantify and genuinely uncomfortable to think about.
Building the evaluation framework: the three-phase approach
Phase 1: the golden dataset
The first thing built was not a pipeline. It was a dataset.
We set aside two weeks of engineering time to build a curated evaluation dataset of 200 query-response pairs. A pipeline without a calibrated dataset is not an evaluation system. It is a measuring rig with no idea what its ruler reads.
The 200 items broke down as follows:
- Routine cases, 120 items: the most common query patterns from production logs, across all three use cases (financial extraction, loyalty recommendation, forecasting).
- Edge cases, 50 items: queries from production logs that had previously produced wrong or low-confidence outputs, unusual document formats, and queries sitting right at the edge of what the system claims to handle.
- Adversarial cases, 30 items: queries built to provoke specific failure modes: leading questions that invite hallucination, deliberately ambiguous intent, and context that contains conflicting information.
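Concretely, each item pairs a query (plus any context) with a human-curated reference answer and a category tag. Here is a minimal sketch of what the item schema and loader might look like, assuming JSONL storage; the field names and `load_golden_dataset` helper are illustrative, not the production schema:

```python
# Hypothetical golden-dataset schema; field names are illustrative,
# not taken from the original system.
from dataclasses import dataclass
from enum import Enum
import json

class Category(str, Enum):
    ROUTINE = "routine"          # 120 items: common production query patterns
    EDGE = "edge"                # 50 items: previously wrong or low-confidence outputs
    ADVERSARIAL = "adversarial"  # 30 items: queries built to provoke failures

@dataclass(frozen=True)
class GoldenItem:
    item_id: str
    use_case: str          # e.g. "financial_extraction", "loyalty_recommendation", "forecasting"
    category: Category
    query: str
    context: str           # retrieved or supplied documents, if any
    reference_answer: str  # human-curated expected response

def load_golden_dataset(path: str) -> list[GoldenItem]:
    """Load one JSON object per line into GoldenItem records."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            raw = json.loads(line)
            raw["category"] = Category(raw["category"])
            items.append(GoldenItem(**raw))
    return items
```

Keeping the category on every item matters later: it lets the pipeline report per-category deltas, so a regression that only hits the 30 adversarial cases is not averaged away by 120 healthy routine ones.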
Phase 2: the automated evaluation pipeline
```mermaid
graph TD
    subgraph TRIGGER["Evaluation Triggers"]
        PR_MERGE[PR Merged to Main]
        MODEL_UPDATE[Model Version Update]
        PROMPT_CHANGE[Prompt Configuration Change]
        DAILY_CRON[Daily Production Sample - 5% of volume]
    end
    subgraph EVAL_RUNNER["Evaluation Runner"]
        TRIGGER_HANDLER[Trigger Handler]
        DATASET_LOADER[Golden Dataset Loader]
        BATCH_INF[Batch Inference - production system]
        JUDGE[LLM-as-Judge - GPT-4 with eval rubric]
        SCORER[Multi-Dimension Scorer]
    end
    subgraph SCORING["Scoring Dimensions"]
        CORR_S[Correctness Score - 0 to 1]
        FAITH_S[Faithfulness Score - 0 to 1]
        REL_S[Relevance Score - 0 to 1]
        LAT_S[Latency Percentiles - P50/P90/P99]
    end
    subgraph OUTPUT["Evaluation Output"]
        WEIGHTED[Weighted Composite Score]
        COMPARE2[Delta vs. Previous Baseline]
        REGRESS{Regression Check}
        REPORT[Eval Report - Slack + Dashboard]
    end
    PR_MERGE --> TRIGGER_HANDLER
    MODEL_UPDATE --> TRIGGER_HANDLER
    PROMPT_CHANGE --> TRIGGER_HANDLER
    DAILY_CRON --> TRIGGER_HANDLER
    TRIGGER_HANDLER --> DATASET_LOADER
    DATASET_LOADER --> BATCH_INF
    BATCH_INF --> JUDGE
    JUDGE --> SCORER
    SCORER --> CORR_S
    SCORER --> FAITH_S
    SCORER --> REL_S
    SCORER --> LAT_S
    CORR_S --> WEIGHTED
    FAITH_S --> WEIGHTED
    REL_S --> WEIGHTED
    LAT_S --> WEIGHTED
    WEIGHTED --> COMPARE2
    COMPARE2 --> REGRESS
    REGRESS -->|"3%+ decline"| REPORT
    REGRESS -->|"Within bounds"| REPORT
```
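The JUDGE node is the least mechanical part of the diagram. Here is a minimal sketch of what an LLM-as-Judge call can look like, assuming the OpenAI Python client; the rubric wording, the JSON output contract, and the `judge` function are our illustration, not the production rubric:

```python
# Illustrative LLM-as-Judge call; the rubric text and score parsing
# are assumptions, not the production implementation.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_RUBRIC = """You are an evaluation judge. Score the RESPONSE against
the REFERENCE for the given QUERY on three dimensions, each 0.0 to 1.0:
- correctness: factual agreement with the reference answer
- faithfulness: every claim is supported by the provided CONTEXT
- relevance: the response addresses the query that was actually asked
Return only JSON: {"correctness": x, "faithfulness": x, "relevance": x}"""

def judge(query: str, context: str, response: str, reference: str) -> dict:
    """Score one system output on correctness, faithfulness, relevance."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep judge scores as repeatable as possible
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content":
                f"QUERY:\n{query}\n\nCONTEXT:\n{context}\n\n"
                f"RESPONSE:\n{response}\n\nREFERENCE:\n{reference}"},
        ],
    )
    return json.loads(completion.choices[0].message.content)
```

Pinning temperature to 0 and demanding a strict JSON shape is what keeps judge scores comparable from run to run, which in turn is what makes deltas against a baseline meaningful.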
The composite score weighted dimensions by business risk: correctness (40%), faithfulness (35%), relevance (20%), latency (5%). Any change that dropped the score 3% or more against the prior baseline was flagged as a regression and had to be reviewed before going to production.
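In code, the gate reduces to a few lines. A sketch, under one assumption we are making explicit: "3%" is read here as a relative decline against the baseline composite (an absolute-points reading would tighten the gate slightly):

```python
# Composite scoring and regression gate; the weights mirror the ones
# in the text, the function names are ours.
WEIGHTS = {
    "correctness": 0.40,
    "faithfulness": 0.35,
    "relevance": 0.20,
    "latency": 0.05,  # latency normalized to a 0-1 score beforehand
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum over the four scoring dimensions."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def is_regression(current: float, baseline: float, threshold: float = 0.03) -> bool:
    """Flag any relative decline of 3% or more against the prior baseline."""
    return (baseline - current) / baseline >= threshold

# Example: a change drops the composite from 0.86 to 0.82, a 4.7%
# relative decline, so it is flagged and held for review.
assert is_regression(current=0.82, baseline=0.86)
```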
Phase 3: the production monitoring loop
On-change evaluation runs on every PR merge, model update, or prompt config change, with results back within 20 minutes of the trigger.
The daily production-sample evaluation takes 5% of production traffic, anonymizes it, and scores it against the same rubric. This is what caught quality drift from input distribution changes: cases where production queries were drifting away from the golden dataset in ways that hurt quality.
In practice, the daily sample evaluation was the more valuable of the two.
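Here is a sketch of that daily job, with the caveat that the record fields, the `anonymize` step, and the injected `score` callable are stand-ins for the platform's own plumbing:

```python
# Daily production-sample job: take ~5% of a day's traffic, scrub it,
# and score it with the same judge used for golden-dataset runs.
import random
from typing import Callable

SAMPLE_RATE = 0.05

def anonymize(record: dict) -> dict:
    """Stand-in scrub step; production would strip names, accounts, etc."""
    scrubbed = dict(record)
    scrubbed.pop("user_id", None)
    return scrubbed

def run_daily_eval(
    traffic: list[dict],
    score: Callable[[dict], dict],  # e.g. a wrapper around the judge above
    rate: float = SAMPLE_RATE,
) -> list[dict]:
    """Score a uniform random sample of production query/response records."""
    if not traffic:
        return []
    sample = random.sample(traffic, max(1, int(len(traffic) * rate)))
    return [{"id": r["id"], **score(anonymize(r))} for r in sample]
```

A uniform sample is the simplest choice; stratifying by use case would be the obvious refinement if one traffic type dominated the volume.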
The 12 regressions we caught
| # | Trigger | Root Cause | Severity |
|---|---|---|---|
| 1 | Model version update | Output format changed - JSON schema non-compliance increased 18% | High |
| 2 | Prompt change - financial extraction | Chain-of-thought instruction removed - faithfulness declined 8% | High |
| 3 | RAG config change - chunk size increased | Longer chunks reduced retrieval precision - relevance declined 11% | High |
| 4 | New document type added | Embedding distribution shift - retrieval for existing doc types degraded | Medium |
| 5-7 | Three prompt variations tested A/B | Each scored below baseline on at least one dimension | Medium |
| 8 | Dependency library update | Tokenization change altered context assembly - subtle correctness decline | Low |
| 9 | Distribution drift (daily sample) | A new partner's transaction data format changed - extraction accuracy fell 12% | High |
| 10 | Distribution drift (daily sample) | Seasonal vocabulary shift in loyalty queries | Low |
| 11 | Prompt change - recommendation engine | Inadvertently reduced diversity of recommendations | Medium |
| 12 | Temperature config change | Higher temperature increased output variance - faithfulness SD doubled | Medium |
Regressions 1 through 3 were the worst of the set. A GPT model version update that silently breaks output format compliance is exactly the kind of change that quietly produces financial reporting errors for weeks if no evaluation gate is watching. We caught it in 20 minutes.
The organizational impact
Before the pipeline, every prompt change or model update felt like a gamble. Engineers moved cautiously because they could not predict the quality impact. Iteration was slow.
After the pipeline, a change could be proposed, evaluated, and accepted or rejected in under an hour. Iteration sped up because the uncertainty had been swapped for data.
Mean time to detect a quality issue fell from 14-plus days, when it depended on someone noticing, to under 6 hours for on-change regressions and under 24 hours for distribution drift.
What it cost to build this
- Initial dataset construction: about 80 person-hours over two weeks
- Evaluation pipeline development: about 3 engineering weeks
- Ongoing operational cost: about 2 hours a week of dashboard review
The math works out. If those 12 regressions had reached production, each would have taken roughly 3 days to diagnose and fix after the fact. That is about 36 engineering days of prevented cleanup against roughly five engineering weeks of build cost (dataset plus pipeline).