Case Study · 12 regressions caught · 6 min read

Building a Production LLM Evaluation Pipeline: How We Caught 12 Regressions Before They Reached Users

A three-phase evaluation framework - 200-item golden dataset, automated pipeline with LLM-as-Judge, and daily production monitoring - reduced mean time to detect quality issues from 14 days to under 6 hours.

Arko IT Services

Before the evaluation pipeline existed

For the first year of running production LLM systems at an AI-powered loyalty and gifting platform, the deployment process felt fine: review the code, test the integration, deploy to staging, confirm the basic workflow, deploy to production.

The trouble was hiding in the word "basic." The system worked. Nobody was measuring whether it worked well.

The production system was a RAG-based AI agent doing financial document extraction, loyalty recommendation, and AI-assisted forecasting. It changed constantly: prompt tweaks, model version bumps, retrieval config changes, new document types dropped into the knowledge base.

Over a six-month period, before the evaluation framework existed, roughly 14 days of degraded output reached users. In a financial context, 14 days of degraded outputs feed into reports and decisions in ways that are hard to quantify and genuinely uncomfortable to think about.


Building the evaluation framework: the three-phase approach

Phase 1: the golden dataset

The first thing built was not a pipeline. It was a dataset.

We set aside two weeks of engineering time to build a curated evaluation dataset of 200 query-response pairs. A pipeline without a calibrated dataset is not an evaluation system. It is a measuring rig with no idea what its ruler reads.

The 200 items broke down as follows.

Routine cases, 120 items: the most common query patterns from production logs, across all three use cases (financial extraction, loyalty recommendation, forecasting).

Edge cases, 50 items: queries from production logs that had previously produced wrong or low-confidence outputs, unusual document formats, and queries sitting right at the edge of what the system claims to handle.

Adversarial cases, 30 items: queries built to provoke specific failure modes. Leading questions that invite hallucination, deliberately ambiguous intent, and context that contains conflicting information.
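A dataset with this composition is worth validating in code before a version is frozen. A minimal sketch; the `GoldenItem` fields and category names are assumptions for illustration, not the team's actual schema:

```python
from dataclasses import dataclass

# Hypothetical schema for one evaluation item; field names are assumptions.
@dataclass
class GoldenItem:
    item_id: str
    category: str          # "routine" | "edge" | "adversarial"
    use_case: str          # e.g. "financial_extraction", "loyalty_recommendation", "forecasting"
    query: str
    reference_answer: str  # human-curated expected output

def freeze_dataset(items: list[GoldenItem]) -> list[GoldenItem]:
    """Check the 120/50/30 split described above before freezing a dataset version."""
    counts = {"routine": 0, "edge": 0, "adversarial": 0}
    for item in items:
        counts[item.category] += 1
    expected = {"routine": 120, "edge": 50, "adversarial": 30}
    if counts != expected:
        raise ValueError(f"unexpected split: {counts}")
    return items
```

Freezing a versioned dataset this way means every later evaluation run can state exactly which ruler it was measured against.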


Phase 2: the automated evaluation pipeline

The full pipeline, from triggers through reporting (Mermaid notation):

graph TD
    subgraph TRIGGER["Evaluation Triggers"]
        PR_MERGE[PR Merged to Main]
        MODEL_UPDATE[Model Version Update]
        PROMPT_CHANGE[Prompt Configuration Change]
        DAILY_CRON[Daily Production Sample - 5% of volume]
    end

    subgraph EVAL_RUNNER["Evaluation Runner"]
        TRIGGER_HANDLER[Trigger Handler]
        DATASET_LOADER[Golden Dataset Loader]
        BATCH_INF[Batch Inference - production system]
        JUDGE[LLM-as-Judge - GPT-4 with eval rubric]
        SCORER[Multi-Dimension Scorer]
    end

    subgraph SCORING["Scoring Dimensions"]
        CORR_S[Correctness Score - 0 to 1]
        FAITH_S[Faithfulness Score - 0 to 1]
        REL_S[Relevance Score - 0 to 1]
        LAT_S[Latency Percentiles - P50/P90/P99]
    end

    subgraph OUTPUT["Evaluation Output"]
        WEIGHTED[Weighted Composite Score]
        COMPARE2[Delta vs. Previous Baseline]
        REGRESS{Regression Check}
        REPORT[Eval Report - Slack + Dashboard]
    end

    PR_MERGE --> TRIGGER_HANDLER
    MODEL_UPDATE --> TRIGGER_HANDLER
    PROMPT_CHANGE --> TRIGGER_HANDLER
    DAILY_CRON --> TRIGGER_HANDLER
    TRIGGER_HANDLER --> DATASET_LOADER
    DATASET_LOADER --> BATCH_INF
    BATCH_INF --> JUDGE
    JUDGE --> SCORER
    SCORER --> CORR_S
    SCORER --> FAITH_S
    SCORER --> REL_S
    SCORER --> LAT_S
    CORR_S --> WEIGHTED
    FAITH_S --> WEIGHTED
    REL_S --> WEIGHTED
    LAT_S --> WEIGHTED
    WEIGHTED --> COMPARE2
    COMPARE2 --> REGRESS
    REGRESS -->|>3% decline| REPORT
    REGRESS -->|Within bounds| REPORT
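The judge step in the diagram reduces to a rubric prompt plus a parser for the judge's reply. A hedged sketch: the rubric wording is invented, and the actual GPT-4 API call is left out, since only the prompt assembly and score parsing are shown:

```python
import json

# Illustrative rubric; the production rubric's exact wording is not public.
RUBRIC = """You are an evaluation judge. Score the candidate answer against the
reference on three dimensions, each from 0.0 to 1.0:
- correctness: factual agreement with the reference answer
- faithfulness: every claim is supported by the retrieved context
- relevance: the answer addresses the query that was asked
Respond with JSON only: {"correctness": x, "faithfulness": y, "relevance": z}"""

def build_judge_prompt(query: str, context: str, candidate: str, reference: str) -> str:
    """Assemble the rubric and the case under evaluation into one judge prompt."""
    return (f"{RUBRIC}\n\nQuery: {query}\nRetrieved context: {context}\n"
            f"Candidate answer: {candidate}\nReference answer: {reference}")

def parse_judge_scores(raw: str) -> dict[str, float]:
    """Parse the judge's JSON reply and clamp each score into [0, 1]."""
    scores = json.loads(raw)
    return {dim: min(1.0, max(0.0, float(scores[dim])))
            for dim in ("correctness", "faithfulness", "relevance")}
```

Clamping matters in practice: judge models occasionally return scores outside the requested range, and an unclamped outlier can swing a batch average.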

The composite score weighted dimensions by business risk: correctness (40%), faithfulness (35%), relevance (20%), latency (5%). Any change that dropped the score 3% or more against the prior baseline was flagged as a regression and had to be reviewed before going to production.
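The weighting and the 3% gate translate directly into a small scoring function. A sketch using the weights from the text; treating latency as a normalized 0-to-1 score is an assumption about how the team folded it in:

```python
WEIGHTS = {"correctness": 0.40, "faithfulness": 0.35, "relevance": 0.20, "latency": 0.05}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted composite of the four dimension scores, each already in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def is_regression(current: float, baseline: float, threshold: float = 0.03) -> bool:
    """Flag a regression when the composite drops 3% or more vs. the prior baseline."""
    return (baseline - current) / baseline >= threshold
```

Note the gate is relative to the baseline, not an absolute floor: a system scoring 0.95 is held to a tighter band than one scoring 0.70, which keeps the check meaningful as quality improves.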


Phase 3: the production monitoring loop

On-change evaluation runs on every PR merge, model update, or prompt config change, with results back within 20 minutes of the trigger.

Daily production sample evaluation samples 5% of production traffic, anonymizes it, and scores it against the same rubric. This is what caught quality drift from input distribution changes: cases where production queries were drifting away from the golden dataset in ways that hurt quality.

In practice, the daily sample evaluation was the more valuable of the two.
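The daily loop needs a deterministic way to pick roughly 5% of traffic and to anonymize it before it is scored. A minimal sketch; the hash-based sampling and the redaction patterns in `anonymize` are assumptions, not the team's actual implementation:

```python
import hashlib
import re

def in_daily_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select ~5% of traffic by hashing the request ID,
    so retries of the same request never land in the sample twice."""
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < rate

def anonymize(query: str) -> str:
    """Redact obvious identifiers before the query leaves production.
    Illustrative patterns only - a real redactor would be far stricter."""
    query = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", query)
    query = re.sub(r"\b\d{12,19}\b", "[CARD]", query)
    return query
```

Hash-based sampling also makes the daily sample reproducible: rerunning the scorer on the same day's traffic selects exactly the same requests.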


The 12 regressions we caught

# | Trigger | Root Cause | Severity
1 | Model version update | Output format changed: JSON schema non-compliance increased 18% | High
2 | Prompt change (financial extraction) | Chain-of-thought instruction removed; faithfulness declined 8% | High
3 | RAG config change (chunk size increased) | Longer chunks reduced retrieval precision; relevance declined 11% | High
4 | New document type added | Embedding distribution shift degraded retrieval for existing doc types | Medium
5-7 | Three prompt variations tested via A/B | Each scored below baseline on at least one dimension | Medium
8 | Dependency library update | Tokenization change altered context assembly; subtle correctness decline | Low
9 | Distribution drift (daily sample) | A new partner's transaction data format changed; extraction accuracy fell 12% | High
10 | Distribution drift (daily sample) | Seasonal vocabulary shift in loyalty queries | Low
11 | Prompt change (recommendation engine) | Inadvertently reduced recommendation diversity | Medium
12 | Temperature config change | Higher temperature increased output variance; faithfulness standard deviation doubled | Medium

Regressions 1 through 3 were the worst of the set. A GPT model version update that silently breaks output format compliance is exactly the kind of change that quietly produces financial reporting errors for weeks if no evaluation gate is watching. We caught it in 20 minutes.


The organizational impact

Before the pipeline, every prompt change or model update felt like a gamble. Engineers moved cautiously because they could not predict the quality impact. Iteration was slow.

After the pipeline, a change could be proposed, evaluated, and accepted or rejected in under an hour. Iteration sped up because the uncertainty had been swapped for data.

Mean time to detect a quality issue fell from 14-plus days, when it depended on someone noticing, to under 6 hours for on-change regressions and under 24 hours for distribution drift.


What it cost to build this

  • Initial dataset construction: about 80 person-hours over two weeks
  • Evaluation pipeline development: about 3 engineering weeks
  • Ongoing operational cost: about 2 hours a week of dashboard review

The math works out. If those 12 regressions had reached production, each would have taken roughly 3 days to diagnose and fix after the fact. That is about 36 engineering days of prevented cleanup against 3 weeks of build cost.

Free strategy call

Thirty minutes.
Three concrete recommendations.

We review your current technology landscape, identify your top three risks, and tell you what to do next. No deck, no commitment — just senior judgement, on the record.