Prompt Regression Testing

Most prompt quality regressions are predictable. A template gets “slightly improved,” a checker rule changes, a retrieval source is updated, and output quality drifts before anyone notices.

The fix is not a bigger prompt. The fix is regression testing with a stable eval set and explicit release gates. You do not need a quarter-long program to start. You can build a credible first version in one week.

This playbook gives a practical day-by-day plan. It aligns with workflow-level gating from Evaluation Hooks for AI Workflows and release discipline from The Prompt Registry Playbook. Regression discipline applies on local stacks too—not only hosted APIs; Critique Agent v0.9 shows frozen cases and release gates on Ubuntu with Ollama, Critique Agent v1.0 adds a seven-section output validation layer before SQLite persist, and Part 3 shows one audit becoming two regression tests on an external repo—tests, not auto-fixes.

What “good enough in one week” means

One week is enough for a credible regression gate when scope stays ruthless: one workflow, one prompt/checker pair, frozen cases, baseline scores, and a release rule tied to promotion—not a perfect eval lab. The deliverables below stop accidental quality decay; they do not replace long-term eval maturity from evaluation hooks. Northline built their first thirty-case set for support-reply-v3 in five working days and blocked a wording tweak two weeks later when policy pass dropped below gate.

By end of week, you should have:

one scoped workflow and one canonical prompt/checker pair
25-40 frozen test cases with expected outcomes
a simple rubric with pass/fail and failure tags
baseline scores for current production version
a release rule: changes cannot promote if regression threshold is violated

This is enough to stop accidental quality decay.

Day-by-day implementation plan

The day-by-day plan below assumes one accountable owner and no parallel workflows. Each day produces an artifact—frozen version, case set, rubric, baseline scores, release gates—that builds on the prior day without rework. Changing rubric mid-week invalidates comparisons; mixing workflows inflates pass rates with noise. Northline assigned a process owner and an engineering partner for the week; disputes on ten calibration cases were resolved on day four before baseline locked and release gates were published to the workflow changelog.

Day 1 — Scope and freeze

Pick one workflow where prompt drift is expensive (customer-facing support, compliance drafts, routing decisions).

Define:

target output contract
unacceptable failures
owner who approves scoring disputes

Freeze the current prompt version and context dependencies. Do not collect cases across multiple workflows in week one.

Day 2 — Collect realistic cases

Build a case set from real historical inputs, anonymized where needed.

Case mix target:

60% common cases
25% edge cases
15% known failure patterns

Document each case with:

input payload
expected output or expected classification
policy flags if relevant

Avoid synthetic-only datasets; they overestimate quality.

Day 3 — Define rubric and failure taxonomy

Create a practical scoring rubric:

Dimension	Rule
Task correctness	Output solves requested task
Policy compliance	No blocked claims/actions
Format contract	Required structure present
Evidence quality	Cites allowed context when required

Add failure tags:

fact_error
policy_violation
format_break
missing_context
unsafe_action

Your taxonomy should guide remediation, not only reporting.

Day 4 — Run baseline and calibrate

Run all cases against current version. Two reviewers should independently score at least 10 shared cases to calibrate consistency.

Capture:

overall pass rate
pass rate by case type
top failure tags

If reviewers disagree heavily, simplify rubric wording before continuing.

Day 5 — Define release gates

Translate results into deployment rules:

Gate	Rule
Smoke	>=80% pass on 10-case quick set
Release candidate	No critical policy failures + >=90% full-set pass
Production promotion	No regression >2 percentage points on key slice

Tie these gates to your workflow release process. No exceptions by urgency without explicit sign-off.

Minimal data model for eval cases

Ad-hoc spreadsheets fail regression programs because case IDs drift and version links disappear. Store cases with stable identifiers, workflow IDs, risk tiers, and tags so you can compare pass rates across prompt versions and spot repeated failure modes. The YAML shape below is intentionally minimal—expand only when maintenance rhythm is proven. Northline stores cases beside prompt registry entries so release tickets link prompt_version, case_set_version, and gate results in one changelog row.

Store eval cases with stable IDs:

case_id: support_017
workflow_id: support-reply-v3
input_ref: dataset/support_017.json
expected_ref: expected/support_017.json
risk_tier: medium
tags: [edge_case, policy_sensitive]

Stable IDs let you compare across versions and track repeated regressions.

How to keep maintenance lightweight

Regression sets rot when teams never add incidents or never retire obsolete cases. Weekly maintenance should stay small—two to five new cases from real failures, owner-approved retirements, monthly mix rebalance, and failure-tag review in ops meetings. Giant quarterly rebuilds feel thorough but break baseline continuity. Northline adds cases after every policy miss or format break in pilot, tags them edge_case or policy_sensitive, and rebalances monthly so common paths do not drown in edge noise.

Weekly maintenance can stay small:

Add 2-5 new cases from real incidents.
Retire obsolete cases only with owner approval.
Rebalance case mix monthly.
Review top repeated failure tags in operations meeting.

This is more effective than rebuilding giant eval sets every quarter.

Common pitfalls in week one

Week-one pitfalls usually come from ambition scope creep—multiple workflows, moving rubrics, synthetic-only cases, spreadsheet chaos. Each item below produces a pass rate that looks reassuring until production traffic disagrees. Keep version one small and strict; expand case count after the first blocked promotion proves the gate works. Northline’s proposal-assist team nearly mixed routing and drafting cases; separating workflows on day one saved a false baseline that would have approved a harmful checker change.

Mixing multiple workflows into one eval set.
Changing rubric mid-run and invalidating baseline comparisons.
Scoring only overall pass rate without failure tags.
Ignoring policy-critical failures because average score looks high.
Keeping results in ad hoc spreadsheets with no version tie.

Keep the first version small and strict.

Example one-week outcome

A team managing proposal-assist prompts built a 32-case set in five working days:

baseline pass rate: 88%
policy compliance: 96%
format contract failures concentrated in edge cases

They introduced a simple release rule: no promotion if policy pass drops or if edge-case pass declines by more than 3 points. Two weeks later, a “small wording tweak” failed gate and was held before production impact.

The value was not perfect scoring. The value was preventing silent regression.

What to do Monday

Select one workflow and name one accountable owner.
Freeze current prompt version and context snapshot.
Commit to collecting 30 realistic cases by end of week.
Publish a first-pass rubric and release threshold.

Prompt systems become trustworthy when quality is measured repeatedly, not assumed. A one-week regression setup is enough to move from anecdote to controlled improvement. Use the AI Workflow Eval Checklist for copy-paste release gates once your case set is frozen.