Anonymized composite (Northline B2B)—multiple implementations. Part 1 covers the move from vibe prompting to structured support-reply-v3.
Situation (week 12)
Northline completed a twelve-week pilot on tier-2 support assist at fifty percent shadow traffic. Median handle time on the assisted queue was roughly eighteen percent lower; CSAT rose six to nine points; eval pass rate held at or above ninety-two percent on twenty-five held-out cases. Executive sponsor asked to "flip the whole queue" before quarter end. Process owner asked for evidence that eighty percent coverage would not reintroduce the policy near-misses that justified human send in v1.
The team had a working prompt registry, eval checklist, and monthly risk forum rhythm from Part 1. Part 2 is the story of scaling coverage without scaling risk faster than governance.
Approach
Leadership agreed to increase shadow traffic in two steps—sixty-five percent, then eighty percent—only when weekly pass rate and override review met pre-declared gates. No new tools. No model swap during expansion (tempting for sponsor optics; blocked by process owner until eval set re-run).
Eval expansion: Held-out set grew from twenty-five to forty cases—fifteen drawn from override reasons in weeks 8–12 (VIP language, deprecated refund phrasing, wrong product SKU). Fail criteria unchanged: policy violation, wrong fact, missing escalation.
Override review: Support ops lead reviewed every override >2 minutes with categorization tag. Tags fed eval set within seven days per evaluation hooks discipline.
Logging: policy_pack_version and prompt_id version already mandatory from Part 1; added shadow_traffic_pct to audit rows for replay during expansion debates.
Gate table (what had to be true)
| Gate | 50% → 65% | 65% → 80% |
|---|---|---|
| Weekly pass rate (40 cases) | ≥92% for 2 weeks | ≥92% for 3 weeks |
| Policy violations on eval | 0 | 0 |
| New override categories | Added to eval within 7 days | Same |
| Risk forum vote | Yes | Yes |
| Rollback drill | Prior registry pin <30 min | Repeated |
Forum rejected a request to skip the sixty-five percent step. That delay cost two weeks of sponsor enthusiasm—and prevented a VIP escalation miss at seventy-two percent traffic that became eval case #38.
Results (ranges)
Metrics measured on the assisted tier-2 queue only; unassisted queues excluded to avoid mixing effects.
| Metric | At 50% (wk 12) | At 80% (wk 20) |
|---|---|---|
| Median handle time vs baseline | ~18% lower | ~20% lower |
| CSAT on assisted queue | +6–9 pts | +7–10 pts |
| Weekly eval pass rate | 92–94% | 91–93% |
| Policy-related escalations | down sharply | stable |
| Override rate | monitored | slight rise; reviewed |
Pass rate dipped one point at eighty percent—expected with harder case mix—not treated as failure because zero policy violations held and override tags explained variance (wording preference, not factual errors).
Lessons
Stepwise traffic beats big-bang. Sponsors remember two extra weeks; Legal remembers zero new policy incidents.
Override review is eval fuel. Teams that skip categorization recreate vibe prompting through "helpful" human edits that never become cases.
Activity metrics stayed off promotion criteria. Drafts generated per day rose; that number was reported for transparency only.
Registry rollback drill mattered. Week 16 staging mistake loaded wrong hash; rollback to 1.4.2 took nine minutes because prod pins were real.
What they would do differently
Start the forty-case eval set at week 8 instead of week 14—earlier hard cases would have surfaced VIP language gaps before sixty-five percent traffic.
Publish shadow traffic percentage in the agent UI earlier; reps trusted the system more when coverage was visible.
What to do Monday
If you are at pilot pass rate but leadership wants full queue coverage:
- Copy the gate table into your risk forum agenda.
- Add fifteen override-driven cases to eval before the next traffic bump.
- Run a registry rollback drill before increasing shadow percent.
- Block model or retrieval tier changes during expansion windows.
Where to go next
Part 1 case study for foundation. Prompt registry playbook for release discipline. RAG in Production when retrieval tier changes are proposed during scale. 10 Signs Your Company Is Vibe Prompting if side pilots reappear while you scale.