Northline Part 2

Anonymized composite (Northline B2B)—multiple implementations. Part 1 covers the move from vibe prompting to structured support-reply-v3.

Situation (week 12)

Northline completed a twelve-week pilot on tier-2 support assist at fifty percent shadow traffic. Median handle time on the assisted queue was roughly eighteen percent lower; CSAT rose six to nine points; eval pass rate held at or above ninety-two percent on twenty-five held-out cases. Executive sponsor asked to “flip the whole queue” before quarter end. Process owner asked for evidence that eighty percent coverage would not reintroduce the policy near-misses that justified human send in v1.

The team had a working prompt registry, eval checklist, and monthly risk forum rhythm from Part 1. Part 2 is the story of scaling coverage without scaling risk faster than governance.

Approach

Leadership agreed to increase shadow traffic in two steps—sixty-five percent, then eighty percent—only when weekly pass rate and override review met pre-declared gates. No new tools. No model swap during expansion (tempting for sponsor optics; blocked by process owner until eval set re-run).

Eval expansion: Held-out set grew from twenty-five to forty cases—fifteen drawn from override reasons in weeks 8–12 (VIP language, deprecated refund phrasing, wrong product SKU). Fail criteria unchanged: policy violation, wrong fact, missing escalation.

Override review: Support ops lead reviewed every override >2 minutes with categorization tag. Tags fed eval set within seven days per evaluation hooks discipline.

Logging: policy_pack_version and prompt_id version already mandatory from Part 1; added shadow_traffic_pct to audit rows for replay during expansion debates.

Gate table (what had to be true)

Gate	50% → 65%	65% → 80%
Weekly pass rate (40 cases)	≥92% for 2 weeks	≥92% for 3 weeks
Policy violations on eval	0	0
New override categories	Added to eval within 7 days	Same
Risk forum vote	Yes	Yes
Rollback drill	Prior registry pin <30 min	Repeated

Forum rejected a request to skip the sixty-five percent step. That delay cost two weeks of sponsor enthusiasm—and prevented a VIP escalation miss at seventy-two percent traffic that became eval case #38.

Failed gate (week 15): Weekly pass rate dipped to 89% for one week at sixty-five percent traffic—below the ≥92% gate. Process owner held traffic, added twelve override-driven cases within five days, and re-ran eval before retrying. Sponsors saw a hold decision with evidence, not a narrative debate. Those cases fed the monthly Measuring AI Workflow ROI scorecard: pass rate and incident rate moved together before leadership approved eighty percent coverage.

Results (ranges)

Metrics measured on the assisted tier-2 queue only; unassisted queues excluded to avoid mixing effects.

Metric	At 50% (wk 12)	At 80% (wk 20)
Median handle time vs baseline	~18% lower	~20% lower
CSAT on assisted queue	+6–9 pts	+7–10 pts
Weekly eval pass rate	92–94%	91–93%
Policy-related escalations	down sharply	stable
Override rate	monitored	slight rise; reviewed

Pass rate dipped one point at eighty percent—expected with harder case mix—not treated as failure because zero policy violations held and override tags explained variance (wording preference, not factual errors).

Lessons

Stepwise traffic beats big-bang. Sponsors remember two extra weeks; Legal remembers zero new policy incidents.

Override review is eval fuel. Teams that skip categorization recreate vibe prompting through “helpful” human edits that never become cases.

Activity metrics stayed off promotion criteria. Drafts generated per day rose; that number was reported for transparency only.

Registry rollback drill mattered. Week 16 staging mistake loaded wrong hash; rollback to 1.4.2 took nine minutes because prod pins were real.

What they would do differently

Start the forty-case eval set at week 8 instead of week 14—earlier hard cases would have surfaced VIP language gaps before sixty-five percent traffic.

Publish shadow traffic percentage in the agent UI earlier; reps trusted the system more when coverage was visible.

What to do Monday

If you are at pilot pass rate but leadership wants full queue coverage:

Copy the gate table into your risk forum agenda.
Add fifteen override-driven cases to eval before the next traffic bump.
Run a registry rollback drill before increasing shadow percent.
Block model or retrieval tier changes during expansion windows.

Where to go next

Part 1 case study for foundation. Prompt registry playbook for release discipline. RAG in Production when retrieval tier changes are proposed during scale. 10 Signs Your Company Is Vibe Prompting if side pilots reappear while you scale.