Anonymized composite from multiple implementations.
Situation
A 120-person B2B company had copilots in support, sales, and marketing. Leadership saw activity metrics rise; CSAT flatlined. Support leads spent evenings fixing AI replies.
Approach
- Paused new tool trials for 90 days.
- Selected one workflow: suggested replies on tier-2 tickets.
- Built context from 40 knowledge base (KB) articles tagged customer-safe.
- Added checker step for unsupported claims.
- Required human send; logged overrides.
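The three controls above (tagged context, a claim checker, and human send with logged overrides) can be sketched roughly as follows. This is an illustrative sketch, not the team's implementation: the `Article` type, the `customer-safe` tag string, the verbatim-match checker, the log format, and the sample article text are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical KB article record; field names are assumptions."""
    title: str
    body: str
    tags: set[str] = field(default_factory=set)


def build_context(articles: list[Article], tag: str = "customer-safe") -> list[Article]:
    """Keep only articles that carry the required tag."""
    return [a for a in articles if tag in a.tags]


def unsupported_claims(draft_sentences: list[str], context: list[Article]) -> list[str]:
    """Return draft sentences that do not appear (case-insensitively) in any context article.

    Verbatim matching is a deliberately crude stand-in for a real checker.
    """
    corpus = " ".join(a.body.lower() for a in context)
    return [s for s in draft_sentences if s.lower() not in corpus]


def record_send(log: list[dict], ticket_id: str, suggested: str, sent: str) -> None:
    """Log each human send; an override is a sent reply that differs from the suggestion."""
    log.append({"ticket": ticket_id, "override": suggested.strip() != sent.strip()})


# Made-up sample data for illustration only.
articles = [
    Article("Refund policy", "Refunds are issued within 14 days.", {"customer-safe"}),
    Article("Internal escalation notes", "Waive fees for strategic accounts.", {"internal"}),
]
context = build_context(articles)
flagged = unsupported_claims(
    ["Refunds are issued within 14 days.", "Waive fees for strategic accounts."],
    context,
)
# flagged holds only the sentence sourced from the untagged internal article.
```

Keeping the override flag in the log is what lets support leads later see how often agents rewrote a suggestion, rather than relying on activity counts alone.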
Results (ranges)
| Metric | Before (8 wk avg) | After (12 wk pilot) |
|---|---|---|
| Median handle time | baseline | ~18% lower |
| CSAT on assisted queue | flat | +6–9 pts |
| Escalations from wrong policy | frequent | down sharply |
| Reproducibility across agents | low | high on eval set |
Lessons
- Diagnostics (10 Signs) focused the team on one process instead of debating tools.
- Governance was lightweight but had named roles (see roles guide).
- Model changes mattered less than context and eval discipline.
- The program succeeded because leadership protected the pilot from scope creep for 90 days.
What they would do differently
Start with eval cases before writing prompts. Involve support leads in context tagging week one, not week six. Publish a simple change log when context packs update so agents know which policy version they are running against.
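One way to make "eval cases first" and a context-pack change log concrete is sketched below. It is a minimal example under stated assumptions: `EvalCase`, `ChangeLogEntry`, the phrase-matching pass rule, and the version labels are hypothetical and not taken from the pilot.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical eval case: a ticket plus phrases a good reply must contain."""
    ticket_text: str
    must_mention: tuple[str, ...]


@dataclass(frozen=True)
class ChangeLogEntry:
    """Hypothetical change-log record for a context pack update."""
    version: str
    date: str
    summary: str


def passes(case: EvalCase, reply: str) -> bool:
    """A reply passes when it contains every required phrase (case-insensitive)."""
    text = reply.lower()
    return all(phrase.lower() in text for phrase in case.must_mention)


def pass_rate(cases: list[EvalCase], replies: list[str]) -> float:
    """Share of cases whose paired reply passes; 0.0 when there are no cases."""
    results = [passes(c, r) for c, r in zip(cases, replies)]
    return sum(results) / len(results) if results else 0.0


def current_version(change_log: list[ChangeLogEntry]) -> str:
    """Agents read the latest entry to know which policy version is live."""
    return change_log[-1].version
```

Writing the `EvalCase` list before any prompt exists gives the team a fixed target, and rerunning `pass_rate` after each `ChangeLogEntry` shows whether a context update helped or regressed.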
Next step for readers
If your team mirrors this story—strong activity, weak reproducibility—run the diagnostic, pick one queue, and fill the workflow canvas before the next vendor demo.