Evaluation hooks are checkpoints where workflow output is scored against known cases—not informal “looks good” reviews.
Build an eval set
- Collect 20–50 real inputs (redact as needed).
- Label expected outcomes: pass, fail, or “human must edit.”
- Tag failure modes: factual error, policy breach, format, tone.
- Store with workflow version metadata.
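One way to picture a stored eval case is as a small record that carries the input, the expected outcome, the failure-mode tags, and the workflow version. The sketch below is illustrative only: the names `EvalCase`, `Expected`, `FailureMode`, and `to_jsonl`, and the choice of JSON Lines as a storage format, are assumptions and not something this guide prescribes.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Expected(Enum):
    # The three labels from the checklist above.
    PASS = "pass"
    FAIL = "fail"
    HUMAN_EDIT = "human_must_edit"


class FailureMode(Enum):
    # The four failure-mode tags from the checklist above.
    FACTUAL_ERROR = "factual_error"
    POLICY_BREACH = "policy_breach"
    FORMAT = "format"
    TONE = "tone"


@dataclass
class EvalCase:
    case_id: str                      # hypothetical identifier scheme
    input_text: str                   # real input, already redacted
    expected: Expected
    failure_modes: List[FailureMode] = field(default_factory=list)
    workflow_version: str = "unversioned"


def to_record(case: EvalCase) -> dict:
    """Convert a case to a plain dict that json can serialize."""
    return {
        "case_id": case.case_id,
        "input_text": case.input_text,
        "expected": case.expected.value,
        "failure_modes": [mode.value for mode in case.failure_modes],
        "workflow_version": case.workflow_version,
    }


def to_jsonl(cases: List[EvalCase]) -> str:
    """Serialize cases as one JSON object per line."""
    return "\n".join(json.dumps(to_record(c)) for c in cases)
```

Keeping `workflow_version` on every record means a later regression can be traced to the prompt or context version that was live when the case was labeled.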
Pass / fail gates
| Gate | When | Rule |
| --- | --- | --- |
| Smoke | Daily in pilot | 100% pass on 5 critical cases |
| Release | Before prompt/context deploy | No regression on held-out set |
| Scale | Monthly in production | Error rate under agreed threshold |
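The three gate rules can be sketched as simple predicates over per-case results. This is a minimal sketch under stated assumptions: results are modeled as a dict from case ID to a pass flag, "no regression" is read as "every held-out case that passed on the baseline still passes on the candidate," and the function names and the `max_error_rate` parameter are hypothetical. The actual threshold is whatever the team has agreed on.

```python
from typing import Dict, Iterable


def smoke_gate(results: Dict[str, bool], critical_ids: Iterable[str]) -> bool:
    """Pass only if every critical case passed; a missing result counts as a failure."""
    return all(results.get(case_id, False) for case_id in critical_ids)


def release_gate(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> bool:
    """Pass only if no held-out case that passed on the baseline fails on the candidate."""
    return all(
        candidate.get(case_id, False)
        for case_id, passed in baseline.items()
        if passed
    )


def scale_gate(results: Dict[str, bool], max_error_rate: float) -> bool:
    """Pass only if the share of failed cases is strictly under the agreed threshold."""
    if not results:
        return False  # no data is treated as a failed gate, not a silent pass
    errors = sum(1 for passed in results.values() if not passed)
    return errors / len(results) < max_error_rate
```

Treating missing results and empty result sets as failures is a deliberate choice in this sketch: a gate that passes because nothing ran gives no evidence about the workflow.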
Example cases (support reply)
| Input | Pass if |
| --- | --- |
| Refund request over limit | Escalates or cites policy clause X |
| Wrong product mentioned | Does not ship without human |
| Standard how-to | Correct steps from KB article ID |
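The example cases above could be turned into automated checks along these lines. Everything about the reply shape is an assumption made for illustration: the `Reply` class, its fields (`escalated`, `held_for_human`, `kb_article_id`, `steps`), and the check function names are hypothetical. The policy clause and the KB article ID are passed in by the caller because the table leaves them unspecified.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Reply:
    # Hypothetical shape of a drafted support reply.
    text: str
    escalated: bool = False
    held_for_human: bool = False
    kb_article_id: Optional[str] = None
    steps: List[str] = field(default_factory=list)


def check_refund_over_limit(reply: Reply, policy_clause: str) -> bool:
    """Pass if the reply escalates or its text cites the given policy clause."""
    return reply.escalated or policy_clause in reply.text


def check_wrong_product(reply: Reply) -> bool:
    """For a case known to mention the wrong product: pass only if held for a human."""
    return reply.held_for_human


def check_standard_howto(
    reply: Reply, expected_article_id: str, expected_steps: List[str]
) -> bool:
    """Pass if the reply uses the expected KB article and reproduces its steps in order."""
    return reply.kb_article_id == expected_article_id and reply.steps == expected_steps
```

Exact matching on steps is strict; a team that allows paraphrased steps would need a looser comparison or a human label for that case.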