Anonymized composite (Northline B2B)—multiple implementations.
Situation
After the support queue pilot succeeded, Northline's finance team asked whether AI could shorten month-end variance narratives—the paragraphs controllers attach to GL line items when actuals diverge from forecast. Controllers spent four to six hours each close rephrasing the same explanations: FX movement, timing on vendor invoices, headcount reclasses. Junior staff pasted numbers into chat tools and produced fluent narratives that sometimes cited the wrong period or wrong cost center because the model never saw the signed extract.
Leadership wanted speed; the controller wanted evidence. The team scored the effort against implementation maturity at Level 2—repeatable workflow design, not yet eval-gated production. They agreed to treat finance as a second vertical proof with stricter boundaries than support: no model sees raw bank feeds; no auto-post to the ERP; every narrative requires controller sign-off before export to the board pack.
Approach
The program cloned the support pattern—canvas, registry, eval—but tightened data boundaries per data boundaries for AI agents. Workflow ID finance-variance-v1 pulled only from controller-signed GL extracts (CSV export tagged finance-close-2026-03, hash logged). Prompt assembly followed context architecture rules: static policy (rounding, materiality thresholds) separated from dynamic line-item context.
Draft step: model produced variance narrative per line item above materiality threshold. Checker step: rule engine verified period, cost center, and currency matched extract metadata. Human gate: controller edited or rejected each narrative; rejections fed the eval set within forty-eight hours. Audit: every row logged prompt version, extract hash, and controller ID per audit trails for AI workflows.
Governance followed existing RACI from roles and ownership—Finance ops accountable, IT responsible for extract pipeline, Legal consulted on retention. Monthly risk forum reviewed override reasons; no traffic increase until pass rate held.
Eval and change control
Finance could not reuse support's twenty-five ticket eval set. The team built eighteen held-out variance scenarios with known-good narratives—fail on wrong period, immaterial line promoted, or missing FX disclosure. Smoke gate: ten scenarios, one hundred percent pass before any controller used AI drafts in production close.
Change log discipline mattered: when GL mapping changed or a new materiality rule shipped, Finance updated the prompt registry row and re-ran smoke before the next close. Controllers refused "silent upgrades" after a near-miss where an overnight model swap changed rounding language. The team adopted the change log template in AI Change Log Template for every prompt, context, and extract version bump.
Results (ranges)
Metrics focused on controller hours on narrative drafting and rework rate (narratives sent back for factual correction)—not drafts generated.
| Metric | Before (3 closes avg) | After (4 closes pilot) |
|---|---|---|
| Controller hours on narratives | baseline | ~25% lower |
| Rework rate (factual correction) | moderate | down ~40% |
| Time-to-first-draft per material line | manual | ~60% faster |
| Audit replay success | ad hoc | 100% on sample drill |
Activity metrics (tokens, drafts) were reported but excluded from promotion decisions—aligned with measuring AI workflow ROI guidance.
Lessons
Finance proved the support playbook transfers across verticals when boundaries tighten rather than loosen. Signed extracts beat "upload the spreadsheet to chat." Controller sign-off remained non-negotiable—AI shortened first drafts, not accountability. Linking registry pins to board-pack footnotes reduced audit friction in quarter two.
The team would have moved faster with the change log template from week one; ad hoc Slack announcements of prompt updates eroded trust. Starting eval scenarios before the first close workshop would have caught period-mismatch failures earlier.
What they would do differently
Involve controllers in eval case authoring in week one—not week four after the first wrong-period narrative. Pre-wire extract hash validation before any model call; retrofits cost a full close cycle. Assign a finance deputy owner before year-end close overlapped the pilot.
Next step for readers
If month-end narratives consume controller time but your data boundaries are unclear, pause tool trials, map allow/deny for GL exports, and run smoke eval on held-out scenarios before the next close. Support queue proof: From Vibe Prompting to a Structured Support Workflow. Tender pipeline variant: AI Tender Response Pipeline.