Most teams evaluate agents with one number because one number is easy to present. Usually it is accuracy, or cost per run, or "users like it." In production, that shortcut fails. An agent can be cheap and unsafe, fast and wrong, accurate and too fragile to survive traffic spikes.
CLEAR is a practical way to score an agent as a system in operation, not a demo artifact. It stands for Cost, Latency, Efficacy, Assurance, and Reliability. Together, these five dimensions make trade-offs explicit before incidents force them.
If your team already uses workflow-level gates from Evaluation Hooks for AI Workflows, CLEAR gives you the scorecard to decide whether a workflow should stay in pilot, scale, or roll back.
Why single-metric evaluation breaks
Executive dashboards compress agent performance into one green number because committees prefer simplicity. In production, that compression hides trade-offs until they become incidents: a routing agent that gains accuracy by dropping policy checks, or a drafting agent that cuts cost by skipping retrieval. CLEAR exists because agents are systems: prompts, tools, context packs, escalation rules, and human gates. You cannot govern a system with a single dial. Northline's steering group stopped approving scale requests backed by accuracy alone after a checker bypass nearly reached pilot traffic.
Single-metric reporting creates false confidence:
- "Accuracy improved" hides a 2x latency jump that breaks SLA.
- "Cost dropped" hides a policy-check step that was quietly removed.
- "User acceptance is high" hides brittle behavior after a model update.
Agents are systems of prompts, tools, context, and escalation rules. You need a system score.
The CLEAR dimensions
Each CLEAR dimension answers a different operator question—economic defensibility, timing reliability, task usefulness, policy alignment, and stability over time. Weakness in any one dimension should block or narrow scale even when others look strong. Score dimensions separately with distinct data sources; averaging them into a composite hides the failure mode you will regret in production. Northline reviews all five weekly for every customer-facing workflow in pilot, using the same scorecard template whether the agent is single-step or multi-agent.
C — Cost
Measure end-to-end operating cost per completed workflow outcome, not just model tokens.
Include:
- model tokens and embeddings
- retrieval and tool/API calls
- orchestration overhead
- human review time for escalations
- rework cost from failed runs
Operator question: Is this workflow economically defensible at expected volume, including human fallback?
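As a rough illustration of the cost components listed above, the sketch below adds them up and divides by completed outcomes. The `CostInputs` fields, the reviewer hourly rate, and all figures are hypothetical placeholders, not Northline data.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Weekly cost totals for one workflow, all in the same currency (hypothetical fields)."""
    model_and_embeddings: float
    retrieval_and_tools: float
    orchestration: float
    human_review_hours: float
    reviewer_hourly_rate: float
    rework: float
    completed_outcomes: int


def cost_per_outcome(c: CostInputs) -> float:
    """End-to-end cost divided by completed workflow outcomes, not by raw runs."""
    if c.completed_outcomes <= 0:
        raise ValueError("no completed outcomes to divide by")
    human_review = c.human_review_hours * c.reviewer_hourly_rate
    total = (
        c.model_and_embeddings
        + c.retrieval_and_tools
        + c.orchestration
        + human_review
        + c.rework
    )
    return total / c.completed_outcomes


# Illustrative numbers only.
week = CostInputs(
    model_and_embeddings=310.0,
    retrieval_and_tools=95.0,
    orchestration=40.0,
    human_review_hours=12.0,
    reviewer_hourly_rate=45.0,
    rework=75.0,
    completed_outcomes=1200,
)

model_only = week.model_and_embeddings / week.completed_outcomes
print(f"model-only: {model_only:.2f}  end-to-end: {cost_per_outcome(week):.2f}")
```

With these made-up inputs, human review is the largest single component, which a tokens-only figure would not show.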
L — Latency
Measure p50 and p95 latency from trigger to usable output, including queueing and human gates.
Include:
- model inference time
- retrieval/tool round trips
- retry overhead
- wait time at human approval checkpoints
Operator question: Does this workflow meet business timing needs consistently, not just in happy paths?
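A minimal sketch of computing p50 and p95 from trigger-to-output timings follows. It uses the nearest-rank percentile definition, which is one of several common conventions, and the sample latencies are invented for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds, including one run
# that waited at a human approval checkpoint.
latencies_s = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.2, 4.0, 28.5, 4.3]

p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
mean = sum(latencies_s) / len(latencies_s)
print(f"p50={p50}s p95={p95}s mean={mean:.2f}s")
```

In this sample the mean sits near 6.7 seconds while p95 is 28.5 seconds, so the slow run only becomes visible in the tail.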
E — Efficacy
Measure whether outputs solve the business task on a held-out eval set and in live pilot traffic.
Include:
- task success rate on frozen eval set
- failure taxonomy (fact, policy, format, action)
- override rate in pilot
- business KPI proxy (for example, first-pass resolution)
Operator question: Is the agent actually producing useful outcomes, not convincing text?
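One way to report task success alongside the failure taxonomy is sketched below. The record shape (`case`, `passed`, `failure_type`) and the eight results are assumptions made for the example; only the four taxonomy labels come from the list above.

```python
from collections import Counter

# Hypothetical results from one run over a frozen eval set.
# failure_type uses the taxonomy above: fact, policy, format, action.
results = [
    {"case": "t-001", "passed": True, "failure_type": None},
    {"case": "t-002", "passed": False, "failure_type": "policy"},
    {"case": "t-003", "passed": True, "failure_type": None},
    {"case": "t-004", "passed": False, "failure_type": "fact"},
    {"case": "t-005", "passed": True, "failure_type": None},
    {"case": "t-006", "passed": False, "failure_type": "policy"},
    {"case": "t-007", "passed": True, "failure_type": None},
    {"case": "t-008", "passed": True, "failure_type": None},
]


def summarize(records):
    """Return (task success rate, Counter of failure types) for one eval run."""
    if not records:
        raise ValueError("empty eval run")
    passed = sum(1 for r in records if r["passed"])
    taxonomy = Counter(r["failure_type"] for r in records if not r["passed"])
    return passed / len(records), taxonomy


success_rate, taxonomy = summarize(results)
print(f"task success: {success_rate:.1%}")
for kind, count in taxonomy.most_common():
    print(f"  {kind}: {count}")
```

Keeping the taxonomy next to the headline rate shows whether failures cluster in one category, such as policy, rather than being spread evenly.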
A — Assurance
Measure policy alignment, auditability, and control quality.
Include:
- policy/checker pass rate
- rate of blocked unsafe actions
- completeness of audit fields per run
- evidence that changes passed approval and release discipline
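Assurance is where many "high-performing" systems fail governance review. Pair this with your release and audit design from Evaluation Hooks for AI Workflows.
Operator question: Can we show that every run stayed within policy and every change went through approval?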
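Assurance becomes measurable once each run record carries the fields an auditor would ask for. The sketch below assumes a hypothetical set of required fields and a `checker_result` value of `"pass"` or `"fail"`; substitute whatever your own audit design records.

```python
# Hypothetical required audit fields; replace with your own audit schema.
REQUIRED_AUDIT_FIELDS = (
    "run_id",
    "workflow_version",
    "policy_pack_version",
    "checker_result",
    "approver",
)


def is_complete(run):
    """A run is audit-complete when every required field is present and non-empty."""
    return all(run.get(field) not in (None, "") for field in REQUIRED_AUDIT_FIELDS)


def assurance_summary(runs):
    """Return audit completeness and policy pass rate as fractions of all runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    total = len(runs)
    complete = sum(1 for r in runs if is_complete(r))
    policy_pass = sum(1 for r in runs if r.get("checker_result") == "pass")
    return {
        "audit_completeness": complete / total,
        "policy_pass_rate": policy_pass / total,
    }


# Illustrative run records only.
runs = [
    {"run_id": "r1", "workflow_version": "v3", "policy_pack_version": "p7",
     "checker_result": "pass", "approver": "ops-a"},
    {"run_id": "r2", "workflow_version": "v3", "policy_pack_version": "p7",
     "checker_result": "pass", "approver": ""},
    {"run_id": "r3", "workflow_version": "v3", "policy_pack_version": "p7",
     "checker_result": "fail", "approver": "ops-b"},
    {"run_id": "r4", "workflow_version": "v3", "policy_pack_version": "p7",
     "checker_result": "pass", "approver": "ops-a"},
]

print(assurance_summary(runs))
```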
R — Reliability
Measure stability over time and under operational variance.
Include:
- incident rate per 1,000 runs
- retry success/failure ratio
- drift after model/context changes
- degradation under peak load
Operator question: Can this workflow be trusted next month, not only this week?
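The incident-rate metric can be normalized and compared across a change window as in this sketch. The counts are hypothetical and the before/after comparison is only one simple way to look for drift after a model or context change.

```python
def incidents_per_1k(incidents: int, runs: int) -> float:
    """Incident rate normalized to 1,000 runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return 1000 * incidents / runs


# Hypothetical counts for the week before and the week after a context change.
before = incidents_per_1k(incidents=4, runs=5000)
after = incidents_per_1k(incidents=11, runs=5000)
print(f"before={before:.1f}/1k after={after:.1f}/1k change={after - before:+.1f}/1k")
```

Normalizing by run count keeps a busy week from looking less reliable than a quiet one purely because of volume.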
CLEAR scorecard template
The scorecard is a decision artifact, not a reporting vanity table. One row per workflow, updated weekly, forces sponsors to name which dimension is weak before approving promotion. Include the decision column explicitly—scale, hold, rollback, or narrow scope—so "we are monitoring" does not substitute for accountable choice. Northline pins the scorecard to their workflow changelog; when assurance drops after a policy pack update, the row shows who held rollout and what remediation was assigned.
Use one row per workflow, updated weekly:
| Workflow | Cost | Latency | Efficacy | Assurance | Reliability | Decision |
|---|---|---|---|---|---|---|
| Support triage | 0.83 EUR/run | p95 11.2s | 91% pass | 98% policy pass | 0.7 incidents/1k | Scale pilot |
| Outreach draft | 0.42 EUR/run | p95 6.1s | 87% pass | 92% policy pass | 1.8 incidents/1k | Hold |
The point is not a vanity average. The point is a gating decision that names the weak dimension.
Suggested thresholds by stage
Thresholds should reflect workflow risk tier, not company-wide averages. A contract-review agent needs stricter assurance gates than an internal meeting summarizer; a customer-facing router needs tighter latency SLOs than a batch reporting job. Start with the table below as defaults, then calibrate with process owners and governance after two pilot weeks of real traffic. Document changes in the workflow registry so auditors can see why a gate moved—not just that it moved.
You can start with simple stage gates:
| Stage | Cost | Latency | Efficacy | Assurance | Reliability |
|---|---|---|---|---|---|
| Smoke | Tracked only | Tracked only | >=80% | No critical policy failures | No blocking incidents |
| Pilot | Within target band | Meets p95 SLA | >=90% | >=95% policy pass | <=2 incidents/1k |
| Scale | Stable month-over-month | Meets p95/p99 SLO | >=92% sustained | >=98% policy pass | <=1 incident/1k |
Keep thresholds workflow-specific. A contract-check agent should carry stricter assurance requirements than an internal summarization helper.
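The pilot and scale gates in the table can be encoded as a small check that returns the failing dimensions instead of a composite score. This is a sketch under assumptions: the cost and latency flags are set to `True` because the article does not state target bands or SLAs for the example workflows, and the "sustained" and "month-over-month" conditions are not modeled.

```python
from dataclasses import dataclass


@dataclass
class ClearRow:
    workflow: str
    cost_in_band: bool          # assumed flag: cost is within the agreed target band
    latency_meets_target: bool  # assumed flag: p95 (or p99) meets the SLA/SLO
    efficacy_pct: float
    policy_pass_pct: float
    incidents_per_1k: float


# Numeric thresholds copied from the stage table above.
GATES = {
    "pilot": {"efficacy_pct": 90.0, "policy_pass_pct": 95.0, "max_incidents_per_1k": 2.0},
    "scale": {"efficacy_pct": 92.0, "policy_pass_pct": 98.0, "max_incidents_per_1k": 1.0},
}


def failed_dimensions(row: ClearRow, stage: str) -> list:
    """List the CLEAR dimensions that miss the gate for the given stage."""
    gate = GATES[stage]
    failures = []
    if not row.cost_in_band:
        failures.append("cost")
    if not row.latency_meets_target:
        failures.append("latency")
    if row.efficacy_pct < gate["efficacy_pct"]:
        failures.append("efficacy")
    if row.policy_pass_pct < gate["policy_pass_pct"]:
        failures.append("assurance")
    if row.incidents_per_1k > gate["max_incidents_per_1k"]:
        failures.append("reliability")
    return failures


# Values taken from the scorecard rows; cost and latency flags are assumptions.
support_triage = ClearRow("Support triage", True, True, 91.0, 98.0, 0.7)
outreach_draft = ClearRow("Outreach draft", True, True, 87.0, 92.0, 1.8)

for row in (support_triage, outreach_draft):
    for stage in ("pilot", "scale"):
        print(row.workflow, stage, failed_dimensions(row, stage) or "passes")
```

Returning the list of failing dimensions, rather than a pass/fail boolean, keeps the weak dimension visible in the decision record.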
Operating rhythm: weekly CLEAR review
CLEAR only works when it is ritualized. A 30-minute weekly review with process owner, ops, and IT beats a quarterly deck that arrives after incidents. The agenda is fixed: read the scorecard, rank failure modes, assign one action per weak dimension, and log the release decision. Northline runs this review every Thursday for pilot workflows; escalations that breach assurance or reliability thresholds trigger an ad-hoc hold before the next scheduled session.
Run a 30-minute weekly review with process owner, ops, and IT:
- Read last week's CLEAR row and incidents.
- Review top three failure modes by count and severity.
- Decide one action per weak dimension (for example, retrieval tuning for efficacy, cache for latency).
- Confirm release decision: promote, hold, rollback, or narrow scope.
- Log decisions and owners in the workflow changelog.
This turns evaluation into operations, not ceremony.
Example: why CLEAR prevents false scale
False scale is the most expensive eval failure: high headline metrics with hidden weakness in assurance or reliability. The routing agent example below is typical—leadership sees 94% efficacy and assumes readiness, while policy-check gaps and retrieval drift wait for production volume to expose them. CLEAR forces those weaknesses onto the same slide as the success metric. Northline uses the same pattern when sponsors push to expand support-reply-v3 before assurance recovers from a policy language change.
A team had excellent efficacy (94%) on a routing agent and wanted to scale. CLEAR exposed two blockers:
- assurance was 91% because checker rules missed new policy language
- reliability degraded to 2.4 incidents per 1,000 runs after a retrieval index refresh
Without CLEAR, they would have scaled and learned in production. With CLEAR, they held rollout, patched policy checks, and re-ran pilot.
Common implementation mistakes
Teams adopt CLEAR, then undermine it with measurement shortcuts—model cost only, average latency only, eval sets that drift without version control. Each mistake below produces a scorecard that looks healthy while operating risk compounds. Review this list during your first monthly retrospective and assign owners to fix data sources, not just to "watch" metrics. Northline's ops lead rejected a dashboard that averaged CLEAR dimensions into a single KPI; the weekly row format stayed separate per dimension.
- Tracking model cost only; ignoring human review and rework costs.
- Reporting average latency only; hiding p95 spikes.
- Mixing eval set versions so efficacy appears to "improve" without proof.
- Treating assurance as legal paperwork instead of measurable controls.
- Declaring reliability "good" after one calm week.
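One possible guard against the eval-set mixing mistake is to fingerprint the frozen eval set and refuse to compare scores across different fingerprints. The sketch below is an assumption about how that could look, using a SHA-256 hash of a canonical JSON dump; the case fields and pass rates are invented.

```python
import hashlib
import json


def eval_set_fingerprint(cases):
    """Stable hash of the eval cases; any edit to the set changes the fingerprint."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def efficacy_delta(previous, current):
    """Compare pass rates only when both runs used the same eval set."""
    if previous["eval_set"] != current["eval_set"]:
        raise ValueError("eval set changed; efficacy scores are not comparable")
    return current["pass_rate"] - previous["pass_rate"]


# Hypothetical eval cases and weekly results.
cases_v1 = [
    {"id": "t-001", "input": "refund request", "expected": "route:billing"},
    {"id": "t-002", "input": "password reset", "expected": "route:account"},
]
cases_v2 = cases_v1 + [{"id": "t-003", "input": "easy case", "expected": "route:general"}]

last_week = {"eval_set": eval_set_fingerprint(cases_v1), "pass_rate": 0.88}
this_week = {"eval_set": eval_set_fingerprint(cases_v1), "pass_rate": 0.91}
print(f"delta on same eval set: {efficacy_delta(last_week, this_week):+.2f}")

mixed = {"eval_set": eval_set_fingerprint(cases_v2), "pass_rate": 0.95}
try:
    efficacy_delta(last_week, mixed)
except ValueError as err:
    print(f"refused comparison: {err}")
```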
What to do Monday
- Pick one production or pilot workflow.
- Define one metric per CLEAR dimension with owner and data source.
- Add stage thresholds for smoke, pilot, and scale.
- Start a weekly 30-minute CLEAR review.
If your team already runs eval hooks, CLEAR is the missing decision layer. It helps leadership answer the only question that matters in production: should we scale this workflow now, and can we defend that decision later?
For copy-paste smoke, pilot, and scale gates, use the AI Workflow Eval Checklist. For multi-agent systems, pair CLEAR with Multi-Agent Observability so reliability metrics include handoff failures, not only single-agent runs.