Built by Berry — Operational AI

Built by Berry is the operational AI firm — we ship the systems, the agents, and the training to run them.

Production AI · Part 1 of 1 · Explainer · For engineering and operations leads building agent workflows

The reviewer console is where humans belong

8 min read · Published 2026-04-16 · Updated 2026-06-14

Operational AI that works in production does not eliminate humans. It moves them to the right place. The model handles the pattern — the cases that look like cases you have seen before. The human handles the exception — the disputed invoice, the contract with non-standard terms, the customer flagged for escalation, the output where confidence dropped below the threshold you trust. The interface where that split happens is the reviewer console — and it is the difference between a workflow your ops team runs and one your engineering team babysits.

If you are arriving here from workflow design work, the Expert in the Loop series explains why control points exist — from thought-partner framing through generate/decide splits — before this article shows how to implement them in production software.

#Automation handles the pattern; humans handle the exception

Every real operational workflow has a boundary. Inside the boundary, the rules are clear enough for a system to execute. Outside the boundary, a person with context and authority needs to decide.

Support triage is a useful example. The agent reads the ticket, pulls CRM and billing context, drafts a response, and routes it. Standard billing questions, password resets, and status inquiries are inside the boundary. A customer threatening to cancel a seven-figure contract after a data breach is outside it. The agent should not auto-send a templated reply. It should surface the case to someone who can make a judgment call.

Record reconciliation shows the same split. The agent compares records across systems, surfaces drift, and drafts the fix. Penny-level rounding differences are inside the boundary. A mismatch that implies a revenue recognition error is outside it. The fix draft lands in the console. A human confirms or escalates.

The mistake teams make is treating human review as a temporary measure — something you will remove once the model gets good enough. That is backwards. The exceptions are not a failure of the model. They are a feature of the business. Contracts vary. Clients have history. Edge cases are edge cases because they require judgment. The goal is not zero human involvement. The goal is human involvement only where judgment is required, with full context when it is.

flowchart LR
    input[Case arrives] --> boundary{Inside boundary?}
    boundary -->|Yes| auto[Automated path]
    boundary -->|No| console[Reviewer console]
    auto --> log[Audit log]
    console --> human[Human action]
    human --> log

#What a reviewer console actually is

A reviewer console is not a chat window. It is not a shared inbox. It is not a Slack channel where someone posts "please check this."

It is an operational screen where a named person sees a queue of cases that need human attention. For each case, they see what the AI produced, what inputs it used, why this case was routed for review, and what actions they can take: approve, edit, reject, escalate, reassign.

The console answers four questions for every item in the queue.

What happened? The inputs retrieved, the model output, the timestamp, the run identifier. Not a summary of what happened — the actual artifacts.

Why is this here? The routing rule that sent it: confidence below threshold, dollar amount above limit, keyword match, missing required field, client tier requiring manual review. The human should not have to guess why they are looking at this case.

What can I do? Clear actions with consequences. Approve sends the output to its destination. Edit lets the human fix the output before it ships. Reject stops the workflow and records why. Escalate moves it to someone with more authority.

What happens next? After the human acts, the workflow continues or stops. The case leaves the queue. The audit log records the decision.

If your ops team has to ask engineering to answer any of those four questions, you do not have a reviewer console. You have a dependency.

| Question | What the console must show |
| --- | --- |
| What happened? | Inputs, model output, run ID (the artifacts) |
| Why is this here? | Routing rule that sent it, not guesswork |
| What can I do? | Approve, edit, reject, escalate, each with consequences |
| What happens next? | Workflow continues or stops; case leaves queue |
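One way to picture the four questions is as fields on a single queue item. The sketch below is illustrative only: `ReviewItem`, `Action`, and every field name are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Action(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"
    ESCALATE = "escalate"
    REASSIGN = "reassign"


@dataclass
class ReviewItem:
    # What happened? The actual artifacts, not a summary.
    run_id: str
    inputs: dict
    model_output: str
    created_at: datetime
    # Why is this here? The routing rule(s) that fired.
    routing_reasons: list
    # What can I do? Actions available for this case (all of them by default).
    allowed_actions: set = field(default_factory=lambda: set(Action))
    # What happens next? Filled in once, when the reviewer acts.
    decision: Optional[Action] = None
    decided_by: Optional[str] = None

    def act(self, action: Action, reviewer: str) -> None:
        """Record a single decision; a second decision is refused."""
        if self.decision is not None:
            raise ValueError(f"run {self.run_id} already decided")
        if action not in self.allowed_actions:
            raise ValueError(f"{action.value} not allowed for run {self.run_id}")
        self.decision = action
        self.decided_by = reviewer


# Hypothetical example case.
item = ReviewItem(
    run_id="run-001",
    inputs={"ticket": "Invoice 1042 disputed"},
    model_output="Draft reply to the customer",
    created_at=datetime.now(timezone.utc),
    routing_reasons=["confidence 0.62 below threshold 0.80"],
)
item.act(Action.ESCALATE, reviewer="ops-lead")
```

If a field on this item has to be filled in by asking engineering, that is the dependency the section warns about.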

#Exception routing with rules, not heroics

Exceptions need routing rules written down, not judgment calls by whoever happens to be online.

Define what sends a case to the console before you ship. Dollar thresholds. Client tiers. Confidence scores below a cutoff. Keyword or category matches. Missing required fields. Combinations of the above. Put those rules in configuration that operations can adjust within bounds engineering sets — not buried in a prompt or hard-coded in a script nobody wants to touch.
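A minimal sketch of rules-as-configuration might look like the following. The config keys, thresholds, and the `routing_reasons` function are assumptions made for illustration; real values depend on the workflow.

```python
# Hypothetical routing configuration; every key and value here is an assumption.
ROUTING_CONFIG = {
    "min_confidence": 0.80,
    "max_auto_amount": 5_000.00,
    "manual_review_tiers": {"enterprise"},
    "keywords": {"cancel", "breach", "legal"},
    "required_fields": {"customer_id", "amount"},
}


def routing_reasons(case: dict, config: dict = ROUTING_CONFIG) -> list:
    """Return every rule that fired; an empty list means the automated path."""
    reasons = []
    missing = sorted(f for f in config["required_fields"] if case.get(f) is None)
    if missing:
        reasons.append(f"missing required field(s): {', '.join(missing)}")
    if case.get("confidence", 0.0) < config["min_confidence"]:
        reasons.append("confidence below threshold")
    amount = case.get("amount")
    if amount is not None and amount > config["max_auto_amount"]:
        reasons.append("amount above limit")
    if case.get("client_tier") in config["manual_review_tiers"]:
        reasons.append("client tier requires manual review")
    text = case.get("text", "").lower()
    hits = sorted(k for k in config["keywords"] if k in text)  # simple substring match
    if hits:
        reasons.append(f"keyword match: {', '.join(hits)}")
    return reasons


routine = {"customer_id": "c1", "amount": 40.0, "confidence": 0.95,
           "client_tier": "standard", "text": "How do I reset my password?"}
risky = {"customer_id": "c2", "amount": 40.0, "confidence": 0.95,
         "client_tier": "enterprise", "text": "We will cancel after the breach."}

print(routing_reasons(routine))  # []
print(routing_reasons(risky))
```

Returning the list of reasons, rather than a bare yes or no, is what lets the console show why a case is in the queue.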

Good routing is conservative where the cost of a wrong automatic action is high. A draft contract sent to the wrong client is worse than a contract held in the queue for an extra hour. A support reply auto-sent during an escalation is worse than a delayed reply reviewed by a human. Set thresholds accordingly.

Good routing is also specific about ownership. Each queue has a named owner or a named rotation. Unowned queues become queues nobody clears. Cases age. SLAs slip. Leadership asks why the AI investment is not delivering speed. The AI delivered speed on the pattern cases. The exception cases rotted because nobody was watching the console.
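Ownership and aging can be checked mechanically. The sketch below assumes a hypothetical `QUEUES` table that maps each queue to an owner and an SLA; the names and durations are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical queue registry: each queue names its owner and its SLA.
QUEUES = {
    "billing-disputes": {"owner": "ops-billing-rotation", "sla": timedelta(hours=4)},
    "contract-review": {"owner": "legal-ops", "sla": timedelta(hours=24)},
}


def overdue(cases: list, now: datetime) -> list:
    """Return (case_id, owner, age) for cases past their queue's SLA, oldest first."""
    late = []
    for case in cases:
        queue = QUEUES[case["queue"]]  # a KeyError here means an unowned queue
        age = now - case["queued_at"]
        if age > queue["sla"]:
            late.append((case["case_id"], queue["owner"], age))
    return sorted(late, key=lambda row: row[2], reverse=True)


now = datetime(2026, 6, 14, 12, 0, tzinfo=timezone.utc)
cases = [
    {"case_id": "c-1", "queue": "billing-disputes", "queued_at": now - timedelta(hours=5)},
    {"case_id": "c-2", "queue": "billing-disputes", "queued_at": now - timedelta(hours=1)},
    {"case_id": "c-3", "queue": "contract-review", "queued_at": now - timedelta(hours=30)},
]
late_cases = overdue(cases, now)
```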

Bad routing sends everything to the console. If eighty percent of cases need human review, you have built a manual workflow with extra steps. Tune the boundary. Adjust confidence thresholds. Fix the data layer so the model has better inputs. The console is for exceptions, not for the ordinary.

Bad routing sends exceptions to Slack. Slack messages are not a queue. They do not sort by priority. They do not record structured decisions. They do not survive a handoff when the engineer who built the pilot moves on. Pull exceptions out of Slack into a system with ownership, audit, and SLA — the same way you would pull approvals out of Slack into an approvals system.

#Audit: the Friday afternoon test

Someone will need to debug a wrong answer. The question is whether that someone is an engineer with log access or an ops lead with a screen.

The Friday afternoon test: open the last case that produced a bad outcome. Reconstruct what happened in under five minutes. What inputs did the model see? What did it produce? What routing rule sent it to review? What did the human do? What shipped to the customer or the downstream system?

If you cannot pass that test, the console is not production-ready regardless of how good the model output looks on happy-path demos.

Audit means every run stores its input snapshot, its output, the prompt or rules version used, the routing decision, and the human action taken. Audit means those records are searchable by case identifier, date, client, or workflow step — not grep across three log files.
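As a rough sketch of what "searchable" could mean in practice, the example below keeps one row per run in SQLite using Python's standard `sqlite3` module. The table name, columns, and helper functions are hypothetical; any database with indexed lookups would serve the same purpose.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example only
conn.execute(
    """
    CREATE TABLE audit_runs (
        run_id TEXT PRIMARY KEY,
        case_id TEXT NOT NULL,
        client TEXT NOT NULL,
        workflow_step TEXT NOT NULL,
        started_at TEXT NOT NULL,      -- ISO 8601 UTC timestamp
        input_snapshot TEXT NOT NULL,  -- JSON of the frozen inputs
        output TEXT NOT NULL,
        rules_version TEXT NOT NULL,   -- prompt or rules version used
        routing_decision TEXT NOT NULL,
        human_action TEXT              -- NULL until a reviewer acts
    )
    """
)
# Indexes so lookups by case, client, or date do not scan every row.
conn.execute("CREATE INDEX idx_audit_case ON audit_runs (case_id)")
conn.execute("CREATE INDEX idx_audit_client_date ON audit_runs (client, started_at)")


def record_run(run: dict) -> None:
    conn.execute(
        "INSERT INTO audit_runs VALUES (:run_id, :case_id, :client, :workflow_step,"
        " :started_at, :input_snapshot, :output, :rules_version,"
        " :routing_decision, :human_action)",
        run,
    )


def runs_for_case(case_id: str) -> list:
    return conn.execute(
        "SELECT run_id, routing_decision, human_action FROM audit_runs"
        " WHERE case_id = ? ORDER BY started_at",
        (case_id,),
    ).fetchall()


record_run({
    "run_id": "run-001", "case_id": "case-42", "client": "acme",
    "workflow_step": "draft_reply", "started_at": "2026-06-12T15:04:00Z",
    "input_snapshot": json.dumps({"ticket": "Invoice 1042 disputed"}),
    "output": "Draft reply to the customer", "rules_version": "rules-v7",
    "routing_decision": "review: amount above limit",
    "human_action": "approved_with_edit",
})
print(runs_for_case("case-42"))
```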

Audit is not bureaucracy for its own sake. It is how you answer the question your leadership will ask after the first visible failure: "What happened, and how do we prevent it?" Without audit, the honest answer is "we do not know," and the next action is "shut it down until engineering investigates."

The firms that pass the Friday afternoon test treat audit as a first-class feature shipped alongside the model call — not a follow-up ticket for next quarter.

#Architecture cues for engineering leads

The reviewer console sits between the agent and the real world. A few architectural decisions keep it maintainable.

State before intelligence. Store the workflow state — pending, running, awaiting review, approved, rejected, failed — in your application database, not in the model's memory. The model is stateless. Your system is not.
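The states listed above can be made explicit as a small transition table. The states come from the text; the allowed transitions are an assumption for this sketch, with `RUNNING` to `APPROVED` standing in for the automated path.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    AWAITING_REVIEW = "awaiting_review"
    APPROVED = "approved"
    REJECTED = "rejected"
    FAILED = "failed"


# Assumed transition table; a real workflow may allow retries or reassignment.
ALLOWED = {
    State.PENDING: {State.RUNNING},
    State.RUNNING: {State.AWAITING_REVIEW, State.APPROVED, State.FAILED},
    State.AWAITING_REVIEW: {State.APPROVED, State.REJECTED},
    State.APPROVED: set(),
    State.REJECTED: set(),
    State.FAILED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = State.PENDING
state = transition(state, State.RUNNING)
state = transition(state, State.AWAITING_REVIEW)
state = transition(state, State.APPROVED)
```

In production the current state would be a column in the application database, updated in the same transaction as the action that caused the change.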

Queues for both sides. The model runs on a job queue. Human review items sit in a review queue. Both queues have owners, both have SLAs, both have visibility. A case should never exist only in a worker's memory.

Input snapshots. When a run starts, freeze the inputs it used. If the CRM record changes after the draft was generated but before the human approved it, you need to show both versions and flag the drift. Approving stale output is a production failure mode that audit catches.
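One simple way to detect drift is to fingerprint the frozen inputs and compare against the live record at approval time. The function names and sample records below are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a JSON-serializable record (key order does not matter)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_fields(snapshot: dict, current: dict) -> dict:
    """Map each changed field to its (snapshot value, current value) pair."""
    keys = snapshot.keys() | current.keys()
    return {
        k: (snapshot.get(k), current.get(k))
        for k in sorted(keys)
        if snapshot.get(k) != current.get(k)
    }


# Frozen when the run started.
snapshot = {"plan": "pro", "balance": 120.0}
# Live CRM record when the reviewer opens the case.
current = {"balance": 0.0, "plan": "pro"}

if fingerprint(snapshot) != fingerprint(current):
    print("Inputs changed since the draft was generated:", drifted_fields(snapshot, current))
```

The hash gives a cheap equality check; the field-level comparison is what the reviewer would actually see.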

Action logging. Every human action in the console writes a record: who, what, when, which case, what changed. "Approved" is not enough. "Approved with edit" should show the diff.
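A sketch of an action record that carries the diff, using `difflib.unified_diff` from the Python standard library. `ActionRecord` and `log_action` are hypothetical names for this example.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActionRecord:
    case_id: str
    reviewer: str   # who
    action: str     # what
    at: datetime    # when
    diff: str       # what changed; empty when the output shipped unchanged


def log_action(case_id: str, reviewer: str, action: str,
               original: str, final: str) -> ActionRecord:
    diff = "\n".join(
        difflib.unified_diff(
            original.splitlines(),
            final.splitlines(),
            fromfile="model_output",
            tofile="shipped_output",
            lineterm="",
        )
    )
    return ActionRecord(case_id, reviewer, action, datetime.now(timezone.utc), diff)


record = log_action(
    "case-42", "ops-lead", "approved_with_edit",
    original="Hello,\nYour refund is $50.",
    final="Hello,\nYour refund is $55.",
)
print(record.diff)
```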

Configuration, not prompts, for business rules. Routing thresholds, client tiers, exclusion lists, and approval limits belong in configuration the ops owner can adjust. Prompts are for task framing and output format. When business rules live in prompts, every policy change requires an engineer.
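A minimal sketch of "adjustable within bounds": engineering defines the allowed range for each setting, and an ops change is accepted only inside it. The setting names and ranges are assumptions for illustration.

```python
# Hypothetical bounds set by engineering: (lowest allowed, highest allowed).
BOUNDS = {
    "min_confidence": (0.70, 0.99),
    "max_auto_amount": (0.0, 10_000.0),
}


def apply_ops_change(config: dict, key: str, value: float) -> dict:
    """Return a new config with the change applied, or raise if out of bounds."""
    if key not in BOUNDS:
        raise KeyError(f"{key} is not adjustable by operations")
    low, high = BOUNDS[key]
    if not low <= value <= high:
        raise ValueError(f"{key}={value} is outside the allowed range [{low}, {high}]")
    return {**config, key: value}  # the original config is left untouched


config = {"min_confidence": 0.80, "max_auto_amount": 5_000.0}
stricter = apply_ops_change(config, "min_confidence", 0.90)
```

Returning a new dictionary instead of mutating in place makes it straightforward to log the before and after values of each change.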

Idempotent downstream actions. If a human approves a case and the send job retries, it should not double-send. Design approval-to-action handoffs so repeating them is safe.
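One common way to make the handoff safe is an idempotency key claimed before the send. The sketch below uses SQLite's `INSERT OR IGNORE` on a primary key; the table, the key format, and the `outbox` list standing in for the real send are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example only
conn.execute(
    "CREATE TABLE sent_actions ("
    " idempotency_key TEXT PRIMARY KEY,"
    " sent_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)

outbox = []  # stand-in for the real downstream send (email, API call, ...)


def send_once(case_id: str, approval_id: str, payload: str) -> bool:
    """Send the payload unless this approval was already handled."""
    key = f"{case_id}:{approval_id}"
    cur = conn.execute(
        "INSERT OR IGNORE INTO sent_actions (idempotency_key) VALUES (?)", (key,)
    )
    if cur.rowcount == 0:  # key already present: this is a retry
        return False
    outbox.append(payload)
    return True


first = send_once("case-42", "approval-1", "Reply to customer")
retry = send_once("case-42", "approval-1", "Reply to customer")
```

Claiming the key before sending avoids double-sends but means a crash between the claim and the send needs separate handling, for example a reconciliation job; that trade-off is a design choice, not something this sketch resolves.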

These are not exotic patterns. They are the same infrastructure you would build for any multi-step operational workflow. The AI call is one step. The reviewer console is another. The architecture treats both as production software.

#What to build before the next model upgrade

If you have an agent workflow in production — or one about to graduate from pilot — audit the human handoff before you touch the model.

Does a named person own the review queue? Can they clear cases without asking engineering? Can they pass the Friday afternoon test on the last bad outcome in under five minutes? Do routing rules live in configuration, or in a prompt only engineering can edit?

If any answer is no, that is the build priority. A better model on a broken handoff produces better wrong answers faster.

Ship the console queue first. Connect it to the agent's exception routing. Store input snapshots and action logs from day one. Train the ops owner on the screen, not on the prompt. Then upgrade the model — and you will be able to tell whether accuracy improved, because you have the audit trail to measure it.
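With action logs in place, one rough proxy for accuracy is the share of reviewed cases approved without edits, grouped by model or rules version. The record shape and action labels below are hypothetical, and the metric only covers cases that reached review.

```python
from collections import defaultdict


def unchanged_approval_rate(records: list) -> dict:
    """Share of reviewed cases approved without edits, per rules version."""
    totals = defaultdict(int)
    clean = defaultdict(int)
    for r in records:
        if r["human_action"] is None:  # not reviewed yet
            continue
        totals[r["rules_version"]] += 1
        if r["human_action"] == "approved":
            clean[r["rules_version"]] += 1
    return {version: clean[version] / totals[version] for version in totals}


# Hypothetical audit records before and after a model upgrade.
records = [
    {"rules_version": "v1", "human_action": "approved"},
    {"rules_version": "v1", "human_action": "approved_with_edit"},
    {"rules_version": "v1", "human_action": "rejected"},
    {"rules_version": "v1", "human_action": "approved"},
    {"rules_version": "v2", "human_action": "approved"},
    {"rules_version": "v2", "human_action": "approved"},
    {"rules_version": "v2", "human_action": "approved_with_edit"},
    {"rules_version": "v2", "human_action": "approved"},
    {"rules_version": "v2", "human_action": None},
]
rates = unchanged_approval_rate(records)
print(rates)  # {'v1': 0.5, 'v2': 0.75}
```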
