When AI Workflows Actually Replace the BPO Line

Brad Taylor, May 21, 2026


Every AI vendor on the planet now claims to replace your BPO line. Most of those claims do not survive a quarter in production. The ones that do share a small, boring set of preconditions — and a unit-economics story that holds up when you actually do the math.

We get called in on both ends of this. Before the project starts, when a buyer is trying to figure out whether the pitch from the AI vendor is real. After the project, when an in-house team or another agency built something that worked in the demo and then quietly stopped working.

Here is the shape of an AI workflow that actually replaces document-processing labor.

1. The workflow is narrow, and the system of record is named

FNOL intake into Guidewire ClaimCenter. ACORD form processing into a carrier's underwriting system. Bills of lading into MercuryGate. Lease abstraction into Yardi or VTS. Loan file indexing into Encompass.

Every one of these is a workflow you can point at. There is a defined input (a document or a small set of document types), a defined extraction (a list of fields, with rules), and a defined output (a structured record landing in a system your team already uses).

The pitches that fail almost always fail here. "AI for insurance." "AI for healthcare back office." Those are categories, not workflows. You cannot replace a labor line you cannot point at.
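One quick way to test whether a pitch is a workflow or a category is to try writing it down as a spec. The sketch below is only an illustration: the class names, field names, and rules are hypothetical and do not come from any vendor's schema. It checks that the input, the extraction, and the output are each filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    name: str       # field to extract (hypothetical names below)
    required: bool  # must be present before the record can post
    rule: str       # plain-language validation rule


@dataclass(frozen=True)
class WorkflowSpec:
    name: str
    document_types: tuple[str, ...]  # the defined input
    fields: tuple[FieldRule, ...]    # the defined extraction
    system_of_record: str            # where the defined output lands

    def gaps(self) -> list[str]:
        """Return the parts of the spec that are still undefined."""
        missing = []
        if not self.document_types:
            missing.append("document_types")
        if not self.fields:
            missing.append("fields")
        if not self.system_of_record.strip():
            missing.append("system_of_record")
        return missing


fnol = WorkflowSpec(
    name="FNOL intake",
    document_types=("fnol_form",),
    fields=(
        FieldRule("policy_number", True, "matches an active policy"),
        FieldRule("loss_date", True, "not in the future"),
    ),
    system_of_record="Guidewire ClaimCenter",
)

pitch = WorkflowSpec("AI for insurance", (), (), "")

print(fnol.gaps())   # []
print(pitch.gaps())  # ['document_types', 'fields', 'system_of_record']
```

If the spec cannot be filled in without hand-waving, the labor line probably cannot be pointed at either.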

2. There is an eval harness from day one

If you cannot say what the system's accuracy number is — measured against a real set of examples scored by humans — you do not have an AI workflow. You have a demo.

We ship every workflow with an evaluation harness. A few hundred to a few thousand real examples, scored against what your team would have produced. Accuracy is a number we watch every day. When a new model ships, we test against the same eval set before we migrate. When the workflow changes, we re-run the evals to catch regressions before users do.

This is the part of the work that almost nobody markets and almost nobody skips successfully. Without it, accuracy slips, and the BPO line you replaced quietly creeps back.
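To make the idea concrete, here is a minimal sketch of an eval harness, assuming exact-match scoring per field against human-scored examples. The function names, the toy extractors, and the two sample documents are all hypothetical. A real harness would hold hundreds or thousands of examples and would likely need fuzzier matching for dates, names, and amounts.

```python
from typing import Callable

Record = dict[str, str]
Example = tuple[str, Record]  # (document text, fields a human scored as correct)


def field_accuracy(extract: Callable[[str], Record], examples: list[Example]) -> float:
    """Share of expected fields that the extractor reproduces exactly."""
    correct = 0
    total = 0
    for document, expected in examples:
        predicted = extract(document)
        for name, value in expected.items():
            total += 1
            if predicted.get(name) == value:
                correct += 1
    return correct / total if total else 0.0


def safe_to_migrate(current: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Gate a model or workflow change on the same eval set."""
    return candidate >= current - tolerance


# Toy eval set and toy extractors, for illustration only.
examples: list[Example] = [
    ("policy P-100 loss 2026-01-05", {"policy_number": "P-100", "loss_date": "2026-01-05"}),
    ("policy P-200 loss 2026-02-10", {"policy_number": "P-200", "loss_date": "2026-02-10"}),
]


def current_model(document: str) -> Record:
    parts = document.split()
    return {"policy_number": parts[1], "loss_date": parts[3]}


def candidate_model(document: str) -> Record:
    parts = document.split()
    return {"policy_number": parts[1]}  # regression: drops loss_date


current_score = field_accuracy(current_model, examples)
candidate_score = field_accuracy(candidate_model, examples)
print(f"current: {current_score:.2f}")      # current: 1.00
print(f"candidate: {candidate_score:.2f}")  # candidate: 0.50
print(safe_to_migrate(current_score, candidate_score))  # False
```

The point of the gate is that the candidate is scored on the same examples as the current system, so a regression shows up as a number before it shows up in production.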

3. The unit economics actually work

The honest math on a BPO replacement is per-file. What is your team or vendor currently paying per file on this workflow? What can the AI workflow run at, fully loaded, today? What does the gap have to be before the savings justify the build?

On most document workflows the per-file cost at a BPO is $8 to $50, sometimes higher on complex specialties (lease abstraction, surgical coding). Fully loaded inference cost on a tuned AI workflow is usually $0.20 to $2 per file at today's prices. The labor that does not go away is exception handling: typically 5–20% of volume is routed to a human in the loop.

Net savings of 50–80% over the engagement are normal. Replacing 95% of the labor is rare and is almost always the wrong target. Augmentation that compounds beats one-shot replacement that drifts.
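The per-file math can be sketched in a few lines. The BPO rate, inference cost, and exception rate below are picked from inside the ranges quoted above. The build cost, the file volume, and the assumption that each exception costs a full BPO touch are hypothetical placeholders to swap for your own numbers.

```python
def running_cost_per_file(inference_cost: float, exception_rate: float,
                          human_cost_per_exception: float) -> float:
    """Per-file cost once live: inference plus the share of files a human still touches."""
    return inference_cost + exception_rate * human_cost_per_exception


def net_savings(bpo_cost: float, running_cost: float,
                build_cost: float, files: int) -> float:
    """Fraction saved versus the BPO rate, with the build amortized over `files`."""
    ai_cost = running_cost + build_cost / files
    return 1 - ai_cost / bpo_cost


def break_even_files(bpo_cost: float, running_cost: float, build_cost: float) -> float:
    """Files needed before the per-file gap pays back the build."""
    gap = bpo_cost - running_cost
    if gap <= 0:
        raise ValueError("no per-file gap, so the build never pays back")
    return build_cost / gap


# Illustrative inputs only; see the assumptions stated above.
BPO_COST = 20.00                  # $ per file at the BPO
INFERENCE_COST = 1.00             # $ per file, fully loaded
EXCEPTION_RATE = 0.15             # share of files routed to a human
HUMAN_COST_PER_EXCEPTION = 20.00  # assumes an exception costs a full BPO touch
BUILD_COST = 150_000.00           # hypothetical one-time build cost
FILES = 50_000                    # hypothetical volume over the engagement

running = running_cost_per_file(INFERENCE_COST, EXCEPTION_RATE, HUMAN_COST_PER_EXCEPTION)
print(f"running cost per file: ${running:.2f}")                                # $4.00
print(f"net savings: {net_savings(BPO_COST, running, BUILD_COST, FILES):.0%}")  # 65%
print(f"break-even: {break_even_files(BPO_COST, running, BUILD_COST):,.0f} files")  # 9,375
```

With these inputs the exception labor ($3.00 per file) costs three times the inference ($1.00 per file), which is why the exception rate, not the model price, usually decides whether the savings hold.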

4. Exceptions go to people who can actually fix them

Every AI workflow generates exceptions. The question is where they go. The pitches that fail dump exceptions into a queue nobody owns; the workflows that work route them to the same operators who handled the work before, in a tool they already use, with the context the AI saw.

This is mostly an operations design problem, not an AI problem. We spend more design time on exception routing than on the model. It is also where most agency projects die quietly — the AI worked, but the operating model never caught up.
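A rough sketch of what owned exception routing could look like follows. It assumes exceptions are detected with a per-field confidence threshold, which is one common approach and not necessarily how any particular workflow does it. The class names, queue name, threshold, and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extraction:
    file_id: str
    fields: dict[str, str]           # extracted values
    confidence: dict[str, float]     # per-field confidence, 0 to 1 (assumed available)
    source_snippets: dict[str, str]  # the text the model relied on, per field


@dataclass
class ExceptionTicket:
    file_id: str
    queue: str                 # a named queue with an owner, never a catch-all
    flagged_fields: list[str]
    context: dict[str, dict[str, str]]  # what the operator needs to fix each field


def route(extraction: Extraction, owner_queue: str,
          threshold: float = 0.9) -> Optional[ExceptionTicket]:
    """Return None for straight-through files, or a ticket carrying the model's context."""
    flagged = [name for name, score in extraction.confidence.items() if score < threshold]
    if not flagged:
        return None
    context = {
        name: {
            "extracted": extraction.fields.get(name, ""),
            "snippet": extraction.source_snippets.get(name, ""),
        }
        for name in flagged
    }
    return ExceptionTicket(extraction.file_id, owner_queue, flagged, context)


claim = Extraction(
    file_id="claim-001",
    fields={"policy_number": "P-100", "loss_date": "2026-01-05"},
    confidence={"policy_number": 0.99, "loss_date": 0.62},
    source_snippets={
        "policy_number": "Policy No. P-100",
        "loss_date": "Date of loss: 1/5/26",
    },
)

ticket = route(claim, owner_queue="fnol-operators")
print(ticket.queue, ticket.flagged_fields)  # fnol-operators ['loss_date']
```

The model call does not appear here at all. The design work is in deciding who owns `owner_queue` and what goes into `context`.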

Where this works, and where it does not

AI workflows replace BPO labor cleanly when the document is structured-ish, the system of record is named, the eval harness is built, and operations is willing to redesign exception handling. They do not replace labor when the work is genuinely judgment-heavy at every step, when the documents have no template, or when nobody on your side will own the operating-model redesign.

If you are evaluating a vendor — including us — the right questions are not about the model. Ask to see the eval harness. Ask to see the per-file cost math. Ask where exceptions go. The pitches that survive those questions are the ones worth running.