Where to Point AI: A Deployment Map for Carriers

Kyle Nakatsuji·June 24, 2026·8 min read

Accuracy tells you whether a model works. It doesn't tell you where it's safe to deploy. Two questions do.

The short version (by humans, for busy humans)

Most carriers decide whether to deploy AI by asking one question: is the model accurate enough? That is the wrong question. The pilots that stall are rarely the inaccurate ones. They stall because someone pointed a capable model at a decision where being wrong is expensive and hard to undo.

Ask two questions instead.

First: if the model is wrong, how reversible is the damage? A misrouted claim gets rerouted in seconds, but a filed rate, a declined applicant, a reserve set on a large loss, a denied claim? Those carry consequences you cannot quietly take back.

Second: is a human still making the call, or is the model deciding on its own?

Put those two questions on a grid and your roadmap sorts itself. Reversible work, whether a person stays in the loop or the model runs it alone, is where you deploy now. That means triage, prefill, document synthesis, fraud flags, and the large majority of simple claims. This is where AI compounds.

High-harm, hard-to-reverse decisions with a human still deciding are copilot territory. The model sharpens the expert. It surfaces precedent, pressure-tests the read, flags what the file is missing. It does not make the call.

High-harm, hard-to-reverse decisions with the model deciding alone: don't. That is the corner where pilots die. One confident, wrong output on a rate or a denial, and you have a market-conduct problem and a senior professional who never trusts the tool again.

Accuracy tells you whether a model works. It does not tell you where it is safe to deploy. Reversibility and authority do.

That's the post. If you want a second read on where your own deployments sit on that grid, reach out.

The deployment map: reversibility on one axis, authority on the other.

The full version (human on the loop — for depth, yours or your AI's)

Picture the meeting

A chief underwriting officer is six vendor demos into the year. Every demo is sharp and they all show the same thing: documents in, decisions out, straight through, no human in the middle. The accuracy numbers are good, and they are real.

She asks each vendor the same question. Which of these decisions should the model never make on its own?

She does not get a clean answer from any of them. That silence is the whole problem, and it is the reason her last pilot is sitting in limbo instead of in production.

Accuracy is the wrong gate

Most carriers gate AI deployment on accuracy. It is the natural instinct. The pilot either hits its accuracy target or it does not, and the number feels like the decision.

Accuracy is necessary, but it does not decide whether a deployment survives contact with production. That comes down to what happens when the model is wrong, and where in the operation the wrongness lands.

Two carriers can buy the identical model and get opposite outcomes. One points it at claims triage, where a mistake is caught and corrected in minutes. The other points it at coverage decisions, where a mistake becomes a denied claim, then a complaint, then a regulator's question. Same model. Same accuracy. One compounds, one stalls. What changed was the target.

Here's what we learned

We spent nearly a decade running AI in production inside a live carrier, in auto, the shortest-tail and most structured line there is. The lesson that held up across every workflow was this: you sort deployments by reversibility and authority, not by accuracy.

The two questions that draw the map

The first question is reversibility. If the model produces a wrong output, how hard is it to undo? A misrouted claim is rerouted in seconds. A prefilled field is corrected before anyone relies on it. Those are cheap mistakes. A filed or bound rate, a declined applicant, a reserve set on a large loss, a denied claim: those are expensive, and several of them are close to irreversible once they leave the building.

The second question is authority. Is a human still the decision-maker, with the model assisting? Or is the model deciding and acting on its own?

Cross the two and you get four boxes. Walk them in order.

Reversible, human in the loop: deploy now. The augmentation layer. The model drafts, summarizes, extracts, and routes, and a person stays on the decision. Feedback is fast, because you are measuring against an answer you already have, and the cost of a miss is a few minutes of rework. Start here: document synthesis, prefill, and claims triage. It is where the returns build fastest, and it is the layer most carriers under-build, because it does not look like much in a board deck.

Reversible, model alone: deploy and monitor. Straight-through processing lives here, and it works for the large majority of simple claims: glass, simple collision, and clean first-party property. The decision is bounded, a wrong answer is cheap and catchable, and the volume is high enough that automation pays for itself. Watch the exception rate and the model drift, then let it run.

High-harm, human in the loop: copilot only. The complex work, and where the money is. A small share of claims are the complex ones, and they drive most of the loss dollars. Large-loss reserving, disputed coverage, a bodily-injury claim with an attorney attached. These are judgment calls, synthesis under uncertainty rather than lookups.

The model belongs here, but as a copilot. It surfaces the relevant precedent, pressure-tests the adjuster's read, and flags the document nobody pulled. It does not set the reserve.

Designed that way it compounds: every hard case the adjuster works alongside it makes the next one faster. Designed to overrule the adjuster instead, it does not survive its first visible miss.

High-harm, model alone: don't. The bottom-right corner is where pilots go to die. A rate, a declination, or a denial, decided and issued by a model with no human on the call.

The NAIC's Model Bulletin on the Use of Artificial Intelligence Systems by Insurers is explicit that the bar here is risk-based: the more a decision can harm a consumer, and the less a human is involved, the higher the standard you are held to. In this corner, accuracy does not save you.

"The model said so" does not survive a market-conduct exam or a bad-faith deposition, and a bad-faith loss can run past your policy limits. One confident wrong output, and you have lost the regulator and the senior professional whose trust you needed.

The bar is not impossible. It is just high, and almost no program today clears it for these decisions unattended.

So what

Before you grade another pilot on accuracy, plot it. Take your current AI roadmap, every initiative on it, and drop each one into a box. Sort by two questions: how reversible a wrong answer is, and who is still making the call. Model quality does not place a single initiative on the grid.

Most roadmaps cluster in the wrong places. They under-build the top-left, where the compounding is cheap and safe, because it does not impress a board. And they reach into the bottom-right, because a model that decides on its own demos better than a model that helps a human decide. The demo is the trap. The deployment is the reality.

So deploy from the safe corner outward. Put the model to work where mistakes are cheap and a person stays in the loop. Make it a copilot on the decisions that carry judgment and consequence. And keep it out of the corner where a wrong answer is both irreversible and unattended. Anything sitting in that corner is your stall risk, and it deserves more of your attention than the accuracy score on your next pilot.

If you want a second read on where your own deployments sit on that grid, reach out. We have spent enough time in each box to know which ones pay off and which ones quietly stall a program.

Kyle Nakatsuji is the founder of Dearborn Labs and CEO of Clearcover, where the team has built and run production AI for nearly a decade.

// Key Questions

How should a carrier decide where to deploy AI?

A carrier should decide where to deploy AI by asking two questions about each use case, not one. The common question, "is the model accurate enough," is necessary but not sufficient. The better test is: how reversible is a wrong output, and is a human still making the decision? Reversible, low-harm work (triage, prefill, document synthesis, fraud flags) is safe to deploy now. Irreversible, high-harm decisions made by a model alone (a rate, a declination, a denial) are where pilots stall and should stay human-led.

Why do insurance AI pilots stall even when the model is accurate?

Insurance AI pilots stall even when the model is accurate because accuracy does not determine survival in production; the reversibility and stakes of the decision do. A model can clear its accuracy target in a sandbox and still fail when it is pointed at a regulated, hard-to-reverse decision, because one confident wrong output on a rate or a claim denial triggers regulatory exposure and destroys user trust. The model was rarely the bottleneck. The target was.

Which insurance decisions should AI not make on its own?

AI should not autonomously make high-harm, hard-to-reverse decisions, including setting or filing rates, declining applicants, denying claims, or setting reserves on large losses. These decisions carry regulatory and bad-faith exposure that can exceed policy limits, and the NAIC's 2023 Model Bulletin holds them to a higher standard precisely because they can harm consumers and often remove the human from the loop. On these decisions, AI should operate as a copilot to a human expert, not as the decision-maker.

What does the NAIC AI Model Bulletin require for high-stakes insurance decisions?

The NAIC's Model Bulletin on the Use of Artificial Intelligence Systems by Insurers (2023) takes a risk-based approach: the required rigor scales with the nature of the decision, the potential harm to the consumer, and how much a human is involved. It does not ban complex models or demand an audit trail for every output. It does mean that a model touching a rate, a declination, or a claim denial needs a human-readable rationale that can withstand a market-conduct exam, while a model that triages a queue or summarizes a file sits under a much lighter bar.

Where does AI create the most value in insurance claims?

AI creates the most measurable near-term value in high-frequency, low-complexity claims work, where straight-through processing can handle the large majority of simple claims and the feedback loop is days, not months. The harder economics sit in complex claims, a small share of total claim count that drives most of the loss dollars; there AI pays off as a copilot that sharpens adjuster judgment rather than as an automation that replaces it.

What is the difference between an AI copilot and autonomous AI in insurance?

An AI copilot assists a human who still owns the decision; autonomous AI makes and acts on the decision itself. In insurance, the distinction maps directly to risk: copilots are the right design for high-harm, judgment-heavy decisions (large-loss reserving, disputed coverage), where the model surfaces precedent and pressure-tests the read but the expert decides. Autonomous AI is appropriate only where a wrong answer is cheap and reversible, such as routing or prefilling, and should be kept out of regulated, irreversible decisions.

← Back to Insights

First things first.