Where Insurance AI Compounds, and Where It Stalls

Kyle Nakatsuji · June 10, 2026 · 9 min read

The models are good now. Pilots stall because of where you point them, and telling the compounding layer from the judgment layer is what a decade inside a carrier teaches you.


The short version (by humans, for busy humans)

We know how reading works now: too much AI-written content, so you skim, or your AI reads for you. Fair. This part was written by humans, for busy humans, in under 400 words. The full post below carries the depth, with humans on the loop throughout — read it yourself, or let your AI ingest it. Either works.

The actual point:

The models are good now. That's not why your pilot died. Pilots die because of where you point the AI. Four things about insurance work decide where it pays off and where it won't:

Feedback is slow. You don't find out whether an underwriting decision was good until the losses develop, 12 to 18 months later. But that slow loop only covers the final decision. Triage, prefill, and fraud flags get graded in days, against answers you already have. And for the part that is slow, actuaries have priced on immature data for a century — borrow their methods.

Your data doesn't match across systems. "Closed" in claims isn't "closed" in litigation. That's not an AI-defeater (banks have it worse, and schema-mapping is what these models are best at). It's a sequencing problem. Build the coherent data layer first, and budget it like a real phase.

The valuable decisions are judgment. Don't ship confident automation into a large-loss reserve. Build AI that makes your best adjuster sharper. One confidently wrong answer and they switch it off forever.

Explainability scales with stakes. The NAIC bulletin is risk-based, not absolute. Heavy audit trail where you rate, decline, or deny. Light where a human still makes the call.

Point AI at the data and workflow layer and it compounds. Point it at the regulated, judgment-heavy end and it stalls. Knowing where that line sits is what a decade of running AI inside a live carrier taught us.

That's the post. If you're trying to figure out whether your stalled pilot is a model problem or an aim problem, reach out.


The full version (human on the loop — for depth, yours or your AI's)

Every stalled insurance AI pilot gets roughly the same autopsy: The vendor overpromised. The data wasn't ready. The integration was nothing like how it was described. The business case died in the gap between sandbox and production.

Those diagnoses are real, but they describe symptoms, not the cause.

Here's the part the autopsy misses: the models are really good now. A VP or CIO who has sat through a failed pilot already knows it. The demo worked. The accuracy numbers were real. The thing still didn't reach production. If the model performed in the sandbox, "the model wasn't good enough" can't be the whole story.

The better diagnosis is about aim. AI gets better when you give it the necessary context. Its value compounds when you then point it at the data and workflow layer: prefill, triage, document synthesis, fraud signals, and the work of turning six disconnected systems into one coherent picture. It stalls when you point it, unchanged, at the regulated, judgment-heavy decisions at the end of the chain: the rate, the large-loss reserve, the coverage call.

Insurance work has four structural properties. While they might look like reasons AI fails, they actually mark the line between where AI compounds and where it still needs a human and years of context. Knowing where that line sits is the whole game. It's the thing a decade of running AI inside a live carrier teaches you, not something you reason out from a whiteboard.


1. Underwriting feedback is slow, so you validate it the way actuaries always have

In software, feedback is nearly instant. Code compiles or it doesn't. Tests pass or fail. A recommendation engine knows within hours whether someone clicked.

In underwriting, the signal that a decision was good is the loss experience on the policies you wrote. In short-tail lines, that arrives 12 to 18 months later. In long-tail casualty and workers' comp, it develops for years.

The easy conclusion is that you therefore can't validate the model until the losses come in. That conclusion is wrong, and any actuary will tell you why. Carriers have judged decisions on immature data for over a century. It's called pricing and reserving. You validate against mature accident years where the losses have fully developed, and you use loss-development and credibility-weighted methods to make early reads on the years that haven't. The discipline already exists. AI inherits it rather than breaking it.
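Here is a minimal sketch of those two actuarial moves, for readers who want to see the arithmetic. It develops an immature year to ultimate using a factor taken from mature years, then credibility-weights the result against a prior expectation. Every figure, the function names, and the credibility constant k are illustrative assumptions, not carrier data or a production method. A real triangle has many development ages, and an actuary would select the factors and the constant.

```python
# Illustrative figures only: two fully developed accident years.
mature_years = [
    {"reported_at_12m": 600_000, "ultimate": 900_000},
    {"reported_at_12m": 400_000, "ultimate": 600_000},
]


def development_factor(years):
    """Volume-weighted 12-month-to-ultimate factor from mature years."""
    reported = sum(y["reported_at_12m"] for y in years)
    ultimate = sum(y["ultimate"] for y in years)
    return ultimate / reported


def credibility_blend(observed, prior, n_claims, k):
    """Blend observed and prior with Z = n / (n + k).

    k is a hypothetical credibility constant chosen for this example.
    """
    z = n_claims / (n_claims + k)
    return z * observed + (1 - z) * prior


# Immature year: 500,000 reported at 12 months on 1,000,000 earned premium.
factor = development_factor(mature_years)
developed_losses = 500_000 * factor
developed_loss_ratio = developed_losses / 1_000_000

# Early read: lean on the prior in proportion to how thin the data is.
early_read = credibility_blend(developed_loss_ratio, prior=0.65, n_claims=500, k=500)

print(f"factor={factor:.2f} developed LR={developed_loss_ratio:.2f} early read={early_read:.2f}")
```

Run the same calculation on the cohort a model selected and on the cohort it passed over, and you have an early, hedged comparison long before the losses fully develop.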

The second mistake is assuming every piece of underwriting AI carries the 18-month loop. Most of the value doesn't. Triage, prefill, hit-ratio lift, and fraud flags all produce feedback in days, because you're measuring throughput and accuracy against an answer you already have, not waiting on loss emergence. The slow loop is real for the decision at the very end. It says almost nothing about the work leading up to it.
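The fast loop is simple enough to sketch. Assuming you have a week of model flags and the dispositions a human reviewer later recorded for the same claims (both lists below are made up), grading is a matter of counting agreements:

```python
def precision_recall(flags, outcomes):
    """Grade model flags against reviewer dispositions for the same claims."""
    pairs = list(zip(flags, outcomes))
    true_pos = sum(1 for flagged, confirmed in pairs if flagged and confirmed)
    false_pos = sum(1 for flagged, confirmed in pairs if flagged and not confirmed)
    false_neg = sum(1 for flagged, confirmed in pairs if not flagged and confirmed)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical data: True means "flagged" (model) or "confirmed" (reviewer).
model_flags = [True, True, False, True, False, False, True, False]
reviewer_outcomes = [True, False, False, True, True, False, True, False]

precision, recall = precision_recall(model_flags, reviewer_outcomes)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

No loss emergence is involved. The answer key already exists in the reviewer's work, which is why this kind of measurement closes in days.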


2. Your data doesn't speak to itself, and that's the prework nobody budgets for

A carrier's operational data lives across at least six kinds of system: policy administration, claims management, third-party feeds like motor vehicle reports (MVR), CLUE loss histories, and credit, document imaging, litigation management, and payment and recovery. Each was built by a different vendor, in a different decade, with different field names and different definitions of the same word. "Closed" in the claims system isn't "closed" in litigation. Whether a loss is "at fault" depends on which field you read and which state you're in. The address in the policy system doesn't always match the address on the first notice of loss (FNOL).

This is real, and it's the single most underestimated line item in any carrier AI project.

Two things, though, that the failure story gets wrong. First, fragmentation isn't an insurance disease. Banks and hospitals have it worse. Second, reconciling messy, inconsistent schemas is something modern AI is genuinely good at. Mapping one system's vocabulary onto another's is close to the canonical strength of a large language model.

So the fragmentation problem is really a sequencing problem. You build the coherent data layer first, then you model, and almost every POC timeline assumes that layer already exists. Budget it as its own phase, with its own weeks and its own owner, or the model never gets a fair test.
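What "coherent data layer" means in practice is one agreed definition that every downstream consumer reads. A minimal sketch for the "closed" example, assuming hypothetical status codes for the claims and litigation systems (real systems will use different ones), might look like this. It shows the target of the mapping work, not how a model proposes the mappings:

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    CLOSED_PENDING_LITIGATION = "closed_pending_litigation"
    CLOSED = "closed"


# Hypothetical raw codes per source system, invented for this example.
CLAIMS_CLOSED = {"CLOSED", "CLSD"}
LITIGATION_CLOSED = {"closed", "dismissed", "settled"}


def canonical_status(claims_code, litigation_code=None):
    """Resolve one canonical status from two systems that disagree on 'closed'."""
    if claims_code.strip().upper() not in CLAIMS_CLOSED:
        return Status.OPEN
    if litigation_code is None:
        # No litigation matter on file, so the claims system decides.
        return Status.CLOSED
    if litigation_code.strip().lower() in LITIGATION_CLOSED:
        return Status.CLOSED
    # Claims says closed, but a suit is still active.
    return Status.CLOSED_PENDING_LITIGATION


print(canonical_status("CLSD", "active"))
```

The point of the middle state is that a model trained on "closed" claims should never silently include files that are still in suit.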


3. The valuable decisions are judgment, not rules, so build to sharpen the human

The workflows where AI creates the most value in insurance are the hardest ones: complex underwriting, large-loss reserving, CAT claims, coverage disputes. None is a rule-following task with an answer you can derive from inputs and a procedure.

A senior adjuster working a complex loss is synthesizing incomplete information: the inspection report, the litigation history of the attorney on the other side, the jurisdiction's settlement culture, comparable losses in the book, and years of their own pattern recognition. They're making an inference under uncertainty with long-tail consequences, and that work requires human expertise.

That doesn't mean the work is human-only and AI sits it out. A copilot that just watches isn't what the market is buying right now, and it isn't where the leverage is. The design goal is to sharpen that expertise and reasoning, not replace it. The AI can surface the relevant precedents, pressure-test the inference, flag what the file is missing, or show a range of outcomes instead of a single confident number.
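To make "a range instead of a single number" concrete, here is a small sketch. It assumes a list of settled amounts on comparable losses (the figures are invented) and a hypothetical minimum of five comparables before the tool says anything at all. It reports quartiles, and it declines to answer when the evidence is thin:

```python
import statistics


def outcome_range(comparable_losses, min_comparables=5):
    """Return quartiles of comparable settled losses, or None if evidence is thin."""
    if len(comparable_losses) < min_comparables:
        return None
    low, median, high = statistics.quantiles(
        comparable_losses, n=4, method="inclusive"
    )
    return {"low": low, "median": median, "high": high, "n": len(comparable_losses)}


# Hypothetical settled amounts on comparable losses.
comparables = [18_000, 22_000, 25_000, 31_000, 34_000, 40_000, 47_000, 60_000, 95_000]

print(outcome_range(comparables))
print(outcome_range([10_000, 20_000]))  # too few comparables to say anything
```

Returning nothing on thin evidence is the design choice that matters here. It is the software equivalent of an adjuster saying "I don't have enough to go on," which is what keeps a senior professional trusting the tool.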

The pilots that die are the ones that ship confident automation straight into these decisions. The output sounds authoritative and certain in a moment that calls for calibrated uncertainty. A senior professional catches one confidently wrong answer and never trusts the system again.

Build so the human's judgment gets sharper, and the AI earns its place on the hard cases instead of getting quietly switched off.


4. Explainability is a real bar, and it scales to the stakes of the decision

Some AI work in insurance touches regulated, consumer-facing decisions. There the explainability bar is real, and it's worth being precise about what it actually says.

The reference point is the NAIC's Model Bulletin on the Use of Artificial Intelligence Systems by Insurers, adopted in 2023 and picked up by a growing list of states. The bulletin is risk-based, not absolute. The rigor it expects scales with three things: the nature of the decision, the potential harm to the consumer, and how much a human is in the loop.

A model that helps set a rate, decline an applicant, or deny a claim sits at the high-harm end. There, "the model said so" won't survive a market-conduct exam or a bad-faith deposition, and you need a rationale a person can read and defend. A model that triages a queue, summarizes a file, prefills an application, or flags a claim for a human to look at sits at the low-harm end. The human is the decision-maker, and the bar comes down accordingly.

What the bulletin does not do is ban black-box models or require an audit trail for every output. The practical rule is simpler than "everything must be explainable." Match the explainability you build to where the decision sits on the harm curve, and put your governance budget where the regulator's attention actually is.
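One way to keep that rule from living only in a slide deck is to write it down as a lookup your governance process can run. The sketch below is our illustrative reading of "match explainability to the harm curve," not the bulletin's text. The tier names, the decision labels, and the conservative default for unattended use cases are all assumptions:

```python
from dataclasses import dataclass

# Decisions this post places at the high-harm end.
HIGH_HARM_DECISIONS = {"rate", "decline", "deny_claim"}


@dataclass(frozen=True)
class UseCase:
    name: str
    decision: str        # what the model's output feeds, e.g. "rate" or "triage"
    human_decides: bool  # True if a person makes the final call


def audit_tier(use_case):
    """Return 'heavy' or 'light' audit-trail expectations for a use case."""
    if use_case.decision in HIGH_HARM_DECISIONS:
        return "heavy"
    if use_case.human_decides:
        return "light"
    # Assumption: with no human in the loop, default to the stricter tier.
    return "heavy"


for case in (
    UseCase("pricing model", "rate", human_decides=True),
    UseCase("queue triage", "triage", human_decides=True),
    UseCase("auto-routing", "triage", human_decides=False),
):
    print(case.name, "->", audit_tier(case))
```

Your compliance team will draw the tiers differently. The useful part is that the classification is explicit, reviewable, and applied the same way to every new use case.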


What this means for where you point AI

Taken together, the four properties are a map of where AI compounds and where to hold back.

The carriers making real progress treat the data layer as the first project, not a chore to finish before the real work starts. That means a shared schema, canonical definitions, and agreement on what "closed" and "at fault" mean across systems. That foundation is the work. Everything else runs on top of it.

They validate the actuarial way. They don't sit idle for 18 months waiting for ground truth. They lean on seasoned history and early indicators, the same tools the pricing and reserving teams have used for decades.

They put AI on the judgment-heavy decisions as a copilot that sharpens the human, instead of an oracle that replaces one. And they match explainability to the stakes: building the audit trail where a regulator will ask for it and staying light where a person is already making the call.

There's no special magic here. You point capable models at the parts of the operation where they compound, and you stay honest about the parts where the human, the regulator, and the slow feedback loop still set the terms.


That's the part we learned by shipping. Dearborn Labs is an affiliate within the Clearcover Insurance Holdings family, and we spent a decade building and running AI in production inside Clearcover Insurance Company, a live auto carrier. The four properties above are what a live book taught us, in production, year after year.

We're careful about what auto can and can't tell you. It's short-tail and structured, so it's no stand-in for long-tail casualty. What it gives us is production evidence, not theory, for exactly where these four properties bite and where they don't.

If you're mapping where your own AI deployment stalls and want a second read on whether it's a model problem or an aim problem, reach out.

// Key Questions

Why do insurance AI pilots stall even when the model performs well?

Because the problem is usually aim, not model quality. The models are good now — if the demo worked and the accuracy numbers were real, 'the model wasn't good enough' can't be the whole story. AI compounds when pointed at the data and workflow layer: prefill, triage, document synthesis, fraud signal, and the work of turning six disconnected systems into one coherent picture. It stalls when pointed, unchanged, at the regulated and judgment-heavy decisions at the end of the chain — the rate, the large-loss reserve, the coverage call. Knowing where that line sits is the whole game.

How can carriers validate underwriting AI when loss feedback takes 12 to 18 months?

The same way actuaries have judged decisions on immature data for over a century. Validate against mature accident years where losses have fully developed, and use loss-development and credibility-weighted methods to make early reads on the years that haven't. Also note that most underwriting AI value doesn't carry the 18-month loop at all: triage, prefill, hit-ratio lift, and fraud flags produce feedback in days, because you're measuring throughput and accuracy against an answer you already have. The slow loop is real only for the decision at the very end.

Is data fragmentation a reason insurance AI fails?

No — it's a sequencing problem, not an AI-defeater. Carrier data lives across at least six kinds of system, each with different field names and different definitions of the same word ('closed' in claims isn't 'closed' in litigation). But fragmentation isn't an insurance disease — banks and hospitals have it worse — and reconciling messy, inconsistent schemas is close to the canonical strength of a large language model. The fix is to build the coherent data layer first and budget it as its own phase, with its own weeks and its own owner. Almost every POC timeline assumes that layer already exists, and that's why the model never gets a fair test.

Should AI automate judgment-heavy insurance decisions like large-loss reserving?

No. The highest-value workflows — complex underwriting, large-loss reserving, CAT claims, coverage disputes — are inference under uncertainty with long-tail consequences, not rule-following tasks. The pilots that die are the ones that ship confident automation straight into these decisions; a senior professional catches one confidently-wrong answer and never trusts the system again. The design goal is to sharpen the human's expertise instead of replacing it: surface relevant precedents, pressure-test the inference, flag what the file is missing, and show a range of outcomes instead of a single confident number.

What does the NAIC Model Bulletin actually require for AI explainability?

The NAIC's Model Bulletin on the Use of Artificial Intelligence Systems by Insurers, adopted in 2023, is risk-based, not absolute. The rigor it expects scales with the nature of the decision, the potential harm to the consumer, and how much a human is in the loop. A model that helps set a rate, decline an applicant, or deny a claim sits at the high-harm end and needs a rationale a person can read and defend. A model that triages a queue, summarizes a file, or prefills an application sits at the low-harm end because the human is the decision-maker. The bulletin does not ban black-box models or require an audit trail for every output. Match the explainability you build to where the decision sits on the harm curve.

Where should carriers point AI first to get compounding value?

At the data and workflow layer. Carriers making real progress treat the coherent data layer — shared schema, canonical definitions, agreement on what 'closed' and 'at fault' mean across systems — as the first project, not a chore. They validate the actuarial way using seasoned history and early indicators instead of waiting 18 months for ground truth. They put AI on judgment-heavy decisions as a copilot that sharpens the human, and they match explainability to the stakes: heavy audit trail where they rate, decline, or deny, and light where a person is already making the call.
