The demo looked great. Then what?
You've seen the demo. The AI answers questions, drafts responses, pulls data from your systems. Everyone in the room is impressed. The CEO is nodding. The budget gets approved.
Six weeks later the system is live. And no one can tell you whether it's actually working.
Not "working" as in the server is running. Working as in: is it giving your customers the right answers? Is it handling edge cases? Is it hallucinating? Is it getting better or worse over time?
This is the gap I see again and again. Teams invest heavily in building AI systems but almost nothing in measuring them. And without measurement, you don't have a production system. You have a demo with a login page.
Why evaluation gets skipped
It's not that teams don't care. It's that evaluation feels like a problem you can solve later. The pressure is always to ship. Get it in front of users. Show value.
And unlike traditional software, AI systems don't fail with a clear error message. They fail quietly. A wrong answer here. A hallucination there. A subtle drift in quality that nobody notices until a customer complains. Or worse, until a regulator asks questions.
The other reason evaluation gets skipped is that people don't know how to do it well. With traditional software you write unit tests. With AI systems, the outputs are non-deterministic. The same input can produce different outputs. So how do you test something that's different every time?
What good evaluation actually looks like
Evaluation isn't one thing. It's a set of practices that give you confidence your system is doing what you need it to do. In my experience, it breaks down into three layers.
1. Offline evaluation - before deployment
This is your baseline. Before anything goes live, you need a set of test cases that represent what the system should handle. Real questions from your domain. Real edge cases. Known failure modes.
You run your system against these cases and measure quality. Are the answers grounded in the right source material? Are they factually correct? Are they formatted the way you need?
This sounds simple. But most teams skip it because building a good test set takes effort. It means sitting with domain experts, collecting real examples, defining what "good" looks like for each one. That's not glamorous work. But it's the work that matters.
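The offline layer can be sketched as a small harness. Here `ask` is a hypothetical stand-in for your real system (an LLM call with retrieval, say), stubbed out so the loop runs end to end; the questions and expected facts are illustrative:

```python
# Minimal offline evaluation harness. `ask` is a stand-in for the real
# system; it's stubbed here so the harness runs without external services.

def ask(question: str) -> str:
    canned = {
        "What is the claim limit?": "The claim limit is $10,000 per incident.",
    }
    return canned.get(question, "I don't know.")

# Each case pairs a real question with the facts a good answer must contain.
# Building these with domain experts is the unglamorous work that matters.
TEST_SET = [
    {"question": "What is the claim limit?", "must_contain": ["$10,000"]},
    {"question": "Is flood damage covered?", "must_contain": ["flood"]},
]

def run_offline_eval(test_set: list[dict]) -> tuple[float, list[dict]]:
    results = []
    for case in test_set:
        answer = ask(case["question"])
        passed = all(fact.lower() in answer.lower() for fact in case["must_contain"])
        results.append({"question": case["question"], "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

rate, details = run_offline_eval(TEST_SET)
print(f"pass rate: {rate:.0%}")  # 1 of 2 cases passes with this stub
```

The shape matters more than the scoring logic: a versioned set of cases, a repeatable runner, and a number you can compare across releases.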
2. Online evaluation - after deployment
Offline tests tell you how the system performs in controlled conditions. Online evaluation tells you how it performs in the wild.
This means monitoring real interactions. Tracking user feedback. Flagging low-confidence responses. Sampling outputs for human review.
The best systems I've built include automated quality checks that run on every response. Think of these as guardrails: is the response relevant to the question? Does it reference the right source documents? Does it stay within the boundaries of what it's supposed to do?
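A sketch of such per-response guardrails, using crude lexical heuristics as placeholders (production systems typically use an LLM judge or embedding similarity; the function names, stopword list, and thresholds here are all illustrative):

```python
# Per-response guardrail checks, run on every answer before it reaches the
# user. The lexical-overlap heuristics are placeholders for real judges.

def check_relevance(question: str, response: str, min_overlap: int = 1) -> bool:
    # The response should share at least some content words with the question.
    stopwords = {"the", "is", "a", "an", "of", "to", "what", "how", "it", "my"}
    q_words = {w.strip("?.,").lower() for w in question.split()} - stopwords
    r_words = {w.strip("?.,").lower() for w in response.split()}
    return len(q_words & r_words) >= min_overlap

def check_grounded(response: str, sources: list[str], threshold: float = 0.6) -> bool:
    # Most of the response's words should appear in the retrieved sources.
    source_text = " ".join(sources).lower()
    words = [w.strip("?.,").lower() for w in response.split()]
    hits = sum(1 for w in words if w in source_text)
    return not words or hits / len(words) >= threshold

def run_guardrails(question: str, response: str, sources: list[str]) -> list[str]:
    checks = {
        "relevance": check_relevance(question, response),
        "groundedness": check_grounded(response, sources),
    }
    # Any failed check flags the response for human review.
    return [name for name, ok in checks.items() if not ok]

flags = run_guardrails(
    "What is the claim limit?",
    "The claim limit is $10,000 per incident.",
    ["Policy doc: the claim limit is $10,000 per incident, per policy year."],
)
print(flags)  # an empty list means the response passed all checks
```

An off-topic answer ("Our office hours are 9 to 5" in reply to a coverage question) fails both checks and gets routed to a reviewer instead of silently reaching the customer.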
3. Continuous improvement - closing the loop
Evaluation without action is just reporting. The real value comes when you feed what you learn back into the system.
A user flags an incorrect response. That becomes a new test case. A pattern of failures in a specific topic area tells you there's a gap in your knowledge base. A drop in quality after a model update tells you to roll back.
This is where AI systems go from fragile to robust. Not through one big launch, but through hundreds of small improvements driven by real data.
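One way to wire that loop, sketched with illustrative field names (the `reviewer_note` field and JSONL path are assumptions for this example, not a prescribed schema):

```python
# Closing the loop: a flagged production interaction becomes a regression
# test case appended to the offline test set. Field names are illustrative.
import json

def flag_to_test_case(interaction: dict, reviewer_note: str) -> dict:
    return {
        "question": interaction["question"],
        "bad_answer": interaction["response"],  # what the system got wrong
        "note": reviewer_note,                  # what a correct answer requires
        "source": "user_flag",
    }

def append_to_test_set(case: dict, path: str = "test_set.jsonl") -> None:
    # JSONL keeps the test set append-only and diff-friendly.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")

case = flag_to_test_case(
    {"question": "Is subsidence covered?", "response": "Yes, always."},
    "Correct answer must mention the exclusions in the policy schedule.",
)
append_to_test_set(case)
```

The mechanism is deliberately boring: every failure the system makes in production permanently raises the bar it has to clear before the next release.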
Evaluation in regulated industries
If you're in healthcare, finance, insurance, or any regulated space, evaluation isn't optional. It's a compliance requirement.
Regulators want to see audit trails. They want to know what data the system used to make a decision. They want evidence that the system is performing within acceptable bounds.
This is actually good news. Because the same evaluation infrastructure that satisfies a regulator also makes your system better. Traceability, monitoring, quality metrics - these aren't bureaucratic overhead. They're engineering best practices that happen to also tick compliance boxes.
The teams that treat evaluation as a core part of their AI system - not an afterthought - are the ones that ship with confidence and sleep at night.
I've seen this firsthand at Aviva, where building the evaluation infrastructure for AI systems in insurance wasn't just about meeting regulatory requirements. It was about knowing, with evidence, that the system was doing right by customers.
What to measure
The specifics depend on your use case. But here are the metrics I find myself coming back to across projects:
- Groundedness - is the response based on the source material, or is it making things up?
- Relevance - does the response actually answer the question that was asked?
- Completeness - does it cover what it needs to, or is it missing critical information?
- Boundary adherence - does it stay within its defined scope, or does it wander into topics it shouldn't?
- Consistency - does it give stable, reliable answers to similar questions over time?
- Safety - is the response free from toxic, harmful, or inappropriate content? Even if the underlying model is capable, your system needs guardrails that prevent it from generating responses that could cause harm - whether that's offensive language, dangerous advice, or biased outputs. This matters everywhere, but especially when your AI is customer-facing.
None of these are vanity metrics. Each one maps directly to a failure mode that can hurt your users or your business.
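One way to make that mapping explicit is to pair each metric with an acceptance threshold and the failure mode it guards against. The scores and thresholds below are made-up numbers for illustration; real values come from your judges and your risk tolerance:

```python
# Each metric gets an acceptance threshold and the failure mode it guards
# against. Thresholds and sample scores are illustrative, not recommendations.

THRESHOLDS = {
    "groundedness":       (0.90, "hallucination"),
    "relevance":          (0.80, "off-topic answers"),
    "completeness":       (0.80, "missing critical information"),
    "boundary_adherence": (0.95, "out-of-scope responses"),
    "consistency":        (0.85, "unstable answers over time"),
    "safety":             (0.99, "harmful or inappropriate content"),
}

def report(scores: dict[str, float]) -> list[str]:
    alerts = []
    for metric, (threshold, failure_mode) in THRESHOLDS.items():
        score = scores.get(metric, 0.0)  # a missing metric counts as failing
        if score < threshold:
            alerts.append(f"{metric} {score:.2f} < {threshold} -> risk: {failure_mode}")
    return alerts

# Sample run: one metric below its bound triggers exactly one alert.
sample = {"groundedness": 0.93, "relevance": 0.72, "completeness": 0.88,
          "boundary_adherence": 0.97, "consistency": 0.90, "safety": 1.00}
for alert in report(sample):
    print(alert)
```

Framing it this way keeps the dashboard honest: every number on it answers "what goes wrong for the user if this drops?", which is also the question a regulator will ask.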
Start before you build
The best time to think about evaluation is before you write a single line of code. If you can't define what "good" looks like for your AI system, you're not ready to build it.
That doesn't mean you need a perfect evaluation framework on day one. It means you need to ask the right questions early. What are the critical failure modes? What would a bad answer look like? How will we know if quality is declining?
These questions shape everything downstream - your architecture, your data strategy, your monitoring approach.
If you can't measure it, you can't trust it. And if your users can't trust it, it doesn't matter how impressive the demo was.