The Demo Works. Now What?

Building AI products has never been easier. The prototype is up in days, the capability is impressive - the potential feels endless. It's an exciting time.

But the demo is easy. Moving to production is difficult. And amongst the excitement we can forget hard-won lessons from other disciplines that still apply. Principles that are the difference between an impressive demo and a system you can trust in production.

It all starts with evaluations.

The Principles We're Forgetting

Test-driven development exists because software teams discovered that writing tests should come first.

The V-model in systems engineering puts verification and validation at the centre.

Every mature engineering discipline has some version of the same insight: define what success looks like first, then measure against it.

The fact that LLMs are probabilistic doesn't detract from this. In fact it matters more than ever. You cannot inspect every line of code and know what the system will do. The only way to know whether your AI is working is to measure it - systematically, continuously, against criteria you've defined.

At Rolls-Royce, I worked with a principle called ALARP - As Low As Reasonably Practicable. The point was not to eliminate all risk. It was to be deliberate and documented about what level of risk was acceptable. You could not just say the system was safe. You had to be able to show your working.

With AI that principle is Evaluation-Driven Development.

Evaluation-Driven Development (EDD)

EDD takes the same principle and applies it to AI products: define what success looks like first, then measure against it.

This often requires specialist knowledge and collaboration across departments - understanding the domain, the users, the edge cases, the acceptance criteria and any regulations.

A frame I like to use is inputs and metrics.

What inputs do you need to test the system against - what are your users likely to ask? How will they try to break the system?

And how does the system need to perform given those inputs - what must it do and never do. This often includes operationalising abstract terms such as hallucinations and accuracy - into measurable criteria.

Think of it as a product design exercise with an evaluation suite as the output. A suite you can directly measure your performance against going forward.

It's often painful, the temptation is to build and see how it goes. But taking a step back and going through this exercise first pays off in multiple ways.

How this pays back

Once the evaluation suite exists, three things become straightforward.

Performance is monitored

Now you know your metrics, monitoring performance is easy. If a stakeholder needs to know how the system is performing - you can show them. Important for trust and compliance.

Regression testing is enabled

AI capability is always improving. When a new model is released that could save you latency or cost, you already have your test suite - run it - still performing? Release it. The same with introducing new knowledge or tweaking prompts. It can all be tested quickly within CI/CD.

Self-learning can begin

There's a reason AI Systems got so good at coding. Coding is verifiable - the system compiles, it passes tests, the front-end renders. This allowed the systems to self-improve via reinforcement learning.

Once you have the evaluation suite in place, you can set up the same, a flywheel of improvement.

Every query that gets evaluated tells you whether you're hitting or missing your metrics - where your system is failing - and what needs to be fixed - acting as a direct input into the next iteration.

Get this right and it gets you well ahead of your competition and makes it hard for them to catch up.

I learnt this the hard way

I've learnt the hard way what happens when you get absorbed in the excitement of building AI products without the up-front work.

I was part of the Gen AI Innovation Team at a major FTSE100 company that was tasked with building a customer service agent. The capability of the technology was there, so we were excited and moved quickly. We built the demo. It looked impressive. We thought RAGAS was the answer to evaluations. What we soon discovered was without the domain context of what good looked like, we had no way of evaluating whether it would be successful in production.

At that point there were a team of AI Engineers on the payroll, building, but unsure of the best direction. What we really needed was the domain expertise to define the success criteria up front - the inputs, the metrics, and how to comply with the regulations. We went back to the drawing board and ended up re-architecting. If we'd gone for evaluations-first we'd have saved months of time and effort.

I have also seen it from the other side - working with companies who had shipped AI products to production and had no visibility of how well they were performing. They were relying on customer complaints as their quality signal, which means the evaluation was happening in production, on real customers, with real consequences. And they were unsure how any releases would impact performance.

These experiences are what shaped my belief in the evaluation-first approach. Not because it sounds good in a blog post, but because the alternative is painful and expensive.

It's easier than ever to build. Set up your evaluations first, and it compounds

The barrier to getting an AI prototype running has never been lower. But moving it to a production product you can depend on is where things get tricky.

The hard part is not the implementation. The hard part is seeing through the excitement and slowing down long enough to do the upfront work: uncovering the processes, and defining what good looks like for your specific product.

There is no off-the-shelf solution for that. Ninety percent of the work is in the discovery - understanding the domain, the users, the edge cases, the regulations and the acceptance criteria. That is bespoke work, and it is the most valuable work in the entire process.