The challenge
A leading UK insurer needed to build their first customer-facing AI agent. The goal was straightforward: let customers ask questions about their policies and get accurate, helpful answers instantly.
The problem was that off-the-shelf RAG approaches weren't meeting the accuracy standards their regulators required. Insurance documentation is dense, interconnected, and full of conditional logic. A standard vector-search retrieval pipeline was pulling in the wrong context, missing key conditions, and producing answers that were technically plausible but factually wrong.
In a regulated industry, "mostly right" isn't good enough.
The approach
I led a team of three engineers to design and build the system from the ground up. The key architectural decisions were:
GraphRAG with a custom ontology. Rather than chunking documents and throwing them into a vector database, we modelled the insurance domain as a knowledge graph. Policy types, coverage conditions, exclusions, and their relationships were all explicitly represented. This meant the retrieval system could follow logical chains rather than just finding semantically similar text.
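To make the idea concrete, here is a minimal sketch of that kind of explicit representation. The node types, relationship labels, and policy content below are all illustrative inventions, not the insurer's actual ontology; the point is that retrieval walks labelled edges (coverage, exclusion, condition) rather than ranking text chunks by similarity.

```python
from collections import defaultdict

# Hypothetical mini-ontology: node ids map to (type, text); edges carry a
# relationship label so retrieval can follow logical chains, not just
# semantically similar text.
nodes = {
    "home_policy": ("PolicyType", "Standard home insurance policy"),
    "flood_cover": ("Coverage",   "Covers flood damage to the building"),
    "flood_excl":  ("Exclusion",  "Not covered if the property sits in a designated flood zone"),
    "flood_cond":  ("Condition",  "Claims require a flood-risk survey from the last 5 years"),
}

edges = defaultdict(list)  # source -> [(relation, target)]
edges["home_policy"].append(("HAS_COVERAGE", "flood_cover"))
edges["flood_cover"].append(("HAS_EXCLUSION", "flood_excl"))
edges["flood_cover"].append(("SUBJECT_TO", "flood_cond"))

def retrieve_context(start: str, max_depth: int = 3) -> list[str]:
    """Walk outward from a matched node, collecting every exclusion and
    condition reachable through labelled edges."""
    seen, stack, context = {start}, [(start, 0)], []
    while stack:
        node, depth = stack.pop()
        node_type, text = nodes[node]
        context.append(f"[{node_type}] {text}")
        if depth < max_depth:
            for _relation, target in edges[node]:
                if target not in seen:
                    seen.add(target)
                    stack.append((target, depth + 1))
    return context

# A query matched to the flood coverage node pulls in its exclusion
# and its condition as well, so the generator sees the full logic.
print("\n".join(retrieve_context("flood_cover")))
```

A plain vector search over chunked documents can return the coverage paragraph without the exclusion that sits three pages away; the graph traversal makes that link impossible to miss.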
Multi-stage agentic workflows. Instead of a single retrieve-and-generate step, the system used multiple stages: understanding the customer's intent, determining what information it needed, retrieving from the graph, validating the retrieved context, and then generating a response. Each stage had its own quality checks.
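The staged pipeline with per-stage quality gates can be sketched as follows. Every stage body here is a stub standing in for an LLM or graph call, and the stage and field names are illustrative, not the production code; the shape to note is that each stage has its own check and the pipeline fails closed rather than generating from bad context.

```python
# Each stage transforms a shared state dict and is paired with a quality
# gate. Stage bodies are placeholders for LLM / graph calls.

def understand_intent(state):
    state["intent"] = "coverage_question"        # stub for an LLM classifier
    return state

def plan_information_needs(state):
    state["needs"] = ["coverage", "exclusions"]  # stub for a planning step
    return state

def retrieve_from_graph(state):
    state["context"] = ["[Coverage] flood damage", "[Exclusion] flood zone"]  # stub
    return state

def validate_context(state):
    # Gate: did retrieval cover every identified information need?
    state["context_ok"] = len(state["context"]) >= len(state["needs"])
    return state

def generate_response(state):
    state["answer"] = "Based on your policy..."  # stub for generation
    return state

PIPELINE = [
    (understand_intent,      lambda s: bool(s["intent"])),
    (plan_information_needs, lambda s: bool(s["needs"])),
    (retrieve_from_graph,    lambda s: bool(s["context"])),
    (validate_context,       lambda s: s["context_ok"]),
    (generate_response,      lambda s: bool(s["answer"])),
]

def run(question: str) -> dict:
    state = {"question": question}
    for stage, quality_check in PIPELINE:
        state = stage(state)
        if not quality_check(state):
            # Fail closed: escalate rather than answer on bad context.
            state["answer"] = "I can't answer that reliably; routing to a human agent."
            break
    return state
```

In a regulated setting the interesting branch is the failure one: a gate that trips sends the conversation to a human instead of letting the model improvise.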
Evaluation infrastructure from day one. Before we wrote the first line of production code, we built the evaluation framework. Synthetic conversations generated from real policy documents. LLM-as-Judge scoring for groundedness, relevance, and completeness. RAGAS metrics for retrieval quality. This meant every architectural decision could be validated against real data.
The evaluation framework
This was the part that made the difference. We built three layers of evaluation:
- Synthetic test sets - hundreds of question-answer pairs generated from policy documents, reviewed by domain experts, covering standard queries, edge cases, and known failure modes
- LLM-as-Judge - automated quality scoring on every response, checking groundedness (is the answer supported by the source material?), relevance (does it answer what was asked?), and safety (does it stay within bounds?)
- Retrieval metrics - RAGAS-based evaluation of the graph retrieval, ensuring the right nodes and relationships were being pulled for each query type
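The LLM-as-Judge layer can be illustrated with a small scoring harness. The prompt wording, criteria names, and pass threshold below are assumptions for the sketch, and the judge call is stubbed out; in a real deployment it would be a request to a strong model, with the parsed scores logged per response.

```python
import json

# Illustrative judge prompt; the real rubric and wording are assumptions.
JUDGE_PROMPT = """You are evaluating an insurance assistant's answer.
Score each criterion from 1 to 5 and reply with JSON only:
{{"groundedness": _, "relevance": _, "safety": _}}

Source material:
{context}

Question: {question}
Answer: {answer}"""

def call_judge_model(prompt: str) -> str:
    # Stub: stands in for a call to a strong LLM acting as judge.
    return json.dumps({"groundedness": 5, "relevance": 4, "safety": 5})

def judge(question: str, answer: str, context: str, threshold: int = 4) -> dict:
    """Score one response; a response passes only if every criterion
    clears the threshold."""
    prompt = JUDGE_PROMPT.format(context=context, question=question, answer=answer)
    scores = json.loads(call_judge_model(prompt))
    scores["passed"] = all(v >= threshold for v in scores.values())
    return scores
```

Run over the synthetic test set, a harness like this turns "the answers look good" into per-criterion numbers that can be tracked per release and handed to a regulator.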
This evaluation infrastructure wasn't just for development. It ran continuously in production, giving the team real-time visibility into system quality and providing the audit trail regulators required.
The outcome
The system moved from concept to production with accuracy that passed regulatory scrutiny. The GraphRAG approach significantly outperformed the standard RAG baseline the team had previously attempted, particularly on complex queries involving conditional logic and policy exclusions.
More importantly, the evaluation infrastructure gave the business confidence to put the system in front of customers. They could see, with evidence, that the system was performing within acceptable bounds - and they had the monitoring in place to catch any degradation early.
The evaluation framework wasn't overhead - it was the thing that made the project shippable. Without it, this would have been another AI prototype that never left the demo environment.