The challenge

A fraud detection startup needed their AI to produce consistent, deterministic results. Their clients were making real financial decisions based on the system's outputs, so trust depended on the AI being reliable, explainable, and consistent from run to run.

The core problem wasn't performance - it was structure. Prompt logic was tangled with orchestration code, outputs were unstructured and inconsistent between runs, and client-specific requirements were handled through one-off code branches. Every new client meant more complexity, not more capability.

The approach

Structured outputs for determinism. The key insight was constraining each module to answer a single, focused question - yes or no. Rather than asking an LLM to produce a broad fraud assessment in free text, each step made one narrow decision with a structured output. This massively increased determinism: a binary decision with a confidence score and evidence chain is far more consistent across runs than an open-ended analysis. It also made the system auditable - every decision was traceable to a specific question, a specific answer, and the evidence behind it.
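
As a minimal sketch of what one such structured output might look like (the names `ModuleDecision` and `parse_decision` are illustrative, not from the actual system):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModuleDecision:
    """Structured output of one focused fraud-check module."""
    question: str        # the single yes/no question this module answers
    answer: bool         # the narrow decision itself
    confidence: float    # 0.0-1.0, as reported by the model
    evidence: list[str] = field(default_factory=list)  # evidence chain, for auditing

def parse_decision(question: str, llm_json: dict) -> ModuleDecision:
    """Validate a model's JSON response against the fixed schema.

    Free text is rejected outright: the model must return exactly
    a boolean answer, a confidence, and an evidence list.
    """
    answer = llm_json["answer"]
    if not isinstance(answer, bool):
        raise ValueError(f"expected a boolean answer, got {answer!r}")
    confidence = float(llm_json["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return ModuleDecision(question, answer, confidence,
                          list(llm_json.get("evidence", [])))
```

Validating at the boundary like this is what makes the output auditable: a malformed response fails loudly instead of leaking an open-ended answer into the pipeline.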

Modular, composable architecture. Each of those focused decisions lived in its own module - a self-contained unit responsible for a single analysis strategy. New fraud detection strategies could be added as new modules without touching existing logic. The orchestration layer composed these modules into pipelines, aggregating their individual yes/no signals into an overall fraud assessment. This made the system easy to reason about, test in isolation, and extend.
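
The composition idea can be sketched like this, assuming each module is just a callable that emits a named, weighted yes/no signal (the example checks and weights here are invented for illustration):

```python
from typing import Callable

# A module maps a case record to a (name, flagged, weight) signal.
Module = Callable[[dict], tuple[str, bool, float]]

def velocity_check(case: dict) -> tuple[str, bool, float]:
    # Flags an unusually high transaction rate.
    return ("velocity", case.get("tx_per_hour", 0) > 20, 1.0)

def geo_mismatch_check(case: dict) -> tuple[str, bool, float]:
    # Flags a mismatch between billing country and IP country.
    return ("geo_mismatch", case.get("billing_country") != case.get("ip_country"), 0.5)

def run_pipeline(case: dict, modules: list[Module], threshold: float = 1.0) -> dict:
    """Run each module in isolation, then aggregate the weighted yes/no signals."""
    signals = [m(case) for m in modules]
    score = sum(weight for _, flagged, weight in signals if flagged)
    return {
        "signals": {name: flagged for name, flagged, _ in signals},
        "score": score,
        "fraud_suspected": score >= threshold,
    }
```

Because each module only sees the case record and only emits one signal, a new strategy is just one more callable in the list - nothing existing changes.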

Client-specific configurability. Each client had different risk profiles, regulatory requirements, and reporting needs. Rather than forking code per client, I built a configuration layer that controlled which modules ran, what thresholds applied, and how outputs were formatted - all without code changes. A base fraud analysis pipeline could be extended with industry-specific context, custom risk weightings, and client-specific output requirements. Onboarding a new client became a configuration exercise, not an engineering project.
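
A sketch of the idea, with hypothetical client names, modules, and config keys standing in for the real ones:

```python
# Per-client configuration: which checks run, what threshold applies,
# and how results are reported - all data, no code branches.
CLIENT_CONFIGS = {
    "acme_bank": {
        "modules": ["velocity", "geo_mismatch", "chargeback_history"],
        "threshold": 1.5,        # stricter risk appetite
        "report_format": "pdf",
    },
    "shop_co": {
        "modules": ["velocity", "geo_mismatch"],
        "threshold": 2.0,
        "report_format": "json",
    },
}

# Registry of available checks; simple predicates here for illustration.
MODULE_REGISTRY = {
    "velocity": lambda case: case.get("tx_per_hour", 0) > 20,
    "geo_mismatch": lambda case: case.get("billing_country") != case.get("ip_country"),
    "chargeback_history": lambda case: case.get("chargebacks", 0) > 2,
}

def build_pipeline(client_id: str):
    """Assemble a client's pipeline from configuration alone."""
    cfg = CLIENT_CONFIGS[client_id]
    modules = [MODULE_REGISTRY[name] for name in cfg["modules"]]
    return modules, cfg["threshold"], cfg["report_format"]
```

Onboarding a new client under this scheme means adding one entry to `CLIENT_CONFIGS`, not a new code path.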

Evaluation frameworks over golden datasets. The team had examples of known fraud cases and known clean cases. I built evaluation infrastructure that ran the system against these golden datasets on every change, tracking whether accuracy, precision, and consistency held up. This turned "does the AI still work?" from a subjective question into a measurable one.
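
The core of such a harness is small. A sketch, assuming `predict` is the end-to-end system reduced to a boolean verdict and the golden dataset is a list of labeled cases:

```python
def evaluate(predict, golden_cases):
    """Score the system against labeled golden cases.

    predict: callable mapping a case record to a boolean fraud verdict.
    golden_cases: list of (case, is_fraud) pairs with known ground truth.
    """
    tp = fp = tn = fn = 0
    for case, label in golden_cases:
        pred = predict(case)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 1.0,
    }

def consistency(predict, case, runs=5):
    """Fraction of repeated runs agreeing with the first - 1.0 is fully deterministic."""
    first = predict(case)
    return sum(predict(case) == first for _ in range(runs)) / runs
```

Wired into CI, these numbers turn every change into a pass/fail gate: if accuracy, precision, or consistency drops against the golden set, the change doesn't ship.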

Pre-sales and client work

Beyond the engineering, I worked directly with prospective clients during pre-sales to configure and demonstrate the system. This meant understanding their specific fraud patterns, setting up tailored analysis pipelines, and running live demos against their data. The tight feedback loop between client needs and engineering decisions was critical to building the right product.

The outcome

The modular architecture meant new fraud detection strategies could be added in hours instead of days. Client onboarding went from a multi-week engineering effort to a configuration exercise. Constraining each module to a single yes/no decision gave the system the determinism that financial clients demanded - consistent, auditable results they could trust and integrate directly into their workflows.

The evaluation framework gave the team confidence to ship changes quickly - they could see, with data, whether a change improved or degraded the system before it reached clients.

When your clients are making real financial decisions based on your AI's output, open-ended analysis isn't enough. Constrain each step to one decision, structure the output, and build the evaluation infrastructure to prove it's consistent at scale.

Need structured, reliable AI your clients can trust?

Book a free 30-minute call to talk through your challenge.

Book a consultation