#70 - You might be testing your solutions blindly

Every time you deploy a new solution - a rule, a workflow, a model, or an AI agent - you’re making a bet.

You’re betting that the logic you designed will work exactly the way you planned, on live traffic, under real conditions.

But often it doesn’t.

How come? Well, simply put, testing properly is easier said than done.

Sometimes you don’t have relevant data to test on, sometimes it takes too much effort and time, and sometimes it’s just not technically possible.

That’s why testing is one of the most underrated processes in risk management. 

Most teams treat it as a formality - a quick sanity check before pressing deploy. The result? Costly rollbacks, false-positive blowups, and fraud spikes that could have been caught earlier.

So how do you test?

There are three distinct methods for testing solutions before they impact real users: backtesting, Champion-Challenger (A/B testing), and shadow mode. Each serves a fundamentally different purpose, and picking the wrong one is almost as bad as skipping the test entirely.

In today’s issue I’d like to go over the different methods, outline their pros and cons, and suggest when it is best to use them.

Side note: I recently published a detailed breakdown of how to structure a safe solution release process on Sardine’s blog, including a recommended six-step framework. Worth reading alongside this one.

Backtesting

Backtesting runs your solution against a historical, labeled dataset before it touches production in any way.

You simulate what would have happened if the solution had been live in the past - how much fraud it would have caught, how many false positives it would have created, and what the expected precision looks like.
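To make this concrete, here is a minimal sketch of a backtest loop in Python. The rule, field names, and sample records are all hypothetical - a real backtest would replay your actual labeled history:

```python
# Minimal backtest sketch: replay a candidate rule over labeled history
# and measure what it *would* have flagged. All field names are illustrative.

def candidate_rule(txn):
    """Hypothetical rule: flag large transactions from very new accounts."""
    return txn["amount"] > 500 and txn["account_age_days"] < 7

def backtest(rule, history):
    tp = fp = fn = 0
    for txn in history:
        flagged = rule(txn)
        is_fraud = txn["label_fraud"]  # confirmed label, often weeks after the fact
        if flagged and is_fraud:
            tp += 1          # fraud the rule would have caught
        elif flagged and not is_fraud:
            fp += 1          # good customer the rule would have hit
        elif not flagged and is_fraud:
            fn += 1          # fraud the rule would have missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"caught": tp, "false_positives": fp, "missed": fn,
            "precision": precision, "recall": recall}

# Tiny illustrative dataset (labels already resolved).
history = [
    {"amount": 900, "account_age_days": 2,   "label_fraud": True},
    {"amount": 650, "account_age_days": 3,   "label_fraud": False},
    {"amount": 40,  "account_age_days": 400, "label_fraud": False},
    {"amount": 700, "account_age_days": 1,   "label_fraud": True},
    {"amount": 30,  "account_age_days": 5,   "label_fraud": True},
]
result = backtest(candidate_rule, history)
print(result)
```

Because the loop is pure computation over stored data, you can run dozens of rule variants in minutes - which is exactly where backtesting's speed advantage comes from.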

Pros and cons:

The main advantage of backtesting is that it’s as close to a production environment as you can get without actually affecting production. 

The speed is also a major plus - you can iterate rapidly, test multiple variants back-to-back, and cut the bad ones before writing a single line of deployment code.

But backtesting comes with a built-in caveat: fraud is a lagging indicator - labels take time to confirm, disputes take weeks to resolve, and by the time data is labeled, it’s already weeks or months old.

That means backtesting can’t reflect what’s happening right now. New fraud patterns that emerged last week won’t show up in it.

Another limitation is that backtesting is inherently isolated. It tests one solution in a vacuum, without any visibility into how it interacts with the rest of your system - your other rules, models, or workflows running in parallel.

When to use:

Backtesting works best for limited-scope solutions where the population and pattern are well-defined and historically stable.

Think behavioral rules - targeted, narrow, and designed to address a specific known pattern. 

It’s also the right tool for any fast iterative process where speed of experimentation matters more than capturing live signal.

Champion-Challenger

Champion-Challenger runs two versions of a solution simultaneously on live traffic: the current version (Champion) and the new one (Challenger). 

Each incoming event is routed to one or the other, and you compare outcomes over time.
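One common way to implement the routing is to hash a stable identifier, so each user lands consistently in the same arm across all their events - which keeps outcome attribution clean. A minimal sketch (the `user_id` key and the default 10% challenger share are illustrative):

```python
import hashlib

def route(user_id: str, challenger_pct: int = 10) -> str:
    """Deterministically assign a user to Champion or Challenger.

    Hashing the user id (rather than picking randomly per event) keeps
    every event from the same user in one arm, so outcomes can be
    attributed to a single version of the solution.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

# The same user always gets the same arm:
print(route("user-42"), route("user-42"))
```

Raising `challenger_pct` over time gives you the gradual ramp-up discussed below, without reshuffling users who are already in an arm out of the challenger group entirely.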

Pros and cons:

Unlike backtesting, Champion-Challenger captures system-level effects - for example, second-order KPIs like customer lifetime value (LTV) that only emerge once real decisions are made.

You get genuine performance data that reflects the actual impact of the decision, not just a simulation of it.

But it comes at a cost - Champion-Challenger is slow by nature. 

Meaningful results require enough volume and time to reach statistical significance, and that window can span weeks, sometimes more. 

There’s also a real short-term risk: if the Challenger has a flaw, it will cause real-world damage before you catch it.

Finally, Champion-Challenger only works well if it’s built into the system you’re running it on.

Without proper tooling to track split routing, attribute outcomes, and deduplicate results, interpreting the data becomes unreliable.

When to use:

Champion-Challenger is best suited for high-stakes system changes: replacing a fraud model, introducing an alternative workflow, or changing a strategy that affects large or broad populations. 

It’s also the right method when you specifically need to measure second-order effects - things like churn, approval rate shifts, or LTV changes - that only emerge from real-life decisions being made.

Side note: If you’re worried about introducing risk, remember that you don’t have to split the two flows 50-50. Your challenger can start by impacting just 10% of the population, and you can ramp it up over time as confidence grows.

Shadow Mode

Shadow mode runs your new solution in parallel to the live system, processing every event and logging every decision - except that those decisions have zero real-world effect.

It lets you observe performance within your normal production stack, without impacting customers.
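A sketch of what that wrapper can look like - `handle_event`, the two models, and the field names are all hypothetical stand-ins for your own stack:

```python
def handle_event(txn, live_model, shadow_model, shadow_log):
    """Run both models on the same event; only the live decision takes effect."""
    live_decision = live_model(txn)       # this decision reaches the customer
    shadow_decision = shadow_model(txn)   # logged only - zero real-world effect
    shadow_log.append({
        "txn_id": txn["id"],
        "live": live_decision,
        "shadow": shadow_decision,
        "disagree": live_decision != shadow_decision,
    })
    return live_decision

# Illustrative models: the live system approves everything,
# while the shadow candidate would decline large transactions.
live = lambda txn: "approve"
shadow = lambda txn: "decline" if txn["amount"] > 100 else "approve"

shadow_log = []
handle_event({"id": "t1", "amount": 250}, live, shadow, shadow_log)
handle_event({"id": "t2", "amount": 40}, live, shadow, shadow_log)
```

Reviewing the `disagree` entries against later-confirmed outcomes is how you judge the shadow solution before promoting it - and it also makes the blind spot below visible: events the live system declines never generate outcomes for the shadow log to learn from.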

Pros and cons:

Shadow mode has a unique advantage: it’s production-level testing without creating new exposure. You see exactly how your solution behaves on real traffic, under real conditions, without any risk of adverse impact. 

And because it’s relatively simple to monitor - you’re just reviewing logged decisions against live outcomes - it’s more accessible than Champion-Challenger.

But shadow mode also has a fundamental blind spot: it only reliably observes traffic that is currently approved.

If you’re testing a solution designed to approve populations that are currently being declined, you won’t see them. Their loss rates, behavior, and conversion patterns are entirely invisible.

As with Champion-Challenger, data collection is also slower than backtesting. If the pattern you’re testing for is rare, you may be waiting weeks to accumulate enough signal to draw conclusions.

When to use:

Shadow mode is best used as a mandatory validation gate before going fully live.

This is especially true for high-risk solutions where the cost of an adverse effect is high and your backtesting data is insufficient or outdated. 

It’s also the right tool when you’re testing against new fraud patterns or new populations where historical data simply doesn’t exist.

The bottom line

These three testing methods aren’t necessarily alternatives to each other. They can also be different stages of the same process:

  • Backtest to design and validate fast. 

  • Shadow mode to pressure-test against live traffic before you commit. 

  • Champion-Challenger when you’re ready to measure real-world impact head-to-head.

Layering testing methods might seem like overkill, but when you roll out strategic changes it helps you balance exposure against time-to-decision.

The teams that manage solution testing well treat it as a process with defined checkpoints, not a single box to tick before pressing deploy.

Which of these methods does your team currently rely on, and where do you feel the gaps are? Hit the reply button and let me know!

In the meantime, that’s all for this week.

See you next Saturday.


P.S. If you feel like you're running out of time and need some expert advice on getting your fraud strategy on track, here's how I can help you:

Free Discovery Call - Unsure where to start or have a specific need? Schedule a 15-min call with me to assess if and how I can be of value.
​Schedule a Discovery Call Now »

Consultation Call - Need expert advice on fraud? Meet with me for a 1-hour consultation call to gain the clarity you need. Guaranteed.
​Book a Consultation Call Now »

Fraud Strategy Action Plan - Is your Fintech struggling with balancing fraud prevention and growth? Are you thinking about adding new fraud vendors or even offering your own fraud product? Sign up for this 2-week program to get your tailored, high-ROI fraud strategy action plan so that you know exactly what to do next.
Sign-up Now »

 

Enjoyed this and want to read more? Sign up to my newsletter to get fresh, practical insights weekly!
