#72 - The self-learning fraud model is a lie

“Our fraud detection algorithm learns by itself from every new event!”

How many times have you heard a vendor claim that? Countless times, I bet.

And how many times was it actually true? 

I can tell you - zero times. Guaranteed.

Want to know how I know? Today I’m going to cover why these claims are always false, how to spot it, and why it even matters.

Let’s talk about it.

The lie we all want to believe in

Ever since ML took the fraud detection space by storm, we've been taught that "more data means better performance."

From there, it's easy to make the jump and conclude that training the model "on the fly" - on the fresh data it just processed - is the obvious way to get more data.

Not only do we get more events in our dataset, but these are also more representative of how fraudsters behave today rather than a year ago.

Sounds good? 

It certainly does on paper, so how come this principle fails in reality? 

Because the real axiom we should adhere to has two more words to it: “more high-quality data means better performance.”

Here’s the thing:

Having hundreds of millions of events that I can feed to my model isn’t worth a dime if my data quality is low.

As the saying goes: “garbage in - garbage out.”

But then, you might ask, why would events processed in production by the same model be of lesser quality than the dataset the model was trained on?

The answer is simple: it’s not about missing or corrupted datapoints. 

It’s about the labels.

(Yes, last week’s topic is back, but this time from a whole new angle!)

Missing labels kill self-learning

Fraud labels don’t arrive in real time.

When your model makes a decision - approve or block - you don’t immediately know if it was the right one. 

Chargebacks usually arrive 30 to 90 days later, sometimes longer - depending on your dispute process and the payment method.

But the real problem isn't the lag itself - it's that you don't know how long the lag will be for each individual case.

Case in point - if I processed a payment 30 days ago and it didn’t get a chargeback yet, does it mean it’s a good payment or is it still undecided?

The answer is, of course, that we simply don’t know.

And if we don’t know, we cannot use this event to train the model.
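To make this concrete, here's a minimal sketch of what "safe to train on" could look like. The 90-day maturity window, the event fields, and the function name are all hypothetical - in practice the window depends on your dispute process and payment methods:

```python
from datetime import datetime, timedelta

# Assumption: chargebacks that haven't arrived within 90 days are
# unlikely to arrive at all. Tune this to your own dispute timelines.
LABEL_MATURITY = timedelta(days=90)

def mature_training_events(events, now=None):
    """Keep only events whose labels can be trusted.

    Each event is a dict with 'processed_at' (datetime) and
    'chargeback' (True once a chargeback has arrived, else False).
    """
    now = now or datetime.utcnow()
    mature = []
    for event in events:
        if event["chargeback"]:
            # Confirmed fraud - the label is final.
            mature.append({**event, "label": 1})
        elif now - event["processed_at"] >= LABEL_MATURITY:
            # Old enough that "no chargeback yet" likely means good.
            mature.append({**event, "label": 0})
        # Otherwise: still undecided - excluded from training.
    return mature
```

Note what this implies: every "good" label you train on is at least 90 days old, so a truly self-learning loop on fresh events is impossible by construction.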

But then you might say:

“OK, we can’t use the fresh, non-fraud events yet, but what about events that did get labeled as fraud? Wouldn’t they help teach the algorithm how fraud evolves?”

It’s not that simple.

Partial labels are even worse

There are two issues with feeding only confirmed fraud events to your algorithm.

The first one is pretty straightforward - as you feed only bad examples to the algorithm, you slowly change the baseline fraud rate it perceives as “normal”.

To give an example, say you have a fraud rate of 0.3% and you trained your model on a dataset with a similar composition.

If you keep feeding only confirmed fraud cases into the dataset, you quickly push its fraud rate up to, say, 1%.

This would train the model to flag fraud more aggressively, while your live population hasn't really changed. 

The result? Higher false positive rate.
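The arithmetic behind that drift is worth seeing. This is a back-of-the-envelope sketch with made-up volumes - only the 0.3% starting rate comes from the example above:

```python
# Illustrative numbers only: start from a 0.3% base rate and watch what
# appending confirmed-fraud-only batches does to the dataset.
def fraud_rate(total, fraud):
    return fraud / total

total, fraud = 1_000_000, 3_000                     # 0.3% base rate
print(f"before: {fraud_rate(total, fraud):.2%}")    # before: 0.30%

# Append a batch of confirmed fraud cases with no good examples.
new_fraud = 7_100
total += new_fraud
fraud += new_fraud
print(f"after:  {fraud_rate(total, fraud):.2%}")    # after:  1.00%
```

A relatively small batch of fraud-only events more than triples the fraud rate the model sees, even though the live population it scores is still at 0.3%.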

But there’s another, more subtle problem we create when we do that. To understand it, let’s take a step back - what are we expecting to gain by feeding the model with fresh, confirmed fraud cases?

Supposedly, we want it to learn how new fraud attacks behave.

But if we see new patterns in these cases, without feeding the model good cases as well, we’ll never train it to effectively split these patterns from good customers who show the same behavior.

The result? Once more - higher false positive rate.

Why should you even care?

To be blunt - if everyone’s feeding you the same bullshit story, what does it really matter? You could just ignore such claims and move on.

But here’s the thing:

When you’re promised “self-learning” algorithms and it’s a baseless claim, you never get to talk about practical things that do matter. 

For example, “how frequently do you retrain your model?”

If the answer is “we don’t need to, it self-learns”, I would be concerned that in reality the model is never retrained.

And what is the result of not frequently retraining the model? You guessed it - a higher false positive rate.

To avoid this risk, here’s what you should ask when you hear such a claim from a vendor: “Can you explain to me how you feed fresh events to your models without fraud labels?”

Unless you hear a convincing story that shows how they managed to solve the challenges I outlined above - don’t buy it.

Side note: Vendors may try to wiggle their way out by claiming their model is unsupervised and doesn’t require labels. However, in my experience such models are rarely a vendor’s main solution outside of specific features, and for good reason.

If the vendor doesn’t give you convincing arguments for their claims, it’s best to initiate the “how frequently do you retrain the model?” conversation directly.

It might be an awkward one, but it’s better to have an awkward conversation than no conversation when it comes to your false positive rate.

Oh, and in the unlikely case a vendor manages to convince you their model is self-learning, you may want to read this before getting too excited.

The bottom line

"Self-learning" is one of the most abused terms in fraud vendor marketing.

The label lag means you can never be sure when your data is actually ready to train on, and feeding in only confirmed fraud skews the model's perception of what's normal.

None of that is self-learning. Just a very expensive way to fall further behind.

So the next time a vendor makes this claim, ask one simple question: "How do you feed fresh events to your model without fraud labels?"

If the answer is convincing, great. If it's not, push on retrain frequency instead. That conversation will tell you everything.

Have you ever pushed a vendor on this and got a compelling answer? Hit the reply button - I’d genuinely love to hear what a good response to this question sounds like.

In the meantime, that’s all for this week.

See you next Saturday.


P.S. If you feel like you're running out of time and need some expert advice with getting your fraud strategy on track, here's how I can help you:

Free Discovery Call - Unsure where to start or have a specific need? Schedule a 15-min call with me to assess if and how I can be of value.
Schedule a Discovery Call Now »

Consultation Call - Need expert advice on fraud? Meet with me for a 1-hour consultation call to gain the clarity you need. Guaranteed.
Book a Consultation Call Now »

Fraud Strategy Action Plan - Is your Fintech struggling with balancing fraud prevention and growth? Are you thinking about adding new fraud vendors or even offering your own fraud product? Sign up for this 2-week program to get your tailored, high-ROI fraud strategy action plan so that you know exactly what to do next.
Sign-up Now »

 

Enjoyed this and want to read more? Sign up to my newsletter to get fresh, practical insights weekly!
