#42 - 5 ways to build fraud models without perfect data
The most frustrating part of running Fraugster wasn’t building the fraud detection algorithms.
It was dealing with prospects who desperately needed our help but couldn’t give us the data to help them.
"We love what you’ve built," they’d say on sales calls. "But we’re launching a completely new product line next month and have zero fraud history."
Or: "We’ve been running for two years, but honestly, we never thought to collect fraud labels. Can you still help us?"
Each time, I’d watch our sales cycle extend by 6-12 months while we figured out creative ways to bootstrap their models. The operational overhead was killing us.
That's when we realized: if we wanted to scale our business, we needed to solve the missing data problem once and for all.
Today, I want to share the analytical frameworks we developed.
Why? Because this challenge isn’t unique to fraud vendors. Every company trying to leverage machine learning for fraud prevention eventually hits this wall.
So let’s talk about it.
The Machine Learning Data Paradox
Here’s the thing about ML models: they’re data hungry.
Great, original insight, right? Bear with me for a second on this one.
A decent fraud detection model needs thousands—ideally tens of thousands—of labeled examples to perform well.
But fraud is a relatively rare event, and building that dataset naturally takes time you often don’t have.
The business wants fraud protection now. The model needs data from months ago.
This creates what I call the ML Data Paradox: the companies that need fraud protection most urgently are often the ones with the least useful training data.
So how do you shortcut the problem?
It depends on which data you’re actually missing.
Scenario 1: You have no data at all
This happens more often than you’d think.
You’re a startup launching your first product. You’re an established company entering a new market. You’re adding a payment method you’ve never supported before.
Whatever the reason, you’re starting from zero.
Option 1: Borrow patterns from similar flows
If you have any transaction data—even from different products or regions—you can often transfer fraud patterns.
But you want to make sure that your underlying model features support this.
Based in the US and entering LATAM? Make sure your identity features can parse Spanish and Portuguese.
Launching instant ACH payments? Prepare to collect fraud labels in different formats than credit card chargebacks.
Here’s the thing:
Using your “main” models on this new population might show a significant drop in performance. But if you make sure they consume the right new data, their performance will return to normal levels quite quickly.
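One cheap way to catch this before launch is a feature-coverage check: measure how often each model feature actually parses to a usable value on the new population. The helper below is a minimal sketch; the field names and the notion of "null" values are illustrative assumptions, not a standard API.

```python
def feature_coverage(rows, features):
    """For each feature, return the share of rows where it parses to a
    non-null value. Low coverage on a new market or product is a red
    flag before pointing an existing model at that traffic."""
    counts = {f: 0 for f in features}
    for row in rows:
        for f in features:
            # Treat missing, empty, and placeholder values as unparsed.
            if row.get(f) not in (None, "", "N/A"):
                counts[f] += 1
    return {f: counts[f] / len(rows) for f in features}
```

If, say, your name-parsing feature drops to 40% coverage on LATAM traffic, you know the model's performance dip comes from starved features, not from the fraud patterns themselves.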
Option 2: Synthetic data generation
This is the newest technique in our toolkit, and I’ll be honest, I have much less experience with it.
The idea is to take a small seed of real data and use LLMs to extrapolate it into a much larger dataset: different enough to stress-test your algorithm, but similar enough to remain realistic.
And that’s the catch: you need some real data to validate that your synthetic patterns actually match reality.
This approach requires caution.
I’ve seen teams get excited about this approach and generate thousands of "fraud" examples that look nothing like actual fraud in their environment.
Use it, but use it carefully.
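A full LLM pipeline is hard to show in a few lines, but the seed-and-extrapolate idea can be illustrated with a much simpler stand-in: jittering a handful of real fraud examples to produce synthetic look-alikes. To be clear, this is not LLM-based generation, just the same principle in miniature, and the field names and jitter ranges are made up for illustration.

```python
import random

def synthesize(seed_txns, n, rng=None):
    """Generate n synthetic fraud examples by perturbing real seeds.
    A crude stand-in for LLM-based generation: same principle
    (extrapolate from a small real sample), far less expressive."""
    rng = rng or random.Random(42)
    synthetic = []
    for _ in range(n):
        base = rng.choice(seed_txns)
        synthetic.append({
            # Numeric fields get small random jitter around the seed.
            "amount": round(base["amount"] * rng.uniform(0.8, 1.2), 2),
            "hour": (base["hour"] + rng.randint(-2, 2)) % 24,
            # Categorical fields are kept as-is in this sketch.
            "country": base["country"],
            "label": "fraud",
        })
    return synthetic
```

Even with this toy version, the validation warning above applies: before training on the output, check that its distributions match the real fraud you can verify.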
Option 3: Start with industry knowledge
Sometimes the best approach is the most obvious one: implement basic fraud rules based on proven heuristics, then use those early decisions to start building your training dataset.
Wondering where to start? Consider these core heuristics:
Velocity checks: block an excessive number of transactions from the same identifier (card, email, IP address, etc.)
Geographic restrictions: block geographies you’re not expecting to see in your user-base.
Exposure controls: block high-amount single transactions or accounts that spend too much too fast.
It’s not glamorous, and you may see high false-positive rates, but it lets you start monitoring and adapting to your business needs while keeping some controls in place in case you need to react to an ongoing attack.
Side note: Don’t forget to implement safety net rules, especially when launching a new product with little to no data.
Scenario 2: You have data but no labels
This one hits closer to home because it’s usually self-inflicted.
You’ve been processing transactions for months or years, but nobody thought to systematically track fraud outcomes.
Maybe you have some chargeback data, but it’s incomplete. Maybe you have customer complaints, but they’re not structured.
Your database is full of transaction data, but from a machine learning perspective, it's almost worthless.
Option 1: Soft labeling
This is where you get creative with proxy indicators.
Instead of waiting for confirmed fraud labels, use signals that correlate with fraud: rapid refunds, issuer declines with specific reason codes, customer service escalations, behavioral anomalies flagged by existing tools, etc.
At Fraugster, for example, we trained a model using only issuer decline patterns.
It reached only about 70% of the accuracy we’d get with proper chargeback labels, but we could deploy it immediately instead of waiting six months for enough chargebacks to accumulate.
And sometimes speed is more important than performance.
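One simple way to combine several proxy signals into a provisional label is a weighted noisy-OR: each signal independently raises the fraud score, and you threshold the result. This is a sketch; the signal names and weights below are hypothetical and would need calibrating against whatever confirmed fraud you do have.

```python
# Hypothetical proxy signals and weights -- calibrate against any
# confirmed fraud outcomes before trusting them.
PROXY_WEIGHTS = {
    "rapid_refund": 0.6,       # refund within hours of purchase
    "issuer_decline_59": 0.8,  # decline reason code hinting at fraud
    "cs_escalation": 0.4,      # customer-service fraud complaint
    "anomaly_flag": 0.3,       # flagged by an existing rules engine
}

def soft_label(signals, threshold=0.7):
    """Combine proxy signals into a fraud score via noisy-OR,
    then threshold it into a provisional training label."""
    p_legit = 1.0
    for s in signals:
        p_legit *= 1.0 - PROXY_WEIGHTS.get(s, 0.0)
    score = 1.0 - p_legit
    return score, score >= threshold
```

A transaction with both a rapid refund and an anomaly flag scores 1 − 0.4 × 0.7 = 0.72 and crosses the threshold, while either signal alone does not: that's the point of combining weak evidence instead of relying on any single proxy.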
Option 2: Network analysis
This is one of my favorite techniques because the results are usually obvious once you start looking for them.
Look for shared identifiers across your transaction data: email domains, device fingerprints, shipping addresses, IP addresses.
Large fraud rings often leave obvious footprints if you know how to search for them.
We once identified 2,000 fraudulent transactions for a client just by mapping connections between payments that shared suspicious email patterns. Instant training data.
Side note: This approach tends to catch unsophisticated fraud, so your model might struggle with more advanced attacks. But it’s still a great starting point.
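The mechanics of this mapping are simpler than they sound: link any two transactions that share an identifier, then propagate known-bad labels through each connected group. Here's a minimal sketch using union-find over plain dicts; the input shapes are illustrative assumptions.

```python
# Link transactions that share any identifier (email, device
# fingerprint, IP, shipping address) via union-find, then propagate
# known-bad labels through each connected component.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def fraud_ring_labels(txns, known_bad):
    """txns: dict txn_id -> set of identifier strings.
    known_bad: set of txn_ids confirmed fraudulent.
    Returns every txn_id connected to a known-bad transaction."""
    parent = {t: t for t in txns}
    owner = {}  # identifier -> first txn seen carrying it
    for txn_id, idents in txns.items():
        for ident in idents:
            if ident in owner:
                # Shared identifier: merge the two components.
                parent[find(parent, txn_id)] = find(parent, owner[ident])
            else:
                owner[ident] = txn_id
    bad_roots = {find(parent, t) for t in known_bad}
    return {t for t in txns if find(parent, t) in bad_roots}
```

One confirmed chargeback can label an entire ring this way, which is exactly how a handful of seed cases can turn into thousands of training examples.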
Your Biggest Constraint Isn’t Technical
Here’s what I learned after years of wrestling with this problem:
The biggest constraint isn’t finding clever ways to generate training data. It’s organizational flexibility.
Businesses want fraud protection immediately. Data scientists want perfect datasets. Product teams want to launch new features without having to care about fraud implications.
The companies that succeed are the ones that embrace "good enough" data to get started, then systematically improve their data collection as they scale.
What’s your experience been like? Are you currently stuck on a fraud project because of missing training data? Have you tried any creative approaches to bootstrap your models?
Hit the reply button and let me know. I’m always curious to hear how teams are solving this challenge.
In the meantime, that’s all for this week.
See you next Saturday.
P.S. If you feel like you're running out of time and need some expert advice on getting your fraud strategy on track, here's how I can help you:
Free Discovery Call - Unsure where to start or have a specific need? Schedule a 15-min call with me to assess if and how I can be of value.
Schedule a Discovery Call Now »
Consultation Call - Need expert advice on fraud? Meet with me for a 1-hour consultation call to gain the clarity you need. Guaranteed.
Book a Consultation Call Now »
Fraud Strategy Action Plan - Is your Fintech struggling with balancing fraud prevention and growth? Are you thinking about adding new fraud vendors or even offering your own fraud product? Sign up for this 2-week program to get your tailored, high-ROI fraud strategy action plan so that you know exactly what to do next.
Sign-up Now »
Enjoyed this and want to read more? Sign up to my newsletter to get fresh, practical insights weekly!