Fraud ML for a payments platform at 20M transactions a month
2024-02-18 / 11 min / payments / fraud / ml / production
Cutting fraudulent transactions by 70% and manual review load by 55%. What actually worked, what the model could not solve on its own, and the three pieces we built before the model went live.
The setup
A US-based card payments processor running a high-throughput platform across card-present and card-not-present flows, serving ~20M transactions per month at a 99.999% availability target. Fraud and chargebacks were being caught by a mix of static rules and a manual review team. The catch rate was acceptable, but the false-positive rate was bleeding good customers, and the review team had a multi-hour backlog on Mondays.
The goal was never to replace humans with a model. The goal was to make humans review fewer, higher-signal cases, and to decline, with confidence, the obvious bad actors before they reached human eyes.
Three things we built before the model went anywhere near production
First, a labelled-event store. Every decision (rule fire, human verdict, chargeback outcome) was written to a single event log with stable schemas and the inputs the decision was made on. Without this, we would have had no honest training data and no way to measure regressions.
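In sketch form, the record looked something like this, assuming a JSON-lines log; the field names (txn_id, decision_source, verdict) are illustrative, not our production schema:

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class DecisionEvent:
    txn_id: str
    decision_source: str     # "rule" | "human" | "chargeback"
    verdict: str             # e.g. "clear", "decline", "fraud_confirmed"
    features: dict           # the exact inputs the decision was made on
    schema_version: int = 1  # stable schemas: bump explicitly, never mutate in place
    ts: float = field(default_factory=time.time)

def append_event(log_path: str, event: DecisionEvent) -> None:
    """Append one decision event to the shared log (JSON lines)."""
    with open(log_path, "a") as fh:
        fh.write(json.dumps(asdict(event)) + "\n")
```

The key property is that the features captured are the ones the decision actually saw, not a later reconstruction.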
Second, a single review queue with priority. Rules and (later) model scores wrote into the same queue with a numeric priority. The queue was the only surface the review team worked from. This made it possible to swap scoring logic without disrupting the team.
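Conceptually the queue looked like the sketch below, with heapq standing in for the durable queue that actually backed it; names are illustrative:

```python
import heapq

class ReviewQueue:
    def __init__(self):
        self._heap = []  # (-priority, seq, txn_id, source): highest risk pops first
        self._seq = 0

    def enqueue(self, txn_id: str, priority: float, source: str) -> None:
        # Rules and model scores share one numeric priority scale, so the
        # scoring logic can be swapped without touching reviewer tooling.
        self._seq += 1
        heapq.heappush(self._heap, (-priority, self._seq, txn_id, source))

    def next_case(self):
        """Highest-priority case for the review team, or None if empty."""
        if not self._heap:
            return None
        _, _, txn_id, source = heapq.heappop(self._heap)
        return txn_id, source
```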
Third, a rollback story. Every deployment of the scorer wrote a shadow score alongside the live decision for one week before it was allowed to influence anything. Promotion required a written diff of cohort movement vs. the previous scorer. No silent promotions.
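A minimal sketch of that discipline, assuming callable scorers and an append-only log (the function names are ours for illustration):

```python
import bisect

def score_transaction(txn, live_scorer, shadow_scorer, event_log):
    live = live_scorer(txn)     # this scorer decides
    shadow = shadow_scorer(txn) # this scorer only watches
    event_log.append({"txn_id": txn["id"], "live": live, "shadow": shadow})
    return live                 # the shadow score never influences the decision

def cohort_diff(event_log, band_edges):
    """Fraction of transactions that change risk band under the candidate scorer."""
    moved = 0
    for row in event_log:
        live_band = bisect.bisect(sorted(band_edges), row["live"])
        shadow_band = bisect.bisect(sorted(band_edges), row["shadow"])
        moved += live_band != shadow_band
    return moved / max(len(event_log), 1)
```

The written promotion diff was essentially this number broken out per cohort, attached to the deploy.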
The model
A gradient-boosted scorer (XGBoost) over engineered features: velocity windows on card and device, geolocation deltas, mismatch features between billing and shipping, BIN reputation, and merchant-segment risk scores. Trained on six months of labelled events, with chargeback ground truth back-filled on a 60-day lag.
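A hedged sketch of the feature shape and the training call, assuming pandas frames and the xgboost sklearn API; column names, windows, and hyperparameters here are illustrative, not the production pipeline:

```python
import pandas as pd
from xgboost import XGBClassifier

def velocity_features(txns: pd.DataFrame) -> pd.DataFrame:
    # Assumes "ts" is a datetime64 column; sort so windows are well-defined.
    txns = txns.sort_values("ts")
    out = pd.DataFrame(index=txns.index)
    # Rolling count of transactions per card over a 1-hour window.
    out["card_txns_1h"] = (
        txns.groupby("card_id")
            .rolling("1h", on="ts")["amount"].count()
            .reset_index(level=0, drop=True)
    )
    # Billing/shipping mismatch as a binary feature.
    out["addr_mismatch"] = (txns["billing_zip"] != txns["shipping_zip"]).astype(int)
    return out

# Labels come from the event store; because chargebacks back-fill on a
# 60-day lag, the most recent 60 days are excluded from training.
model = XGBClassifier(n_estimators=400, max_depth=6, learning_rate=0.05)
# model.fit(X_train, y_train)  # X_train: engineered features, y_train: fraud labels
```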
The model did not autonomously decline anything in the first six weeks. Its score reordered the review queue and auto-cleared the bottom decile under monitoring. Each Monday we reviewed the auto-clear cohort against a sampled re-review and adjusted thresholds.
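The routing logic amounted to the sketch below, assuming scores in [0, 1] where higher means riskier; the 10% band and the 2% sample rate are illustrative:

```python
import random
import numpy as np

def route(scores: np.ndarray, txn_ids: list, sample_rate: float = 0.02):
    cutoff = np.quantile(scores, 0.10)   # bottom decile of the score distribution
    auto_cleared, queued, re_review = [], [], []
    for txn_id, s in zip(txn_ids, scores):
        if s <= cutoff:
            auto_cleared.append(txn_id)
            if random.random() < sample_rate:
                re_review.append(txn_id) # sampled back in to audit the band
        else:
            queued.append((s, txn_id))   # everything else is ranked for humans
    return auto_cleared, queued, re_review
```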
Once we had two months of post-promotion ground truth, we widened auto-clear and added a narrow auto-decline band at the top of the distribution. The bands grew per cohort: card-not-present first, card-present later, recurring billing never.
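The per-cohort bands were a config, not code. A sketch of the shape, with thresholds that are illustrative rather than the real numbers; the point is that each cohort owns its own bands and some cohorts never get automation:

```python
BANDS = {
    # cohort: thresholds; None disables that band entirely
    "card_not_present": {"auto_clear_below": 0.15, "auto_decline_above": 0.98},
    "card_present":     {"auto_clear_below": 0.10, "auto_decline_above": None},
    "recurring":        {"auto_clear_below": None, "auto_decline_above": None},
}

def decide(cohort: str, score: float) -> str:
    band = BANDS[cohort]
    if band["auto_decline_above"] is not None and score >= band["auto_decline_above"]:
        return "auto_decline"
    if band["auto_clear_below"] is not None and score <= band["auto_clear_below"]:
        return "auto_clear"
    return "queue_for_review"
```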
What we tried that did not pay off
A graph-based ring-detection layer over device and card co-occurrence looked great in offline eval. In production the lift was real but small, and the operational cost (nightly graph rebuilds, latency, on-call surface) did not justify it at our volume. We kept the feature pipeline and shelved the model.
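For the record, the shape of what we shelved, in miniature: cards and devices as nodes, an edge per shared transaction, large connected components flagged as candidate rings. networkx stands in for the nightly batch job, and the size threshold is illustrative:

```python
import networkx as nx

def candidate_rings(txns, min_size: int = 10):
    g = nx.Graph()
    for t in txns:  # txns: iterable of dicts with card_id and device_id
        g.add_edge(("card", t["card_id"]), ("device", t["device_id"]))
    return [c for c in nx.connected_components(g) if len(c) >= min_size]
```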
We over-invested in probability calibration early. At our volumes, well-ordered scores were more valuable than well-calibrated ones. The reviewers worked off a ranked queue, not a probability, and the bands were tuned empirically.
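A small, synthetic illustration of why: any monotone transform of the scores leaves the ranked queue (and AUC) untouched while the calibration metric moves, so calibration work bought the reviewers nothing.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                  # synthetic fraud labels
scores = y * 0.3 + rng.random(1000) * 0.7          # noisy but informative scores

squashed = scores ** 4                             # monotone: same order, new values
print(roc_auc_score(y, scores), roc_auc_score(y, squashed))        # identical AUC
print(brier_score_loss(y, scores), brier_score_loss(y, squashed))  # calibration shifts
```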
The lesson worth keeping
Fraud ML projects do not fail because the model is bad. They fail because the team did not build the labelling, the queue, and the rollback story before the model went live. Build those three things first and you can ship a worse model that wins.
The 70% reduction in fraudulent transactions and the 55% reduction in manual review load did not come from the model alone. They came from the labelled event store that made every iteration honest, and the queue that let the team move faster without changing how they worked.