BAF Fraud Modeling

Rob Wiederstein

February 23, 2026

Introduction

Bank Account Fraud Dataset

Synthetic online account applications
1M rows (Base)
8 months (0–7)
Base + 5 biased variants
Label: Fraud vs Legit
Fraud \(\approx 1\%\)

What it is (plain English): each row is an online application to open a bank account. Fraudsters may impersonate someone (identity theft) or invent a person; once approved, they quickly exploit the credit line or use the account to move illicit funds.

Why it exists: the BAF suite was created as a large, realistic benchmark to stress-test ML performance and fairness under dynamic / drifting conditions and “extreme” class imbalance. The variants introduce controlled bias patterns; the Base set has no induced bias.

How it was made: the released data are synthetic, generated by a CTGAN trained on an anonymized, feature-engineered real dataset. Because of these privacy protections, no row corresponds to a real, identifiable person.

Time structure: the month field ranges from 0 to 7 (eight months). This is why we evaluate chronologically: train on early months, test on later months.
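A chronological split can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: rows are plain dictionaries, and both the `month` key and the cutoff `last_train_month=5` are assumptions chosen for the example.

```python
def chronological_split(rows, last_train_month=5):
    """Split rows by time: months <= cutoff go to train, later months to test.

    The cutoff of 5 is a hypothetical choice, not the deck's actual setting.
    """
    train = [r for r in rows if r["month"] <= last_train_month]
    test = [r for r in rows if r["month"] > last_train_month]
    return train, test


# Toy rows: one application per month, months 0 through 7.
rows = [{"month": m, "fraud_bool": 0} for m in range(8)]
train, test = chronological_split(rows)
print([r["month"] for r in train])  # [0, 1, 2, 3, 4, 5]
print([r["month"] for r in test])   # [6, 7]
```

Because every test month comes strictly after every training month, the model is never scored on data older than what it learned from, which a random split would not guarantee.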

Target variable: the datasheet label is fraud_bool (0/1). In our pipeline we recode it as outcome, with labels “Legit” and “Fraud”, for readability.

Typical Scenario

Fraudsters will

Impersonate someone or
Create a fake identity, then
Max out the credit line or
Receive illicit payments

Data Cleaning

Relabel outcome.
-1 → NA.
Negative amount → NA.
Write clean Parquet.
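The first three cleaning steps can be sketched on a single record as below. The column names (`prev_address_months_count`, `bank_months_count`, `intended_balcon_amount`) are assumptions used for illustration, and `clean_row` is a hypothetical helper, not the pipeline's function. The Parquet write is left out because it needs a third-party library.

```python
# Assumed column names; adjust to the actual schema.
SENTINEL_COLS = ("prev_address_months_count", "bank_months_count")
AMOUNT_COL = "intended_balcon_amount"


def clean_row(row):
    """Return a cleaned copy of one application record (hypothetical helper)."""
    out = dict(row)
    # Relabel the 0/1 target as a readable outcome.
    out["outcome"] = "Fraud" if out.pop("fraud_bool") == 1 else "Legit"
    # -1 is a missing-value sentinel, not a real count.
    for col in SENTINEL_COLS:
        if out.get(col) == -1:
            out[col] = None
    # A negative amount is treated as missing.
    amount = out.get(AMOUNT_COL)
    if amount is not None and amount < 0:
        out[AMOUNT_COL] = None
    return out


raw = {"fraud_bool": 1, "prev_address_months_count": -1,
       "bank_months_count": 12, "intended_balcon_amount": -0.7}
clean = clean_row(raw)
print(clean)
```

Recoding sentinels to explicit missing values keeps -1 from being read as a real (and very small) count, and lets missingness itself be examined as a signal.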

Explore

Variable Importance

Figure 1: Top 15 features driving the diagnostic model.

Feature Interaction

Figure 2: Interaction between Credit Risk Score and Address History.

Missingness Signal

Figure 3: Missingness rates by outcome.

Numeric Correlation

Figure 4: Core numeric correlation matrix.

LightGBM

About

Originally released in 2016
Maintained by Microsoft
Over 18,000 stars on GitHub
A frequent choice in winning Kaggle solutions for tabular data
Original paper cited over 23,000 times
Up to 20x faster training than conventional gradient boosting, per the original paper

For tabular supervised learning, gradient boosted decision trees—most notably XGBoost and LightGBM—are strong, low-latency baselines because they exploit hand-engineered behavioral features; LightGBM remains a standard reference point for card and e-commerce fraud tasks [1]

[W]e found that the LightGBM approach had the highest detection accuracy of fraudulent activity with 97% in the experiments conducted. An additional key objective of reducing false alerts was accomplished, as the number of false alarms went from 13,024 to 6,249.[2]

[W]e choose LightGBM as the base machine learning model due to its efficiency and widespread use in handling large-scale and structured datasets, particularly in financial domains such as credit card fraud detection.[3]

Unbalanced Classes

The Challenge

The scarce occurrences of rare events impair the detection task …

Bank Fraud Prevalence

Figure 5: Fraudulent versus legitimate applications by month.

Fraud Prevalence

Table 1: Fraudulent and legitimate applications by month, with the monthly fraud rate.

Month	Fraud	Legit	Total	% Fraud
0	1,500	130,940	132,440	1.13
1	1,198	126,422	127,620	0.94
2	1,198	135,781	136,979	0.87
3	1,392	149,544	150,936	0.92
4	1,452	126,239	127,691	1.14
5	1,411	117,912	119,323	1.18
6	1,450	106,718	108,168	1.34
7	1,428	95,415	96,843	1.47
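The "% Fraud" column is simply fraud divided by the monthly total. The short check below recomputes it from the counts in Table 1; the variable names are illustrative only.

```python
# (fraud, legit) counts per month, copied from Table 1.
counts = {
    0: (1500, 130940), 1: (1198, 126422), 2: (1198, 135781),
    3: (1392, 149544), 4: (1452, 126239), 5: (1411, 117912),
    6: (1450, 106718), 7: (1428, 95415),
}

# Monthly fraud rate in percent: fraud / (fraud + legit).
pct_fraud = {m: 100 * fraud / (fraud + legit)
             for m, (fraud, legit) in counts.items()}

total_fraud = sum(fraud for fraud, _ in counts.values())
total_rows = sum(fraud + legit for fraud, legit in counts.values())
overall_pct = 100 * total_fraud / total_rows

for month, pct in pct_fraud.items():
    print(f"month {month}: {pct:.2f}% fraud")
print(f"overall: {overall_pct:.2f}% of {total_rows:,} rows")
```

The monthly totals sum to the 1M rows of the Base set, and the rate climbs from 0.87% in month 2 to 1.47% in month 7, so the late test months are somewhat richer in fraud than the early training months.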

Methods Tested

Standard: Baseline (No sampling).
Weighted: Cost-sensitive learning (\(4\times\) penalty).
Undersampling: Random removal of majority class.
SMOTE: Synthetic Minority Over-sampling Technique.
ADASYN: Adaptive Synthetic Sampling (hard examples).
Tomek Links: Cleaning boundary ambiguity.
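Two of these strategies are simple enough to sketch without any library. The code below assumes the 4x penalty is applied as a per-row weight on fraud cases, and that undersampling keeps a fixed number of legit rows per fraud row; both are assumptions, as are the helper names and the `ratio` parameter. SMOTE, ADASYN, and Tomek Links need a neighbor search and are not shown.

```python
import random


def fraud_weights(labels, penalty=4.0):
    """Per-row weights: fraud rows (label 1) count `penalty` times as much."""
    return [penalty if y == 1 else 1.0 for y in labels]


def undersample(rows, labels, ratio=1.0, seed=0):
    """Keep every fraud row and a random subset of legit rows.

    `ratio` is the number of legit rows kept per fraud row (hypothetical).
    """
    rng = random.Random(seed)
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(legit_idx), int(ratio * len(fraud_idx)))
    keep = sorted(fraud_idx + rng.sample(legit_idx, n_keep))
    return [rows[i] for i in keep], [labels[i] for i in keep]


labels = [0] * 98 + [1] * 2          # toy data: 2% fraud
rows = list(range(len(labels)))      # stand-in for feature rows
weights = fraud_weights(labels)
small_rows, small_labels = undersample(rows, labels, ratio=3.0)
print(sum(small_labels), len(small_labels))  # 2 fraud rows out of 8 kept
```

Weighting leaves the data untouched and changes only the loss, while undersampling discards most of the majority class, which is why it trains fastest but can lose information.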

Strategy Showdown: Results

Table 2: Performance comparison across imbalance strategies using 3-month rolling windows.

Class Imbalance Strategy Showdown
Paired t-test comparison against the 'Standard' baseline
Strategy	Mean PR-AUC	Mean runtime	p-value vs Standard	Significant
SMOTE	0.1635	3.91	0.9032	No (ns)
Standard	0.1631	2.30	1.0000	-
ADASYN	0.1627	3.75	0.8361	No (ns)
Weighted	0.1614	2.22	0.3971	No (ns)
Tomek Links	0.1497	2.54	0.0501	No (ns)
Undersampling	0.1403	0.94	0.0505	No (ns)
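A paired t-test compares two strategies window by window, so each rolling window contributes one difference. The sketch below computes only the t statistic, using made-up per-window PR-AUC values (they are not the results behind Table 2). Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom, for example via `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic for paired samples: mean difference over its standard error.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical PR-AUC per rolling window for two strategies.
standard = [0.160, 0.170, 0.158, 0.165, 0.162]
smote = [0.162, 0.169, 0.160, 0.164, 0.163]

t = paired_t_statistic(smote, standard)
print(f"t = {t:.3f} on {len(standard) - 1} degrees of freedom")
```

Pairing matters here because PR-AUC varies a lot from window to window; comparing differences within each window removes that shared variation.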

Sampling Compared

Figure 6: PR-AUC performance versus computational training time.

Sampling Methods Discarded

No statistically significant gain over the baseline
Resource intensive
Poor scalability

Feature Creation

Final Results

The Confusion Matrix

Figure 7

Precision & Recall

\[\text{Recall} = \frac{TP}{TP + FN}\]

Of all actual frauds, how many did we catch?

\[\text{Precision} = \frac{TP}{TP + FP}\]

Of all flagged cases, how many were real fraud?
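Both formulas translate directly into code. In the sketch below the confusion-matrix counts are hypothetical, chosen only to show how the two metrics can diverge; `precision_recall` is an illustrative helper name.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts.

    An undefined ratio (zero denominator) is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: half of the frauds are caught, with many false alerts.
precision, recall = precision_recall(tp=714, fp=4771, fn=714)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

With these counts, recall is 0.5 while precision is about 0.13: catching half of the fraud still means most flagged applications are legitimate.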

ROC vs Precision-Recall AUC

ROC AUC

Plots Recall vs False Positive Rate
AUC = 0.5 is random; 1.0 is perfect
Optimistic under class imbalance
Inflated by the large TN pool

PR AUC

Plots Precision vs Recall
Focuses entirely on the minority class
Harder to game with a large Legit majority
Preferred metric for fraud detection
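The contrast can be seen on a toy example. The sketch below implements ROC AUC as the share of (fraud, legit) pairs ranked correctly, and PR AUC as average precision; the scores are made up, and ties between a fraud and a legit score are not specially handled in `average_precision`. Repeating every legit score ten times makes the data more imbalanced without changing how well the scores rank.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


fraud_scores = [0.9, 0.6]
legit_scores = [0.8, 0.3, 0.2]

for copies in (1, 10):
    scores = fraud_scores + legit_scores * copies
    labels = [1] * len(fraud_scores) + [0] * len(legit_scores) * copies
    print(copies, round(roc_auc(labels, scores), 4),
          round(average_precision(labels, scores), 4))
```

ROC AUC stays at 0.8333 in both cases, while average precision falls from 0.8333 to 0.5833 once the legit class is ten times larger, which is the sense in which ROC AUC is optimistic under imbalance.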

Final Model Evaluation

Figure 8: Confusion Matrix Heatmap (5% Decision Threshold)
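A confusion matrix at a 5% threshold comes from flagging every application whose predicted fraud probability reaches 0.05. The sketch below assumes the comparison is `>=` (the deck does not say whether the boundary is inclusive) and uses made-up probabilities; `confusion_at_threshold` is an illustrative helper.

```python
def confusion_at_threshold(y_true, p_fraud, threshold=0.05):
    """Count TP, FP, FN, TN when flagging rows with p_fraud >= threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for y, p in zip(y_true, p_fraud):
        flagged = p >= threshold
        if flagged:
            counts["TP" if y == 1 else "FP"] += 1
        else:
            counts["FN" if y == 1 else "TN"] += 1
    return counts


# Hypothetical labels (1 = fraud) and predicted fraud probabilities.
y_true = [1, 0, 0, 1, 0]
p_fraud = [0.30, 0.06, 0.01, 0.02, 0.04]
print(confusion_at_threshold(y_true, p_fraud))
# {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 2}
```

A threshold far below 0.5 makes sense when only about 1% of applications are fraudulent: few rows ever receive a high probability, so lowering the bar trades extra false alerts for more caught fraud.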

Diagnostic Metrics

Figure 9: ROC and Precision-Recall Curves for Out-of-Sample Data

References

[1] G. Aminian et al., “FraudTransformer: Time-Aware GPT for Transaction Fraud Detection.” arXiv, Oct. 2025. doi: 10.48550/arXiv.2509.23712.

[2] C. Iscan, O. Kumas, F. P. Akbulut, and A. Akbulut, “Wallet-Based Transaction Fraud Prevention Through LightGBM With the Focus on Minimizing False Alarms,” IEEE Access, vol. 11, pp. 131465–131474, 2023, doi: 10.1109/ACCESS.2023.3321666.

[3] X. Zhao, Y. Liu, and Q. Zhao, “Improved LightGBM for Extremely Imbalanced Data and Application to Credit Card Fraud Detection,” IEEE Access, vol. 12, pp. 159316–159335, 2024, doi: 10.1109/ACCESS.2024.3487212.