Data Ingestion & Lakehouse Setup

Functions for moving raw CSV data into the MinIO Lakehouse as partitioned Parquet.

baflakehouse-package
baflakehouse: Lakehouse Workflow for the Bank Account Fraud Dataset
convert_to_parquet()
Convert BAF CSV to partitioned Parquet in MinIO (S3)
connect_baf()
Connect to the BAF dataset on MinIO (Arrow or DuckDB)
clean_baf_base()
Clean the BAF Base dataset and write to 03_primary

Feature Engineering & Preprocessing

Recipes and transformations applied across the pipeline layers.

engineer_features()
Engineer features for the BAF dataset
generate_model_inputs()
Generate Resampled Model Inputs
build_eda_recipe()
Build EDA Recipe
build_baf_recipe()
Build Untrained BAF Recipe

Exploratory Data Analysis

Diagnostic model and visualizations for understanding the fraud signal.

train_diag_model()
Train Diagnostic Model
plot_var_imp()
Plot Variable Importance
plot_hexbin_interaction()
Plot Hexbin Interaction
plot_missingness()
Plot Missingness Signal
plot_num_cor()
Plot Numeric Correlation Matrix

Model Selection & Tuning

Imbalance strategy tournament, hyperparameter tuning, and results formatting.

run_imbalance_tournament()
Run Class Imbalance Tournament
tune_lgbm()
Tune LightGBM Hyperparameters
format_tournament_gt()
Format Tournament Results Table
plot_efficiency()
Plot Effectiveness vs Efficiency

Final Evaluation & Production Deployment

Holdout evaluation on months 6-7 and serialization of the model artifact to MinIO.

evaluate_final_model()
Final Model Evaluation (Months 6 & 7)
train_production_model()
Train and Serialize Production LightGBM Model

Reporting

Figures, tables, and slide rendering for the Quarto presentation.

plot_fraud_by_month()
Plot applications by month (Legit vs Fraud) on a log scale
plot_conf_mat_heatmap()
Plot Confusion Matrix Heatmap
compute_fraud_by_month()
Fraud prevalence by month (counts + percent)
format_fraud_by_month_gt()
Format fraud-by-month table as a gt object
save_report_figure()
Save a report figure artifact
save_report_table()
Save a report table artifact
render_slides()
Render Quarto revealjs slideshow after required assets exist