Python · Machine Learning · Fairness · SHAP · EU AI Act · Finance

Fair Lending Audit — Credit Risk Under the EU AI Act

When a model works and still fails.

2026

Banks do not read credit applications. Not at scale, not really. What actually happens is this: a machine learning model receives a row of numbers describing the applicant, runs it through a learned function, and returns a probability. High enough: approved. Below the threshold: rejected. The applicant never enters the picture. The model sees a feature vector. Account balance, loan history, employment status, requested amount. From those numbers, it constructs a judgement.

The EU Artificial Intelligence Act, entering full force in August 2026, classifies consumer credit scoring as a high-risk AI application under Annex III. The classification carries a specific obligation: models must not produce systematically discriminatory outcomes for legally protected groups. A scoring system that approves fewer women than equally creditworthy men, not because of financial differences but because it learned patterns from historically biased data, is non-compliant. Banks deploying non-compliant high-risk AI face fines of up to 15 million EUR or 3% of global annual turnover, whichever is higher. For a bank with 10 billion EUR in revenue, that is 300 million EUR. The incentive to pay attention is not subtle.

This creates a concrete compliance requirement, and regulatory goodwill is not sufficient to meet it. Under Art. 9 and Art. 10 of the Act, banks must actively identify, measure, and mitigate harmful bias in their models, then document that they did so. Most deployed scoring systems do not currently meet these requirements. The August 2026 deadline is closer than it sounds: conformity assessments and model audits take time. Banks starting this process now are acting on schedule. Banks that are not are accumulating exposure.

HOW COMPLIANCE IS MEASURED

Testing a model for discriminatory outcomes requires specific metrics. The primary one is the Disparate Impact Ratio (DIR): the approval rate of the protected group divided by the approval rate of the reference group. The 0.80 threshold, known as the four-fifths rule, originates from US employment law (EEOC Uniform Guidelines, 1978) and has since become the standard benchmark in algorithmic fairness practice. Worth noting: the EU AI Act does not actually mandate this specific number. Art. 10(2)(f) requires documented identification and mitigation of harmful biases. The 0.80 figure is the profession's answer to the question of where the line is. Below it, disparity is considered material and remediation becomes unavoidable.
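The DIR computation itself is a one-liner once the group approval rates are in hand. A minimal sketch with hypothetical decisions (not the notebook's actual data):

```python
def disparate_impact_ratio(approved, group, protected, reference):
    """approved: parallel list of 0/1 decisions; group: parallel group labels."""
    def rate(g):
        decisions = [a for a, grp in zip(approved, group) if grp == g]
        return sum(decisions) / len(decisions)
    return rate(protected) / rate(reference)

# Hypothetical outcomes: 2 of 4 women approved vs 3 of 4 men.
approved = [1, 0, 0, 1, 1, 1, 1, 0]
group    = ["f", "f", "f", "f", "m", "m", "m", "m"]
dir_value = disparate_impact_ratio(approved, group, protected="f", reference="m")
# 0.50 / 0.75 ≈ 0.67 — below the 0.80 four-fifths line, so material disparity.
```

Anything the function returns below 0.80 is where, per the benchmark, remediation becomes unavoidable.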

Three additional metrics decompose the problem further. Statistical Parity Difference (SPD) measures the raw approval-rate gap between groups. Equal Opportunity Difference (EOD) isolates the gap among creditworthy applicants specifically: people the model rejects despite being likely to repay. Predictive Parity Difference (PPD) tests whether the same risk score means the same thing for both groups. Each metric catches a different failure mode. A model can pass one and fail another. This one fails several.
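The three decomposition metrics reduce to differences of per-group rates: the raw approval rate (SPD), the approval rate among applicants who actually repay (EOD), and the repayment rate among approved applicants (PPD). A sketch on hypothetical labels and decisions, not the notebook's code:

```python
def group_rates(y_true, y_pred, group, g):
    """Approval rate, true-positive rate, and positive predictive value for
    one group. y_true: 1 if the applicant repaid; y_pred: 1 if approved."""
    rows = [(t, p) for t, p, grp in zip(y_true, y_pred, group) if grp == g]
    approve = sum(p for _, p in rows) / len(rows)
    tpr = sum(p for t, p in rows if t) / sum(t for t, _ in rows)
    ppv = sum(t for t, p in rows if p) / sum(p for _, p in rows)
    return approve, tpr, ppv

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group  = ["f", "f", "f", "f", "m", "m", "m", "m"]

a_f, tpr_f, ppv_f = group_rates(y_true, y_pred, group, "f")
a_m, tpr_m, ppv_m = group_rates(y_true, y_pred, group, "m")

spd = a_f - a_m      # raw approval-rate gap
eod = tpr_f - tpr_m  # approval gap among applicants who do repay
ppd = ppv_f - ppv_m  # gap in what an approval "means" for each group
```

In this toy example the model has a large EOD (it rejects half the creditworthy women it would approve as men) while PPD moves in the opposite direction, which is exactly why a single metric is not enough.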

WHAT THIS PROJECT DOES

This notebook applies the compliance framework above to a concrete case: the German Credit Data, 1,000 loan applications, 21 features, a binary repayment outcome. Two classifiers are trained, Logistic Regression and Gradient Boosting, neither of which uses gender as a direct input, as required under EU Directive 2004/113/EC. Age is also excluded as a modelling choice to limit proxy discrimination risk, though unlike gender it is not statutorily prohibited as a credit variable. Both models train on financial data only: account status, loan history, requested amount, duration, employment type. Then each is audited against all four fairness metrics, across both protected attributes.
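The two-model setup can be sketched as follows, assuming scikit-learn and a toy stand-in for the financial features (the data and column choices here are hypothetical, not the German Credit dataset itself). The structural point is that gender and age are held out of the design matrix entirely and kept aside only for the audit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400
# Hypothetical financial features: account balance, credit amount, duration.
X_fin = rng.normal(size=(n, 3))
gender = rng.integers(0, 2, size=n)   # protected attribute — never enters X
age = rng.integers(18, 70, size=n)    # excluded as a modelling choice
y = (X_fin[:, 0] + 0.5 * X_fin[:, 1] > 0).astype(int)  # toy repayment label

# Both models see financial variables only; the protected attributes exist
# in the pipeline solely so the fairness metrics can be computed afterwards.
log_reg = LogisticRegression().fit(X_fin, y)
gbm = GradientBoostingClassifier(random_state=0).fit(X_fin, y)
```

Both fitted models are then scored on the same held-out data and audited group by group; excluding a protected attribute from `X_fin` is necessary for compliance but, as the findings below show, nowhere near sufficient.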

The choice of two classifiers is deliberate. Logistic Regression is the industry standard for consumer credit because it is interpretable: regulators can ask what weight the model placed on any feature and receive a direct answer. Gradient Boosting tends to be more accurate but operates as a black box. Comparing the two makes a well-known trade-off concrete. The more powerful model is both more accurate and more discriminatory. That is what happens when a model is allowed to optimise harder on a dataset that carries demographic signal in its financial variables. More capacity means learning more, including the parts you would rather it did not.

FINDINGS

The gender DIR falls below 0.80: the model approves women at less than 80% of the rate at which it approves men with comparable financial profiles. The age DIR for applicants between 18 and 25 versus those over 50 is lower still. The model fails multiple fairness thresholds at once. Under the EU AI Act, this constitutes a documented compliance gap that triggers mandatory risk management and remediation obligations. The source is the training data: in this dataset, women default at 35% and men at 28%. The model learned that signal faithfully. That is precisely the problem. It was trained on historical data that already reflected structural economic disadvantage, not inherent differences in creditworthiness. A model can be doing its job perfectly and producing discriminatory outcomes at the same time.

The dominant predictor is account status: whether the applicant holds a checking account with a positive balance. Credit amount and loan duration follow. These are legitimate financial variables. They also correlate with age and, in the German context of this dataset, with gender. The model did not learn to discriminate directly. It learned the financial correlates of creditworthiness, and those correlates happen to be unevenly distributed across demographic groups. This is a subtler problem than explicit bias, and a harder one to fix. SHAP (SHapley Additive exPlanations) makes the decomposition visible at both portfolio level and for individual decisions, and provides the technical foundation for the separate explanation obligation under EU AI Act Art. 86 and GDPR Art. 22(3).
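What SHAP computes is an additive decomposition: each feature gets a share of the gap between one prediction and the average prediction, and the shares sum exactly to that gap. The notebook uses the shap library; the brute-force Shapley computation below, on a toy linear scorer with hypothetical weights, only illustrates the decomposition itself:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values; features absent from a coalition are filled in
    from the baseline (average-applicant) vector."""
    n = len(x)
    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi += weight * (value(set(S) | {i}) - value(set(S)))
        phis.append(phi)
    return phis

# Toy scorer over (hypothetical) account status, credit amount, duration.
w = [0.6, -0.3, 0.1]
score = lambda z: sum(wi * zi for wi, zi in zip(w, z))

x, base = [2.0, 1.0, 4.0], [0.0, 0.0, 0.0]
phi = shapley_values(score, x, base)
# Efficiency property: contributions sum to f(x) - f(baseline).
assert abs(sum(phi) - (score(x) - score(base))) < 1e-9
```

For a linear scorer the shares collapse to `w_i * (x_i - baseline_i)`; for a gradient-boosted model they do not, which is why the library's tree-specific explainer is needed in practice, but the additivity that makes a rejection explainable feature by feature is the same.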

THE COST OF INACTION

The regulatory exposure is the primary financial risk: fines of up to 15 million EUR or 3% of global annual turnover, plus the operational risk of BaFin restricting or suspending the scoring model entirely. Losing the ability to run automated credit decisioning is not a fine. It is a business interruption. On top of that sits a secondary revenue impact. Scaling the test-set false rejection rates to the full portfolio, at an assumed 8% annual interest rate, produces an estimated DM 138,412 in annual revenue at risk, of which DM 47,575 (34%) is attributable to the gender disparity. Relative to the regulatory ceiling, this number is small. It is included here not because it is the biggest risk, but because it makes the bias legible to an audience that does not speak compliance.
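The revenue figure is simple arithmetic once the false-rejection count is scaled to the portfolio. The inputs below are hypothetical placeholders, not the notebook's actual numbers; only the shape of the calculation is the point:

```python
def revenue_at_risk(n_false_rejections, avg_loan_amount, annual_rate):
    """Interest income lost on loans wrongly rejected per year."""
    return n_false_rejections * avg_loan_amount * annual_rate

# Hypothetical: 350 creditworthy applicants rejected portfolio-wide per year,
# average requested amount DM 4,000, 8% annual rate.
total = revenue_at_risk(n_false_rejections=350, avg_loan_amount=4000,
                        annual_rate=0.08)
gender_share = 0.34 * total  # portion attributable to the gender disparity
```

The fraction attributable to each protected group comes from re-running the false-rejection count with and without the group-specific disparity, which is how the 34% figure in the findings is derived.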

KEY_FINDINGS

regulatory_exposure = up to 15M EUR or 3% annual turnover  // EU AI Act Art. 99(4), high-risk AI

gender_DIR < 0.80  // fails four-fifths benchmark; triggers Art. 10(2)(f) remediation obligation

age_DIR (18-25 vs 51+) < 0.80  // youngest cohort most affected

top_predictor = status_account  // demographic proxy, not explicit attribute

revenue_at_risk = DM 138,412 / yr  // secondary cost; 34% attributable to gender bias

The audit does not conclude that the model should not be used. It concludes that the model cannot be deployed as-is without exposing the bank to regulatory sanction. The final module outlines four remediation steps in order of priority: rebalancing the training data, applying post-processing threshold adjustments, establishing continuous fairness monitoring in production, and integrating SHAP-based explanations into the rejection workflow. None of these are exotic. The gap between knowing about them and having actually deployed them is where the regulatory risk lives, and where the August 2026 deadline stops being abstract.
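Of the four remediation steps, post-processing threshold adjustment is the cheapest to prototype: keep the model, but calibrate a separate decision threshold for the protected group until the four-fifths benchmark is met. A minimal sketch on hypothetical scores; in practice the threshold is fit on a validation set, not the scores being corrected:

```python
def approval_rate(scores, threshold):
    return sum(s >= threshold for s in scores) / len(scores)

def adjust_threshold(protected_scores, reference_rate, start=1.0, step=0.01):
    """Lower the protected group's threshold until DIR >= 0.80."""
    t = start
    while approval_rate(protected_scores, t) < 0.8 * reference_rate and t > 0:
        t -= step
    return t

ref_scores  = [0.9, 0.8, 0.7, 0.6, 0.4]     # reference group, threshold 0.5
prot_scores = [0.85, 0.55, 0.45, 0.34, 0.2]

ref_rate = approval_rate(ref_scores, 0.5)               # 0.8
dir_before = approval_rate(prot_scores, 0.5) / ref_rate  # 0.5 — fails
t_new = adjust_threshold(prot_scores, ref_rate)
dir_after = approval_rate(prot_scores, t_new) / ref_rate  # >= 0.8
```

Whether group-specific thresholds are themselves legally permissible is jurisdiction-dependent, which is why the notebook ranks this behind rebalancing the training data rather than treating it as the fix.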

02 / analysis

Jupyter Notebook

> Full analysis — rendered from the original .ipynb file.

Source code: github/bvn3141