Abstract
Recent discoveries have revealed that deep neural networks can behave in a biased manner in many real-world scenarios. Existing debiasing methods suffer from high costs in bias labeling or model re-training, and they offer little insight into where the biases within a model originate. We propose a fast model debiasing framework (FMD) which offers an efficient approach to identify, evaluate, and remove biases inherent in trained models. FMD identifies biased attributes through an explicit counterfactual concept and quantifies the influence of data samples with influence functions. Moreover, we design a machine unlearning-based strategy to efficiently and effectively remove the bias in a trained model with a small counterfactual dataset.
Summary
Fast Model Debias (FMD) is a three-stage framework for post-hoc model debiasing that uses machine unlearning to remove learned biases without retraining. Unlike fair unlearning methods that modify the training objective, FMD operates on already-trained models by: (1) identifying biases through counterfactual fairness analysis, (2) quantifying which training samples contribute most to bias via a novel influence function on bias metrics, and (3) unlearning the identified harmful samples using Newton-step updates with counterfactual replacement.
The key innovation is the “influence on bias” function I_{up,bias}(z_k, B(θ̂)) which measures how each training sample z_k contributes to a chosen bias metric B (e.g., counterfactual bias, demographic parity). Harmful samples (those increasing bias) are then unlearned not just by removing their influence, but by replacing them with their counterfactual versions, ensuring fairness is actively promoted. An alternative strategy uses cheap external counterfactual datasets when training data is unavailable.
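The influence-on-bias score can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation: it assumes a logistic-regression model and uses the demographic-parity gap as the bias metric B (one of the options the summary mentions); the helper names (`loss_grad`, `bias_grad`, `influence_on_bias`) are invented for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(theta, x, y):
    # ∇_θ L(z, θ): gradient of the logistic loss for one sample z = (x, y).
    return (sigmoid(x @ theta) - y) * x

def loss_hessian(theta, X, damping=1e-3):
    # Hessian of the mean logistic loss, damped so it is safely invertible.
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X) + damping * np.eye(X.shape[1])

def bias_grad(theta, X, a):
    # ∇_θ B(θ) for an illustrative bias metric B: the demographic-parity gap
    # E[f(x) | a=1] − E[f(x) | a=0] over a binary protected attribute a.
    p = sigmoid(X @ theta)
    g = (p * (1 - p))[:, None] * X
    return g[a].mean(axis=0) - g[~a].mean(axis=0)

def influence_on_bias(theta, X, y, a):
    # I_up,bias(z_k, B) = −∇_θ B(θ̂)ᵀ H_θ̂⁻¹ ∇_θ L(z_k, θ̂) for every z_k.
    # Positive scores flag "harmful" samples: upweighting them increases B.
    v = bias_grad(theta, X, a) @ np.linalg.inv(loss_hessian(theta, X))
    return np.array([-v @ loss_grad(theta, x_k, y_k) for x_k, y_k in zip(X, y)])
```

Because ∇B and H⁻¹ are shared across samples, one inverse-Hessian-vector product is computed once and then dotted with each per-sample gradient, which is what makes scoring the whole training set cheap.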
Key Contributions
- Counterfactual inference-based bias identification that quantitatively measures the degree of bias
- Novel influence function on bias (I_{up,bias}) that traces bias back to individual training samples
- Unlearning-based debiasing via Newton step with counterfactual sample replacement
- Alternative method using external counterfactual datasets when training data is unavailable
- Demonstrated on deep networks (ResNet, BERT, GPT-2) and tabular data
Methodology
Three-stage pipeline:
1. Bias Identification: construct a counterfactual dataset D_ex by flipping protected attributes; measure the counterfactual bias B(c_i, A, θ̂).
2. Biased-Effect Evaluation: compute I_{up,bias}(z_k, B) = -∇_θ B(θ̂)ᵀ H_θ̂⁻¹ ∇_θ L(z_k, θ̂) for each training sample z_k.
3. Bias Removal: select the top-K harmful samples and apply the Newton update θ_new = θ̂ + Σ_k H_θ̂⁻¹ (∇_θ L(z_k, θ̂) - ∇_θ L(z̃_k, θ̂)), where z̃_k is the bias-conflicting counterfactual of z_k.
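The Newton-step removal in stage (3) can be sketched as follows. This is an illustrative sketch under the same simplifying assumption of a logistic-regression model; `newton_unlearn` and the gradient/Hessian helpers are names invented here, not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(theta, x, y):
    # ∇_θ L(z, θ) for the logistic loss on a single sample z = (x, y).
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, damping=1e-3):
    # Hessian of the mean logistic loss, damped for invertibility.
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X) + damping * np.eye(X.shape[1])

def newton_unlearn(theta, X, y, X_cf, y_cf, harmful_idx):
    # θ_new = θ̂ + H_θ̂⁻¹ Σ_k (∇_θ L(z_k, θ̂) − ∇_θ L(z̃_k, θ̂)):
    # each harmful sample z_k is swapped for its counterfactual z̃_k
    # in a single Newton step, with no retraining.
    delta = sum(grad(theta, X[k], y[k]) - grad(theta, X_cf[k], y_cf[k])
                for k in harmful_idx)
    return theta + np.linalg.inv(hessian(theta, X)) @ delta
```

Subtracting ∇L(z_k) removes the harmful sample's pull on the parameters while adding ∇L(z̃_k) injects its bias-conflicting counterfactual, which is why this update actively promotes fairness rather than merely deleting influence.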
Key Findings
- FMD achieves the lowest bias with competitive accuracy using only 500-5000 counterfactual samples vs. 26,904+ for baselines
- Debiasing time is 2-3 seconds vs. hundreds of seconds for in-processing methods on CelebA
- Harmful samples are bias-aligned (e.g., <blonde, female> pairs) while helpful samples are bias-conflicting
- Counterfactual-based unlearning (Eq. 8) outperforms simple removal (Eq. 7) in both accuracy and bias
- External dataset strategy (Eq. 9) provides satisfactory debiasing when training data is unavailable
- Effective on LLMs: reduces stereotypical associations in BERT and GPT-2 on StereoSet
Important References
- Understanding Black-box Predictions via Influence Functions — Koh & Liang (2017), influence function framework
- Certified Data Removal from Machine Learning Models — Guo et al. (2020), Newton-step unlearning
- Counterfactual Fairness — Kusner et al. (2017), counterfactual fairness definition