Abstract
Recent discoveries have revealed that deep neural networks can behave in a biased manner in many real-world scenarios. Existing debiasing methods suffer from high costs in bias labeling or model re-training, and they offer little insight into where the biases within a model originate. We propose a fast model debiasing framework (FMD) which offers an efficient approach to identify, evaluate, and remove biases inherent in trained models. FMD identifies biased attributes through an explicit counterfactual concept and quantifies the influence of data samples with influence functions. Moreover, we design a machine unlearning-based strategy to efficiently and effectively remove the bias in a trained model with a small counterfactual dataset.
Summary
Fast Model Debias (FMD) is a three-stage framework for post-hoc model debiasing that uses machine unlearning to remove learned biases without retraining. Unlike fair unlearning methods that modify the training objective, FMD operates on already-trained models by: (1) identifying biases through counterfactual fairness analysis, (2) quantifying which training samples contribute most to bias via a novel influence function on bias metrics, and (3) unlearning the identified harmful samples using Newton-step updates with counterfactual replacement.
The key innovation is the “influence on bias” function I_{up,bias}(z_k, B(θ̂)) which measures how each training sample z_k contributes to a chosen bias metric B (e.g., counterfactual bias, demographic parity). Harmful samples (those increasing bias) are then unlearned not just by removing their influence, but by replacing them with their counterfactual versions, ensuring fairness is actively promoted. An alternative strategy uses cheap external counterfactual datasets when training data is unavailable.
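The influence-on-bias score can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation: it assumes a logistic-regression model and uses the demographic-parity gap as the bias metric B (one of the options the summary mentions); the helper names (`loss_grad`, `bias_grad`, `influence_on_bias`) are invented for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(theta, x, y):
    # ∇_θ L(z, θ): gradient of the logistic loss for one sample z = (x, y).
    return (sigmoid(x @ theta) - y) * x

def loss_hessian(theta, X, damping=1e-3):
    # Hessian of the mean logistic loss, damped so it is safely invertible.
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X) + damping * np.eye(X.shape[1])

def bias_grad(theta, X, a):
    # ∇_θ B(θ) for an illustrative bias metric B: the demographic-parity gap
    # E[f(x) | a=1] − E[f(x) | a=0] over a binary protected attribute a.
    p = sigmoid(X @ theta)
    g = (p * (1 - p))[:, None] * X
    return g[a].mean(axis=0) - g[~a].mean(axis=0)

def influence_on_bias(theta, X, y, a):
    # I_up,bias(z_k, B) = −∇_θ B(θ̂)ᵀ H_θ̂⁻¹ ∇_θ L(z_k, θ̂) for every z_k.
    # Positive scores flag "harmful" samples: upweighting them increases B.
    v = bias_grad(theta, X, a) @ np.linalg.inv(loss_hessian(theta, X))
    return np.array([-v @ loss_grad(theta, x_k, y_k) for x_k, y_k in zip(X, y)])
```

Because ∇B and H⁻¹ are shared across samples, one inverse-Hessian-vector product is computed once and then dotted with each per-sample gradient, which is what makes scoring the whole training set cheap.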
Key Contributions
- Counterfactual inference-based bias identification that quantitatively measures the degree of bias
- Novel influence function on bias (I_{up,bias}) that traces bias back to individual training samples
- Unlearning-based debiasing via Newton step with counterfactual sample replacement
- Alternative method using external counterfactual datasets when training data is unavailable
- Demonstrated on deep networks (ResNet, BERT, GPT-2) and tabular data
Methodology
Three-stage pipeline:
1. Bias Identification: construct a counterfactual dataset D_ex by flipping protected attributes; measure the counterfactual bias B(c_i, A, θ̂).
2. Biased-Effect Evaluation: compute I_{up,bias}(z_k, B) = -∇_θ B(θ̂)ᵀ H_θ̂⁻¹ ∇_θ L(z_k, θ̂) for each training sample z_k.
3. Bias Removal: select the top-K harmful samples and apply the Newton update θ_new = θ̂ + Σ_k H_θ̂⁻¹ (∇_θ L(z_k, θ̂) - ∇_θ L(z̃_k, θ̂)), where z̃_k is the bias-conflicting counterfactual of z_k.
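The Newton-step removal in stage (3) can be sketched as follows. This is an illustrative sketch under the same simplifying assumption of a logistic-regression model; `newton_unlearn` and the gradient/Hessian helpers are names invented here, not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(theta, x, y):
    # ∇_θ L(z, θ) for the logistic loss on a single sample z = (x, y).
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, damping=1e-3):
    # Hessian of the mean logistic loss, damped for invertibility.
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X) + damping * np.eye(X.shape[1])

def newton_unlearn(theta, X, y, X_cf, y_cf, harmful_idx):
    # θ_new = θ̂ + H_θ̂⁻¹ Σ_k (∇_θ L(z_k, θ̂) − ∇_θ L(z̃_k, θ̂)):
    # each harmful sample z_k is swapped for its counterfactual z̃_k
    # in a single Newton step, with no retraining.
    delta = sum(grad(theta, X[k], y[k]) - grad(theta, X_cf[k], y_cf[k])
                for k in harmful_idx)
    return theta + np.linalg.inv(hessian(theta, X)) @ delta
```

Subtracting ∇L(z_k) removes the harmful sample's pull on the parameters while adding ∇L(z̃_k) injects its bias-conflicting counterfactual, which is why this update actively promotes fairness rather than merely deleting influence.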
Key Findings
- FMD achieves the lowest bias with competitive accuracy using only 500-5000 counterfactual samples vs. 26,904+ for baselines
- Debiasing time is 2-3 seconds vs. hundreds of seconds for in-processing methods on CelebA
- Harmful samples are bias-aligned (e.g., <blonde, female> pairs) while helpful samples are bias-conflicting
- Counterfactual-based unlearning (Eq. 8) outperforms simple removal (Eq. 7) in both accuracy and bias
- External dataset strategy (Eq. 9) provides satisfactory debiasing when training data is unavailable
- Effective on LLMs: reduces stereotypical associations in BERT and GPT-2 on StereoSet
Important References
- Understanding Black-box Predictions via Influence Functions — Koh & Liang (2017), influence function framework
- Certified Data Removal from Machine Learning Models — Guo et al. (2020), Newton-step unlearning
- Counterfactual Fairness — Kusner et al. (2017), counterfactual fairness definition