Summary

Fast Model Debias (FMD) is a three-stage framework for post-hoc model debiasing that uses machine unlearning to remove learned biases without retraining. Unlike in-processing fairness methods that modify the training objective, FMD operates on already-trained models by: (1) identifying biases through counterfactual fairness analysis, (2) quantifying which training samples contribute most to bias via a novel influence function on bias metrics, and (3) unlearning the identified harmful samples using Newton-step updates with counterfactual replacement.

The key innovation is the “influence on bias” function I_{up,bias}(z_k, B(θ̂)) which measures how each training sample z_k contributes to a chosen bias metric B (e.g., counterfactual bias, demographic parity). Harmful samples (those increasing bias) are then unlearned not just by removing their influence, but by replacing them with their counterfactual versions, ensuring fairness is actively promoted. An alternative strategy uses cheap external counterfactual datasets when training data is unavailable.
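The influence-on-bias score can be sketched for a small model where the Hessian is cheap. Below is a minimal illustration on logistic regression with closed-form gradients; the function names (`influence_on_bias`, `bias_grad`) are illustrative choices, not identifiers from the paper's code, and a damped Hessian inverse stands in for whatever approximation a deep network would require.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(theta, x, y):
    # Per-sample gradient of the logistic loss w.r.t. theta
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, reg=1e-3):
    # Hessian of the mean logistic loss, with small damping for invertibility
    p = sigmoid(X @ theta)
    w = p * (1 - p)
    return (X.T * w) @ X / len(X) + reg * np.eye(len(theta))

def influence_on_bias(theta, X, y, bias_grad):
    # I_{up,bias}(z_k) = -grad(B)^T H^{-1} grad(L(z_k)); a positive score
    # marks z_k as harmful (upweighting it would increase the bias metric B)
    v = np.linalg.inv(hessian(theta, X)) @ bias_grad  # solve once, reuse per sample
    return np.array([-v @ loss_grad(theta, x, t) for x, t in zip(X, y)])
```

Note the key efficiency trick: the Hessian-inverse-vector product `v` depends only on the bias gradient, so it is computed once and then dotted with each sample's loss gradient.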

Key Contributions

  • Counterfactual inference-based bias identification that quantitatively measures the degree of learned bias
  • Novel influence function on bias (I_{up,bias}) that traces bias back to individual training samples
  • Unlearning-based debiasing via Newton step with counterfactual sample replacement
  • Alternative method using external counterfactual datasets when training data is unavailable
  • Demonstrated on deep networks (ResNet, BERT, GPT-2) and tabular data

Methodology

Three-stage pipeline:

  1. Bias Identification: Construct a counterfactual dataset D_ex by flipping protected attributes; measure the counterfactual bias B(c_i, A, θ̂).
  2. Biased-Effect Evaluation: Compute I_{up,bias}(z_k, B) = -∇_θ B(θ̂)ᵀ H_θ̂⁻¹ ∇_θ L(z_k, θ̂) for each training sample.
  3. Bias Removal: Select the top-K harmful samples and apply the Newton-step update θ_new = θ̂ + Σ_k H_θ̂⁻¹(∇_θ L(z_k, θ̂) - ∇_θ L(z̃_k, θ̂)), where z̃_k is the bias-conflicting counterfactual of z_k.
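The stage-3 update can be sketched in the same toy logistic-regression setting: for each harmful sample, subtract its loss gradient and add that of its counterfactual, then precondition the summed difference with the inverse Hessian. The helper name `counterfactual_newton_update` and the closed-form gradients are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(theta, x, y):
    # Per-sample gradient of the logistic loss w.r.t. theta
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, reg=1e-3):
    # Damped Hessian of the mean logistic loss
    p = sigmoid(X @ theta)
    w = p * (1 - p)
    return (X.T * w) @ X / len(X) + reg * np.eye(len(theta))

def counterfactual_newton_update(theta, X, y, harmful_idx, X_cf, y_cf):
    # theta_new = theta + H^{-1} * sum_k (grad L(z_k) - grad L(z_tilde_k)):
    # each harmful sample's gradient contribution is removed and replaced
    # by that of its bias-conflicting counterfactual z_tilde_k
    delta = np.zeros_like(theta)
    for k, x_cf, t_cf in zip(harmful_idx, X_cf, y_cf):
        delta += loss_grad(theta, X[k], y[k]) - loss_grad(theta, x_cf, t_cf)
    return theta + np.linalg.inv(hessian(theta, X)) @ delta
```

Setting z̃_k = z_k makes the gradient difference vanish and leaves θ̂ unchanged, which matches the intuition that replacement only moves the model insofar as the counterfactual disagrees with the original sample.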

Key Findings

  • FMD achieves the lowest bias with competitive accuracy using only 500-5,000 counterfactual samples, vs. 26,904+ for baselines
  • Debiasing time is 2-3 seconds vs. hundreds of seconds for in-processing methods on CelebA
  • Harmful samples are bias-aligned (e.g., <blonde, female> pairs) while helpful samples are bias-conflicting
  • Counterfactual-based unlearning (Eq. 8) outperforms simple removal (Eq. 7) in both accuracy and bias
  • External dataset strategy (Eq. 9) provides satisfactory debiasing when training data is unavailable
  • Effective on LLMs: reduces stereotypical associations in BERT and GPT-2 on StereoSet

Important References

  1. Understanding Black-box Predictions via Influence Functions — Koh & Liang (2017), influence function framework
  2. Certified Data Removal from Machine Learning Models — Guo et al. (2020), Newton-step unlearning
  3. Counterfactual Fairness — Kusner et al. (2017), counterfactual fairness definition

Atomic Notes


paper