The influence on bias (I_{up,bias}) is an extension of classical influence functions that measures how each training sample contributes to a model’s bias rather than its overall loss. It enables identifying which specific training examples are responsible for learned biases, providing both interpretability and a principled selection mechanism for debiasing via unlearning.
Given a bias measurement B(θ̂) (e.g., counterfactual bias, demographic parity gap), the influence of removing training sample z_k on the bias is: I_{up,bias}(z_k, B(θ̂)) = dB(θ̂_{ε,z_k})/dε |_{ε=0} = -∇_θ B(θ̂)ᵀ H_θ̂⁻¹ ∇_θ L(z_k, θ̂), where θ̂_{ε,z_k} are the parameters after upweighting z_k by ε, and H_θ̂ is the Hessian of the training loss at θ̂.
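The formula can be sketched concretely for a logistic-regression (last-layer) model. This is a minimal illustration, not the paper's implementation: the demographic-parity-gap bias measure, the `influence_on_bias` name, and the damping term are assumptions made here for a self-contained example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def influence_on_bias(theta, X, y, groups, k, damping=1e-3):
    """I_{up,bias}(z_k) = -grad_B^T H^{-1} grad_L(z_k) for logistic regression.

    Assumed bias measure B(theta): demographic parity gap, i.e. the
    difference in mean predicted probability between groups 1 and 0.
    """
    n, d = X.shape
    p = sigmoid(X @ theta)
    w = p * (1.0 - p)                       # dp/dz for each sample

    # Hessian of mean log-loss: (1/n) X^T diag(p(1-p)) X, damped for invertibility
    H = (X * w[:, None]).T @ X / n + damping * np.eye(d)

    # grad_B: since dp_i/dtheta = p_i(1-p_i) x_i, the gap's gradient is the
    # difference of group-wise means of w_i * x_i
    g1, g0 = groups == 1, groups == 0
    grad_B = (X[g1] * w[g1][:, None]).mean(axis=0) \
           - (X[g0] * w[g0][:, None]).mean(axis=0)

    # grad of log-loss at sample k: (p_k - y_k) x_k
    grad_L = (p[k] - y[k]) * X[k]

    # I_{up,bias}(z_k) = -grad_B^T H^{-1} grad_L
    return float(-grad_B @ np.linalg.solve(H, grad_L))
```

A positive return value indicates that removing z_k would reduce the demographic parity gap, matching the sign convention below.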
Key Details
- Decomposition: Product of (bias sensitivity to parameters) × (parameter sensitivity to sample)
- Sign interpretation: Positive I_{up,bias} → removing z_k decreases bias (z_k is harmful); negative → helpful
- Extensible: B(θ̂) can be counterfactual bias, demographic parity gap, or equal opportunity difference
- Computational cost: Naively requires one inverse-Hessian-vector product per sample (via implicit HVPs); precomputing s = H_θ̂⁻¹ ∇_θ B(θ̂) once reduces each sample's influence to a single dot product -sᵀ ∇_θ L(z_k, θ̂)
- Deep networks: Applied to the final classifier layer only, treating earlier layers as a fixed feature extractor, so that θ̂ is approximately a local optimum and H_θ̂ is well conditioned
- The formula shares computation with standard influence functions I_{up,params}(z_k) = -H_θ̂⁻¹ ∇_θ L(z_k, θ̂), since I_{up,bias}(z_k) = ∇_θ B(θ̂)ᵀ I_{up,params}(z_k)
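The precomputation trick mentioned above can be sketched in a few lines. Here `H`, `grad_B`, and the stacked per-sample gradients are assumed to be given (e.g., computed as in a last-layer setup); the function name is illustrative.

```python
import numpy as np

def all_influences(H, grad_B, per_sample_grads):
    """Vectorized I_{up,bias} for every training sample.

    H:                (d, d) Hessian of the training loss at theta_hat
    grad_B:           (d,)   gradient of the bias measure B(theta_hat)
    per_sample_grads: (n, d) rows are grad_L(z_k, theta_hat)

    Precomputing s = H^{-1} grad_B turns n Hessian solves into one;
    each sample's influence is then a single dot product -s . grad_L(z_k).
    """
    s = np.linalg.solve(H, grad_B)   # shared inverse-Hessian-vector product
    return -per_sample_grads @ s     # shape (n,): I_{up,bias}(z_k) for all k
```

Sorting the resulting vector then directly ranks training samples by how much their removal would reduce the bias, which is the selection mechanism for debiasing via unlearning.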