SCRUB (SCalable Remembering and Unlearning unBound) is a deep machine unlearning algorithm proposed in Kurmanji et al. (2023). It casts unlearning as a teacher-student problem where the original (pre-unlearning) model serves as the teacher and the unlearned model is the student. The student selectively obeys the teacher: matching it on retained data while diverging from it on forget data. This approach is termed “unbounded” because it does not depend on the limiting assumptions (small forget sets, stability of SGD, closeness in weight space) that constrain prior certified methods.
The training objective combines three terms: (1) minimize KL divergence between student and teacher outputs on the retain set (imitate the teacher on data to remember), (2) maximize KL divergence on the forget set (disobey the teacher on data to forget), and (3) minimize cross-entropy loss on the retain set (maintain classification performance). Formally:
min_{w^u}  (alpha / N_r) * sum_{x in D_r} D_KL( p(f(x; w^o)) || p(f(x; w^u)) )
         + (gamma / N_r) * sum_{(x, y) in D_r} l(f(x; w^u), y)
         - (1 / N_f)     * sum_{x in D_f} D_KL( p(f(x; w^o)) || p(f(x; w^u)) )
The min-max nature of this objective (simultaneously pulling toward and pushing away from the teacher) creates optimization challenges similar to GANs. SCRUB addresses this by alternating between max-steps (ascending on the forget set) and min-steps (descending on the retain set), with additional min-steps at the end to restore any utility loss.
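As a concrete sketch of the three-term objective, the per-batch loss might be computed in PyTorch as below. The function name and the `alpha`/`gamma` defaults are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def scrub_batch_losses(student, teacher, x_r, y_r, x_f, alpha=0.5, gamma=1.0):
    """Combine the three SCRUB terms on a retain batch (x_r, y_r) and a
    forget batch x_f. alpha/gamma weights here are illustrative defaults."""
    with torch.no_grad():                       # the teacher stays frozen
        t_retain = F.softmax(teacher(x_r), dim=1)
        t_forget = F.softmax(teacher(x_f), dim=1)
    s_retain = student(x_r)
    # (1) imitate the teacher on retained data
    kl_retain = F.kl_div(F.log_softmax(s_retain, dim=1),
                         t_retain, reduction="batchmean")
    # (3) keep classifying retained data correctly
    ce_retain = F.cross_entropy(s_retain, y_r)
    # (2) diverge from the teacher on forget data; maximized, hence
    # negated when folded into a single minimization objective
    kl_forget = F.kl_div(F.log_softmax(student(x_f), dim=1),
                         t_forget, reduction="batchmean")
    return alpha * kl_retain + gamma * ce_retain - kl_forget
```

In practice SCRUB does not optimize this combined loss in one step; the alternating max/min schedule described next handles the two directions separately.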
Algorithm
- Initialize student weights w^u = w^o (teacher weights)
- Repeat until convergence:
  a. If still within the max-step budget, perform DO-MAX-EPOCH: one epoch of gradient ascent on the forget set, maximizing d(x_f; w^u) = D_KL( p(f(x_f; w^o)) || p(f(x_f; w^u)) ) for all x_f in D_f
  b. Perform DO-MIN-EPOCH: one epoch of gradient descent on the retain set, minimizing the KL divergence plus cross-entropy terms on D_r
- SCRUB stops once forget-set error has risen sufficiently without degrading retain-set error
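The steps above can be sketched as a self-contained training loop. Everything below is a minimal sketch under assumed interfaces: the loaders yield `(inputs, labels)` batches, and `msteps` (the max-step budget), `alpha`, `gamma`, and `epochs` are illustrative hyperparameters, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def kl_to_teacher(student, teacher, x):
    """d(x; w^u): KL divergence from teacher to student outputs on a batch."""
    with torch.no_grad():
        t_probs = F.softmax(teacher(x), dim=1)
    return F.kl_div(F.log_softmax(student(x), dim=1), t_probs,
                    reduction="batchmean")

def scrub(student, teacher, retain_loader, forget_loader, opt,
          alpha=0.5, gamma=1.0, msteps=3, epochs=5):
    """Alternating max/min schedule; hyperparameter values are illustrative."""
    student.load_state_dict(teacher.state_dict())  # student starts as teacher
    for epoch in range(epochs):
        if epoch < msteps:                         # DO-MAX-EPOCH (in budget)
            for x_f, _ in forget_loader:
                opt.zero_grad()
                # descending on the negated KL performs ascent on d(x_f; w^u)
                (-kl_to_teacher(student, teacher, x_f)).backward()
                opt.step()
        # DO-MIN-EPOCH runs every epoch; epochs past the budget are pure
        # min-steps that restore any lost retain performance
        for x_r, y_r in retain_loader:
            opt.zero_grad()
            loss = (alpha * kl_to_teacher(student, teacher, x_r)
                    + gamma * F.cross_entropy(student(x_r), y_r))
            loss.backward()
            opt.step()
    return student
```

Running max-steps only for the first `msteps` epochs, then finishing with min-steps alone, mirrors the stopping behavior described above: divergence on the forget set is driven up early, and the trailing min-steps repair retain-set accuracy.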
Key Properties
- Application-agnostic: a consistently top performer across the removing-biases (RB), resolving-confusion (RC), and user-privacy (UP) application scenarios
- No restrictive assumptions: unlike NTK-based or Fisher-based methods, it makes no assumptions about the model's closeness to an optimum, the stability of SGD, or the size of the forget set
- Scalable: does not require storing the full training dataset; computational cost does not scale quadratically with dataset size (unlike NTK methods)
- No formal guarantees: SCRUB is empirically validated but does not provide (epsilon, delta)-certified approximate unlearning certificates; developing such guarantees is noted as future work
- Contrastive flavour: the objective simultaneously pulls the student toward the teacher (retain) and pushes it away (forget), similar to contrastive learning
- Empirically outperforms Finetuning, NTK Forgetting, Fisher Forgetting, CF-k, EU-k, Bad-T, and NegGrad+ across CIFAR-10 and Lacuna-10 on ResNet-18 and All-CNN architectures
- Extended by SCRUB+R, which adds a rewinding procedure for improved defense against membership inference attacks (MIA)