Abstract

We address the problem of machine unlearning, where the goal is to remove the influence of specific training data from a model upon request, motivated by privacy and regulatory requirements such as the “right to be forgotten.” Existing methods, however, rely on restrictive assumptions or lack formal guarantees. To address this, we propose a novel method for certified machine unlearning, leveraging the connection between unlearning and privacy amplification by stochastic post-processing. Our method uses noisy fine-tuning on the retain data, i.e., data that does not need to be removed, to ensure provable unlearning guarantees. This approach requires no assumptions about the underlying loss function, making it broadly applicable across diverse settings. We analyze the theoretical trade-offs in efficiency and accuracy and demonstrate empirically that our method not only achieves formal unlearning guarantees but also performs effectively in practice, outperforming existing baselines.

Summary

This paper introduces a new framework for certified approximate machine unlearning that is the first to provide provable (epsilon, delta)-unlearning guarantees for non-convex tasks such as deep neural networks without requiring smoothness or convexity assumptions on the loss function. The core insight is to interpret noisy fine-tuning steps on the retain data as stochastic post-processing operations, leveraging the theory of privacy amplification by iteration from differential privacy. Each step of noisy SGD on the retained data progressively amplifies privacy with respect to the forgotten data, allowing the required noise magnitude to decrease as more iterations are performed.
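As background, the amplification-by-iteration machinery rests on the shifted Renyi divergence of Feldman et al. (2018); one standard form of the shift-reduction step from that work (which the paper's non-convex analysis generalizes, so constants may differ) reads:

```latex
% Shifted Renyi divergence: the best Renyi divergence achievable after
% moving mu by at most z in infinity-Wasserstein distance.
D_\alpha^{(z)}(\mu \,\|\, \nu) = \inf_{\mu' :\, W_\infty(\mu', \mu) \le z} D_\alpha(\mu' \,\|\, \nu)

% Shift reduction: convolving both distributions with Gaussian noise
% absorbs a shift of a at a cost of alpha * a^2 / (2 sigma^2).
D_\alpha^{(z)}\bigl(\mu * \mathcal{N}(0, \sigma^2 I) \,\|\, \nu * \mathcal{N}(0, \sigma^2 I)\bigr)
\le D_\alpha^{(z+a)}(\mu \,\|\, \nu) + \frac{\alpha a^2}{2\sigma^2}
```

Iterating this bound over T noisy steps is what lets the divergence with respect to the forget data shrink as fine-tuning proceeds.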

The authors propose two concrete algorithmic variants: gradient clipping and model clipping. In gradient clipping, gradients are clipped to a bounded norm before applying updates with Gaussian noise, ensuring each update acts as a bounded-sensitivity stochastic post-processing step. In model clipping, the entire model update is projected onto a ball of fixed radius after each step. Both methods are initialized from the original trained model (clipped to a bounded region) and fine-tuned exclusively on the retain set. The theoretical analysis extends the privacy amplification by iteration framework of Feldman et al. (2018) and Balle et al. (2019) from convex to non-convex settings, using a novel shift-reduction lemma to control Renyi divergence growth across iterations.
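A minimal sketch of the two update rules, assuming plain L2 clipping and spherical Gaussian noise; the function names and the exact placement of the noise term are illustrative, not the paper's reference implementation:

```python
import numpy as np

def clip_to_ball(v, radius):
    """Project a vector onto the L2 ball of the given radius."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def gradient_clipping_step(w, grad, lr, clip_c1, sigma, rng):
    """Noisy SGD step with the gradient clipped to norm clip_c1,
    so each update is a bounded-sensitivity stochastic post-processing."""
    g = clip_to_ball(grad, clip_c1)
    noise = rng.normal(0.0, sigma, size=w.shape)
    return w - lr * g + noise

def model_clipping_step(w, grad, lr, clip_c2, sigma, rng):
    """Noisy SGD step followed by projecting the whole model
    onto a ball of radius clip_c2."""
    noise = rng.normal(0.0, sigma, size=w.shape)
    return clip_to_ball(w - lr * grad + noise, clip_c2)
```

Gradient clipping bounds how far a single update can move the model; model clipping instead bounds where the model can be after the update, which is what makes the per-iteration noise independent of the initial model norm.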

Experiments on MNIST and CIFAR-10 (and transfer learning on CIFAR-100 with ResNet-18) demonstrate that both gradient clipping and model clipping achieve target (1, 10^{-5})-unlearning while consistently outperforming output perturbation and retraining-from-scratch baselines in terms of compute efficiency, achieving up to 50% savings in required training epochs for a given accuracy target. The method also significantly outperforms DP-SGD (group privacy) in the transfer learning setting.

Key Contributions

  • First certified unlearning method for non-convex tasks (including deep learning) that imposes no assumptions on loss function smoothness or convexity.
  • Two algorithmic variants — gradient clipping for unlearning and model clipping for unlearning — with formal (epsilon, delta)-unlearning guarantees derived from privacy amplification by iteration.
  • Theoretical comparison showing that iterative noisy fine-tuning substantially reduces the required noise magnitude per iteration compared to one-shot output perturbation for unlearning.
  • Empirical validation on MNIST, CIFAR-10, and CIFAR-100 demonstrating practical effectiveness with up to 50% compute savings over retraining from scratch.

Methodology

The framework operates in three phases: (1) clip the original trained model to a bounded region (radius C_0); (2) perform T iterations of noisy SGD on the retain set with either gradient clipping (radius C_1) or model clipping (radius C_2), adding Gaussian noise at each step; (3) optionally continue fine-tuning without noise to recover accuracy. The unlearning guarantee follows from showing that the output distribution of this process is (epsilon, delta)-indistinguishable from that of a reference algorithm that never saw the forget data, in the sense of certified approximate unlearning, the formal (epsilon, delta) framework inspired by differential privacy. The proof tracks privacy amplification across iterations via Renyi divergence and the hockey-stick divergence.
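The three phases can be sketched end to end as follows; `grad_fn` stands in for a stochastic gradient oracle on the retain set, and all names and defaults are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def clip_to_ball(v, radius):
    """Project v onto the L2 ball of the given radius."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def unlearn_by_noisy_finetune(w0, grad_fn, T, lr, c0, c1, sigma, T_clean=0, seed=0):
    """Three-phase sketch (gradient-clipping variant):
    (1) clip the trained model to radius c0;
    (2) T noisy SGD steps on the retain set, gradients clipped to c1;
    (3) T_clean optional noiseless fine-tuning steps to recover accuracy."""
    rng = np.random.default_rng(seed)
    w = clip_to_ball(np.asarray(w0, dtype=float), c0)     # phase 1
    for _ in range(T):                                    # phase 2
        g = clip_to_ball(grad_fn(w), c1)
        w = w - lr * g + rng.normal(0.0, sigma, size=w.shape)
    for _ in range(T_clean):                              # phase 3 (no noise)
        w = w - lr * grad_fn(w)
    return w
```

Only phase 2 contributes to the formal guarantee; phase 3 is pure post-processing on the retain set and therefore cannot weaken it.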

Key Findings

  • Without regularization, gradient clipping requires noise variance proportional to (C_0 + C_1 * gamma * T)^2 / (epsilon^2 * T), which decreases as T grows while the C_0 term dominates (i.e., for T up to roughly C_0 / (C_1 * gamma)).
  • With L2 regularization, the dependence on the initial clipping radius C_0 decays exponentially in T, enabling logarithmic iteration counts.
  • Model clipping achieves (epsilon, delta)-unlearning in T = O(log(1/delta)) iterations with noise independent of the initial model norm, reducing per-iteration noise by a factor of C_0^2 / C_2^2 compared to output perturbation.
  • On CIFAR-10, both methods reach target accuracies 20-50% faster than retraining from scratch.
  • In transfer learning (ResNet-18 on CIFAR-100), gradient clipping achieves 60% accuracy in 32 epochs vs. 34 for retrain, while DP-SGD never exceeds 20% within the 50-epoch budget.
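To make the first scaling concrete, a tiny helper (hypothetical name, constants dropped) evaluates the stated proportionality (C_0 + C_1 * gamma * T)^2 / (epsilon^2 * T); in the regime where C_0 dominates, more fine-tuning iterations indeed mean less noise per iteration:

```python
def per_iteration_noise_variance(c0, c1, gamma, num_iters, eps):
    """Per-iteration noise variance, up to constants, for the
    gradient-clipping variant without regularization (as stated above)."""
    return (c0 + c1 * gamma * num_iters) ** 2 / (eps ** 2 * num_iters)

# In the C_0-dominated regime (num_iters well below c0 / (c1 * gamma)),
# the required variance shrinks as the iteration count grows.
variances = [per_iteration_noise_variance(100.0, 1.0, 0.01, t, 1.0)
             for t in range(1, 51)]
```

The expression is minimized around T of order C_0 / (C_1 * gamma); beyond that the C_1 * gamma * T term takes over, which is consistent with the role of regularization above in shrinking the C_0 dependence.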

Important References

  • The Algorithmic Foundations of Differential Privacy — Foundational (epsilon, delta)-differential privacy framework that inspires the unlearning definition used in this paper.
  • Deep Learning with Differential Privacy — Introduces DP-SGD with gradient clipping and noise addition; the gradient clipping approach in this paper extends DP-SGD ideas to the unlearning setting.
  • Machine Unlearning — Bourtoule et al. (2021) provide the SISA framework for exact unlearning via sharded retraining; this paper’s approximate approach offers an alternative with formal certificates.
