Abstract
Many commonly used learning algorithms work by iteratively updating an intermediate solution using one or a few data points in each iteration. Analysis of differential privacy for such algorithms often involves ensuring privacy of each step and then reasoning about the cumulative privacy cost of the algorithm. This is enabled by composition theorems for differential privacy that allow releasing of all the intermediate results. In this work, we demonstrate that for contractive iterations, not releasing the intermediate results strongly amplifies the privacy guarantees. We describe several applications of this new analysis technique to solving convex optimization problems via noisy stochastic gradient descent.
Summary
This paper introduces a fundamentally new approach to analyzing the privacy of iterative learning algorithms: privacy amplification by iteration. Prior to this work, the standard approach to differential privacy analysis of iterative algorithms relied on composition theorems, which assume all intermediate outputs are released and simply accumulate privacy costs additively. Feldman et al. show that when iterations are contractive (i.e., 1-Lipschitz maps) and only the final iterate is released, the cumulative privacy guarantee is substantially stronger than what composition would predict.
The core technical framework is built around the Contractive Noisy Iteration (CNI): given an initial state X_0, a sequence of contractive maps {psi_t}, and noise distributions {zeta_t}, the update rule is X_{t+1} = psi_{t+1}(X_t) + Z_{t+1}. The central insight is that applying contractive maps between noise additions causes the Renyi divergence between the outputs of two CNI processes (started from different initial points x_0 and x_0’) to shrink. The main result (Theorem 22) states that D_alpha^{(z_T)}(X_T || X_T’) ≤ sum_{t=1}^{T} R_alpha(zeta_t, a_t), where the a_t values can be chosen to trade off between contributions from the noise magnitude and the contraction.
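The CNI update rule is simple enough to sketch directly. The following is a minimal numpy illustration, not the paper's code: the contractive maps and Gaussian noise scale are caller-supplied assumptions, and only the final iterate is returned, mirroring the "release only X_T" setting.

```python
import numpy as np

def cni(x0, maps, sigma, rng):
    """Run a Contractive Noisy Iteration: X_{t+1} = psi_{t+1}(X_t) + Z_{t+1}.

    maps  : list of contractive (1-Lipschitz) functions psi_1, ..., psi_T
    sigma : std-dev of the i.i.d. Gaussian noise Z_t ~ N(0, sigma^2 I)
    Only the final iterate X_T is returned (intermediates stay hidden).
    """
    x = np.asarray(x0, dtype=float)
    for psi in maps:
        x = psi(x) + rng.normal(0.0, sigma, size=x.shape)
    return x
```

For instance, `psi(v) = 0.5 * v` is contractive, so running `cni` with several copies of it yields one sample of X_T for that CNI.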
To prove this, the authors introduce a crucial new tool: the shifted Renyi divergence D_alpha^{(z)}(mu || nu), which interpolates between a metric distance (the infinity-Wasserstein distance W_infinity) and the standard Renyi divergence. The proof proceeds by induction, using two key lemmas: the shift-reduction lemma (Lemma 20), which shows how adding noise converts a shifted Renyi divergence into one with a larger shift parameter plus a noise penalty R_alpha(zeta, a); and the contraction-reduces-D_alpha^{(z)} lemma (Lemma 21), which shows that applying contractive maps cannot increase the shifted divergence.
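For reference, the two quantities used above can be written out explicitly. These are transcribed from the paper's definitions as best as this summary's plain-text notation allows, and should be checked against the original:

```latex
% Shifted Renyi divergence: measure mu against nu after moving mu
% by at most z in infinity-Wasserstein distance.
D_\alpha^{(z)}(\mu \,\|\, \nu)
  \;=\; \inf_{\mu' \,:\, W_\infty(\mu', \mu) \le z} D_\alpha(\mu' \,\|\, \nu)

% Noise penalty: worst-case Renyi cost of shifting the noise
% distribution zeta by a vector of norm at most a.
R_\alpha(\zeta, a)
  \;=\; \sup_{\|x\| \le a} D_\alpha\!\left(\zeta * \delta_x \,\|\, \zeta\right)
```

At z = 0 the shifted divergence reduces to the ordinary Renyi divergence, which is why the induction can terminate with a standard RDP guarantee on the final iterate.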
The application to noisy SGD on convex, smooth, L-Lipschitz objectives yields per-person privacy guarantees that are strongest for the data points processed earliest. Specifically, PNSGD satisfies (alpha, alpha*epsilon/(n+1-t))-RDP for its t-th input, where epsilon = 2L^2/sigma^2. The RDP parameter of the point processed at step t thus shrinks in proportion to 1/(n+1-t): early points benefit from the many noisy iterations that follow them, while the last point's guarantee is only that of a single noisy step. The paper also analyzes variants with random stopping (Stop-PNSGD) and random skipping (Skip-PNSGD).
Key Contributions
- Introduction of privacy amplification by iteration as an alternative to privacy amplification by sampling, applicable when the order of data processing need not be random or secret.
- Definition of the shifted Renyi divergence D_alpha^{(z)} as a hybrid between metric and information-theoretic distance measures, enabling the inductive proof technique.
- The shift-reduction lemma and contraction lemma that together enable tracking divergence evolution across contractive noisy iterations.
- Per-person RDP guarantees for noisy SGD: data points processed earlier receive proportionally stronger privacy protection, with the t’th point satisfying (alpha, alpha*epsilon/(n+1-t))-RDP.
- Applications to distributed SGD (no requirement for secret sampling order), multi-query optimization (solving k tasks at the same privacy cost as one), and public/private data settings.
Methodology
The PNSGD algorithm operates as follows: given a dataset S = {x_1, …, x_n}, starting point w_0 in a convex set K, learning rate eta, and noise scale sigma, for t = 0, …, n-1: (1) compute v_{t+1} = w_t - eta * grad_w f(w_t, x_{t+1}) + Z where Z ~ N(0, sigma^2 I_d); (2) project w_{t+1} = Pi_K(v_{t+1}). The key observation is that for beta-smooth convex losses with eta ≤ 2/beta, the gradient step is contractive (Proposition 18), and projection onto a convex set is also contractive (Proposition 17). Thus the composition is a CNI, and Theorem 22 applies directly.
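The two steps above translate directly into code. This is a hedged numpy sketch, not the authors' implementation: the gradient oracle `grad` and the projection `project` onto K are caller-supplied assumptions, and the loop makes a single pass over the data as in the paper's PNSGD.

```python
import numpy as np

def pnsgd(xs, grad, w0, eta, sigma, project, rng):
    """Sketch of Projected Noisy SGD: one pass over the data xs.

    grad(w, x) : gradient of the loss f(w, x); for beta-smooth convex
                 losses with eta <= 2/beta the gradient step is contractive.
    project(v) : Euclidean projection onto the convex set K (contractive).
    Only the final iterate is released.
    """
    w = np.asarray(w0, dtype=float)
    for x in xs:
        v = w - eta * grad(w, x) + rng.normal(0.0, sigma, size=w.shape)
        w = project(v)
    return w
```

With `grad(w, x) = w - x` (the gradient of the 1-smooth loss 0.5*||w - x||^2) and `project = lambda v: np.clip(v, -1, 1)`, each iteration is a contraction followed by Gaussian noise, exactly the CNI structure the analysis needs.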
Key Findings
- For identical contractive maps and Gaussian noise (the simplest case), Theorem 1 gives D_alpha(X_T || X_T’) ≤ alpha * ||x_0 - x_0’||^2 / (2T * sigma^2): the divergence decays as 1/T, so the privacy parameter improves linearly with the number of iterations.
- For PNSGD on L-Lipschitz, beta-smooth convex losses: the t’th data point satisfies (alpha, alpha*epsilon/(n+1-t))-RDP with epsilon = 2L^2/sigma^2.
- Stop-PNSGD (random stopping) achieves (alpha, 4*alpha*L^2*ln(n)/(n*sigma^2))-RDP overall, nearly matching amplification-by-sampling bounds.
- The bounds are tight: for identity contractive maps (pure noisy iteration), the worst-case divergence matches the bound exactly.
- The smoothness assumption can be removed by convolving the loss with a Gaussian kernel, at a small additional approximation cost.
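The per-person bound from the findings above is easy to tabulate. The helper below is a hypothetical convenience (not from the paper) that evaluates the stated formula (alpha, alpha*epsilon/(n+1-t))-RDP with epsilon = 2L^2/sigma^2 for every index t:

```python
def per_person_rdp(alpha, L, sigma, n):
    """RDP parameter for each data point under PNSGD, per the paper's bound:
    the t-th point (1-indexed) gets (alpha, alpha * eps / (n + 1 - t))-RDP,
    where eps = 2 * L^2 / sigma^2 is the single-step cost."""
    eps = 2 * L ** 2 / sigma ** 2
    return [alpha * eps / (n + 1 - t) for t in range(1, n + 1)]
```

The list is increasing in t: the first point's parameter is eps*alpha/n (strongest protection) and the last point's is eps*alpha (a single noisy step, no amplification).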
Connection to Unlearning
This paper provides the foundational privacy amplification machinery used in Certified Unlearning for Neural Networks (Koloskova et al., 2025). The shift-reduction lemma and contraction analysis form the template for Koloskova et al.’s extension to non-convex settings via gradient clipping. The key difference is that Koloskova et al. do not require contractivity of the gradient step (which needs convexity), instead enforcing bounded sensitivity through gradient clipping or model clipping.
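The clipping idea can be illustrated in a few lines. This is a hypothetical sketch of a single clipped noisy update in the spirit of that extension, not code from either paper: clipping the gradient to norm at most `clip` bounds the step's sensitivity by eta * clip per differing gradient, with no convexity assumption.

```python
import numpy as np

def clipped_noisy_step(w, g, eta, clip, sigma, rng):
    """One noisy SGD step with gradient clipping: rescale g so that
    ||g|| <= clip, then take a noisy gradient step. The clipping, rather
    than contractivity, bounds how much one data point can move w."""
    norm = np.linalg.norm(g)
    if norm > clip:
        g = g * (clip / norm)
    return w - eta * g + rng.normal(0.0, sigma, size=w.shape)
```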