The data processing inequality (DPI) for Renyi divergence states that processing data through a Markov kernel (channel) can never increase the Renyi divergence between distributions. This is the fundamental monotonicity property that enables privacy amplification arguments in certified approximate unlearning and differential privacy.
Formal Statement
Theorem 1 (Van Erven & Harremoës, 2014, for simple orders): For any simple order alpha (i.e., alpha > 0, alpha != 1) and any sub-sigma-algebra G subset of F:
D_alpha(P|_G || Q|_G) ≤ D_alpha(P || Q)
Theorem 9 (Van Erven & Harremoës, 2014, for all orders): For any order alpha in [0, infinity] and any sub-sigma-algebra G subset of F:
D_alpha(P|_G || Q|_G) ≤ D_alpha(P || Q)
Equivalently, if X → Y is a Markov chain with transition probabilities A(Y|X), then:
D_alpha(P_Y || Q_Y) ≤ D_alpha(P_X || Q_X)
for all alpha in [0, infinity].
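In the finite (discrete) case the inequality can be checked numerically. The sketch below is illustrative (the function name `renyi_divergence`, the distributions `p_x`, `q_x`, and the channel matrix `A` are all made up for this example): both distributions are pushed through a row-stochastic channel and the divergence is compared before and after.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) for discrete distributions (alpha > 0, alpha != 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

# Two distributions over three outcomes, and a row-stochastic channel A(y|x).
p_x = np.array([0.7, 0.2, 0.1])
q_x = np.array([0.1, 0.3, 0.6])
A = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])

p_y, q_y = p_x @ A, q_x @ A  # push both distributions through the channel

for alpha in (0.5, 2.0, 10.0):
    d_x = renyi_divergence(p_x, q_x, alpha)
    d_y = renyi_divergence(p_y, q_y, alpha)
    assert d_y <= d_x + 1e-12  # DPI: processing never increases D_alpha
```

The assertion holds for every simple order alpha, whatever channel matrix is chosen, which is exactly the content of the theorem.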
Proof Sketch
The proof for simple orders (Theorem 1 in VANERVEN2014) follows from the interpretation of conditioning on G as coarsening the partition of the sample space, together with Proposition 1 (relating conditional densities to conditional expectations). The key steps are:
- For alpha > 1, applying Jensen’s inequality to the conditional expectation gives (1/(alpha-1)) * ln(integral of (dP_tilde/dQ)^alpha dQ) >= (1/(alpha-1)) * ln(integral of (E[dP_tilde/dQ | G])^alpha dQ), using the convexity of x → x^alpha (here P_tilde denotes the part of P that is absolutely continuous with respect to Q). For 0 < alpha < 1 the map x → x^alpha is concave, so Jensen flips the inner inequality, but the prefactor 1/(alpha-1) is negative and the overall inequality points the same way.
- The extension to all orders alpha in [0, infinity] (Theorem 9) uses the monotonicity of D_alpha in alpha (Theorem 3), continuity at the extended orders, and the fact that the DPI holds for all simple orders.
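The coarsening view behind Theorem 1 can be illustrated directly: on a finite sample space, restricting to a sub-sigma-algebra amounts to merging outcomes, and the merged distributions have smaller (or equal) Renyi divergence. A minimal sketch (the distributions and the choice of merged cells are hypothetical):

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) for discrete distributions (alpha > 0, alpha != 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])

# Coarsening the partition (a sub-sigma-algebra on a finite space)
# means merging outcomes: here {0,1} and {2,3} become two cells.
p_coarse = np.array([p[0] + p[1], p[2] + p[3]])
q_coarse = np.array([q[0] + q[1], q[2] + q[3]])

for alpha in (0.5, 2.0):
    assert (renyi_divergence(p_coarse, q_coarse, alpha)
            <= renyi_divergence(p, q, alpha) + 1e-12)
```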
The name “data processing inequality” comes from the information-theoretic interpretation: if Y = f(X) is any deterministic or randomized function of X, then Y carries at most as much information about the distinction between P and Q as X does. “Processing” the data can only destroy information.
Role in Koloskova et al. (2025)
In KOLOSKOVA2025, the DPI for Renyi divergence is used in the gradient clipping analysis (Lemma A.7 and surrounding arguments):
- Noise addition step: When Gaussian noise Z_t is added to the clipped gradient, the model distributions at step t+1 are obtained by pushing mu_t and mu'_t through the Gaussian noise kernel K_t, and the DPI gives D_alpha(mu_t K_t || mu'_t K_t) ≤ D_alpha(mu_t || mu'_t), so the noise step does not increase the divergence.
- Projection step: When model parameters are projected onto a convex set W (e.g., via model clipping), this is a deterministic post-processing step. By the DPI, D_alpha(Pi_W(mu) || Pi_W(mu')) ≤ D_alpha(mu || mu'), so projection cannot increase divergence.
- Composition across iterations: The DPI is applied iteratively across the T SGD steps, combined with the shift-reduction lemma, which controls how much the gradient clipping or model clipping reduces the divergence at each step.
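The noise-addition step can be illustrated with the standard closed form for the Renyi divergence between two Gaussians of equal variance, D_alpha(N(m1, s^2) || N(m2, s^2)) = alpha*(m1 - m2)^2 / (2*s^2): convolving both distributions with an independent Gaussian is a Markov kernel, the variances add, and the divergence shrinks. A sketch with illustrative parameter values (none of the numbers come from the paper):

```python
def renyi_gaussian(m1, m2, var, alpha):
    """Closed-form D_alpha(N(m1, var) || N(m2, var)) = alpha*(m1-m2)^2/(2*var)."""
    return alpha * (m1 - m2) ** 2 / (2.0 * var)

# Illustrative values: mean gap delta, pre-noise variance var0, added noise variance.
delta, var0, noise_var, alpha = 1.0, 0.5, 1.0, 4.0

d_before = renyi_gaussian(0.0, delta, var0, alpha)
# Convolving both distributions with N(0, noise_var) is a Markov kernel;
# for Gaussians the variances simply add, so the divergence can only shrink.
d_after = renyi_gaussian(0.0, delta, var0 + noise_var, alpha)
assert d_after <= d_before
```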
The DPI for Renyi divergence is also the foundation of Renyi differential privacy (RDP): any post-processing of an (alpha, epsilon)-RDP mechanism preserves the (alpha, epsilon)-RDP guarantee, which follows directly from the DPI.
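As a sketch of this post-processing property (the threshold map and all parameter values are illustrative, not from any paper): a Gaussian mechanism with sensitivity delta and noise scale sigma satisfies D_alpha ≤ alpha*delta^2/(2*sigma^2), and thresholding its real-valued output into a bit is deterministic post-processing, so the Renyi divergence of the resulting Bernoulli distributions must stay below that bound.

```python
import math

def renyi_bernoulli(p, q, alpha):
    """D_alpha between Bernoulli(p) and Bernoulli(q)."""
    return math.log(p**alpha * q**(1.0 - alpha)
                    + (1.0 - p)**alpha * (1.0 - q)**(1.0 - alpha)) / (alpha - 1.0)

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

sigma, delta, alpha = 1.0, 1.0, 3.0
eps_rdp = alpha * delta**2 / (2.0 * sigma**2)  # RDP bound for the Gaussian mechanism

# Post-process the mechanism's real-valued output with y = 1{x > t} (deterministic).
t = 0.5
p = 1.0 - norm_cdf((t - 0.0) / sigma)    # P(Y = 1) when the true value is 0
q = 1.0 - norm_cdf((t - delta) / sigma)  # P(Y = 1) when the true value is delta
assert renyi_bernoulli(p, q, alpha) <= eps_rdp  # post-processing preserves the bound
```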
Relation to DPI for Other Divergences
The data processing inequality holds for all f-divergences (including KL divergence, total variation, chi-squared divergence, and hockey-stick divergence). For the hockey-stick divergence, a stronger “contraction” version is available: the contraction coefficient of a kernel K for the E_gamma divergence gives a multiplicative bound E_gamma(PK || QK) ≤ c(K) · E_gamma(P || Q) with c(K) ≤ 1, and c(K) is strictly less than 1 for, e.g., Gaussian kernels restricted to a bounded domain.
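For the special case gamma = 1, E_gamma reduces to total variation and the contraction coefficient is the classical Dobrushin coefficient of the kernel, which can be checked numerically. A sketch with illustrative distributions and channel (all names and values are made up for this example):

```python
import numpy as np

def e_gamma(p, q, gamma):
    """Hockey-stick divergence E_gamma(P || Q) = sum_x max(p(x) - gamma*q(x), 0)."""
    return np.maximum(np.asarray(p, float) - gamma * np.asarray(q, float), 0.0).sum()

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
A = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])  # row-stochastic kernel

# Dobrushin coefficient: the maximum TV distance between any two rows of the kernel.
dobrushin = max(0.5 * np.abs(A[i] - A[j]).sum()
                for i in range(len(A)) for j in range(len(A)))

tv_before = e_gamma(p, q, 1.0)        # gamma = 1 recovers total variation
tv_after = e_gamma(p @ A, q @ A, 1.0)
assert tv_after <= dobrushin * tv_before + 1e-12  # multiplicative contraction
```

Because the Dobrushin coefficient here is below 1, the kernel does not merely preserve the divergence, it strictly shrinks it, which is the sense in which the contraction version is stronger than the plain DPI.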