Summary

This paper is the definitive reference on Renyi divergence, providing a unified and rigorous treatment of its properties for all orders alpha in [0, infinity] on general measurable spaces. It is the primary source for the Renyi divergence properties used throughout the machine unlearning literature, particularly in Koloskova et al. (2025) and Chien et al. (2024).

The paper begins by extending the definition of Renyi divergence from finite alphabets to arbitrary measurable spaces via two equivalent routes: the integral formula and the discretization supremum. For simple orders alpha in (0, infinity) \ {1}, the Renyi divergence of P from Q is:

D_alpha(P || Q) = (1 / (alpha - 1)) * ln(integral of p^alpha * q^(1-alpha) d mu)

where p and q are densities with respect to any common dominating measure mu. The extended orders 0, 1, and infinity are defined by limits: D_1(P || Q) = D(P || Q) (the KL divergence), D_0(P || Q) = -ln Q(p > 0), and D_infinity(P || Q) = ln ess sup (p/q).
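These definitions translate directly to code on a finite alphabet. A minimal sketch (not from the paper; the function name and conventions are illustrative) covering the simple orders and the extended orders 0, 1, and infinity:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(P || Q) for distributions on a finite alphabet.

    p, q: probability vectors on a common alphabet.
    Simple orders alpha in (0, 1) and (1, inf) use the integral (here: sum)
    formula; the extended orders 0, 1, and infinity use their defining limits.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    if alpha == 0:
        # D_0(P || Q) = -ln Q(p > 0)
        return float(-np.log(q[support].sum()))
    if alpha == 1:
        # KL divergence, with the convention 0 * ln(0/q) = 0
        return float(np.sum(p[support] * np.log(p[support] / q[support])))
    if alpha == np.inf:
        # D_inf(P || Q) = ln ess sup (p/q), i.e. max over the support of P
        return float(np.log(np.max(p[support] / q[support])))
    # simple order: (1/(alpha-1)) * ln sum p^alpha q^(1-alpha)
    s = np.sum(p[support] ** alpha * q[support] ** (1 - alpha))
    return float(np.log(s) / (alpha - 1))
```

Restricting the sum to the support of P implements the standard conventions; for alpha > 1 the result is correctly infinite when P is not absolutely continuous with respect to Q, since 0 is then raised to a negative power.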

Key properties established include: Renyi divergence is nondecreasing in alpha (Theorem 3), nonnegative for all orders (Theorem 8), and satisfies the data processing inequality for all orders alpha in [0, infinity] (Theorems 1 and 9). The paper proves joint convexity of D_alpha in (P, Q) for alpha in [0, 1] (Theorem 11), convexity in Q for all alpha in [0, infinity] (Theorem 12), and joint quasi-convexity for all alpha (Theorem 13). The skew symmetry relation D_alpha(P || Q) = (alpha / (1 - alpha)) * D_{1-alpha}(Q || P) for alpha in (0, 1) (Proposition 2) connects different orders.
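The skew symmetry relation is easy to verify numerically; a quick sanity check for Bernoulli distributions (the helper below is illustrative, not from the paper):

```python
import math

def renyi_bernoulli(p, q, alpha):
    # D_alpha(Bernoulli(p) || Bernoulli(q)) for a simple order alpha != 1
    s = p ** alpha * q ** (1 - alpha) + (1 - p) ** alpha * (1 - q) ** (1 - alpha)
    return math.log(s) / (alpha - 1)

# Skew symmetry: D_alpha(P || Q) = (alpha / (1 - alpha)) * D_{1-alpha}(Q || P)
p, q, alpha = 0.3, 0.7, 0.25
lhs = renyi_bernoulli(p, q, alpha)
rhs = (alpha / (1 - alpha)) * renyi_bernoulli(q, p, 1 - alpha)
assert abs(lhs - rhs) < 1e-12
```

The identity follows from the formula itself: both sides reduce to (1/(alpha - 1)) * ln of the same sum p^alpha * q^(1-alpha).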

For Gaussian distributions, the Renyi divergence between N(mu_0, sigma_0^2) and N(mu_1, sigma_1^2) has the explicit formula: D_alpha = alpha * (mu_1 - mu_0)^2 / (2 * sigma_alpha^2) + (1 / (1 - alpha)) * ln(sigma_alpha / (sigma_0^{1-alpha} * sigma_1^alpha)), where sigma_alpha^2 = (1 - alpha) * sigma_0^2 + alpha * sigma_1^2 (eq. 10).
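Eq. (10) translates directly into code; a minimal sketch for the simple orders alpha != 1, valid only where sigma_alpha^2 > 0 (the function name is illustrative):

```python
import math

def renyi_gaussian(mu0, sigma0, mu1, sigma1, alpha):
    """D_alpha(N(mu0, sigma0^2) || N(mu1, sigma1^2)) per eq. (10),
    for a simple order alpha != 1 with sigma_alpha^2 > 0."""
    var_alpha = (1 - alpha) * sigma0 ** 2 + alpha * sigma1 ** 2
    assert var_alpha > 0, "outside this range the divergence is infinite"
    mean_term = alpha * (mu1 - mu0) ** 2 / (2 * var_alpha)
    var_term = math.log(
        math.sqrt(var_alpha) / (sigma0 ** (1 - alpha) * sigma1 ** alpha)
    ) / (1 - alpha)
    return mean_term + var_term
```

For equal variances sigma_0 = sigma_1 = sigma the variance term vanishes and the formula collapses to alpha * (mu_1 - mu_0)^2 / (2 * sigma^2), the expression that drives Gaussian-mechanism RDP accounting.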

The Pythagorean inequality is generalized from KL divergence to all alpha in (0, infinity) (Theorem 14): for any P in an alpha-convex set of distributions, with P* the alpha-information projection of Q onto that set, D_alpha(P || Q) >= D_alpha(P || P*) + D_alpha(P* || Q).

The paper also establishes a generalization of Pinsker's inequality to all orders alpha in (0, 1]: (alpha / 2) * V^2(P, Q) <= D_alpha(P || Q), and relates Renyi divergence to Chernoff information (Theorems 30 and 32).
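The generalized Pinsker inequality can be spot-checked numerically; a sketch on Bernoulli pairs, with V taken as the sum of absolute probability differences (the helper names are illustrative):

```python
import math

def renyi_bernoulli(p, q, alpha):
    # D_alpha(Bernoulli(p) || Bernoulli(q)) for a simple order alpha != 1
    s = p ** alpha * q ** (1 - alpha) + (1 - p) ** alpha * (1 - q) ** (1 - alpha)
    return math.log(s) / (alpha - 1)

def total_variation(p, q):
    # V(P, Q) = sum_x |P(x) - Q(x)| for Bernoulli parameters
    return abs(p - q) + abs((1 - p) - (1 - q))

# Check (alpha / 2) * V^2 <= D_alpha on a small grid of alpha in (0, 1)
for alpha in (0.25, 0.5, 0.75):
    for p in (0.1, 0.3, 0.5, 0.9):
        for q in (0.2, 0.5, 0.8):
            v = total_variation(p, q)
            assert (alpha / 2) * v ** 2 <= renyi_bernoulli(p, q, alpha) + 1e-12
```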

Relevance to Koloskova et al. (2025)

This paper provides the theoretical foundations for the Renyi divergence tools used in Koloskova et al.’s unlearning analysis:

  1. Lemma A.7 (Koloskova) uses the data processing inequality for Renyi divergence (Theorem 9 here): D_alpha(P_Y || Q_Y) <= D_alpha(P_X || Q_X) when Y is obtained by processing X through a Markov kernel. This is applied to show that each noisy SGD iteration does not increase the Renyi divergence between adjacent model distributions.

  2. The relationship between Renyi divergence and KL divergence (Theorem 5 here; D_1 equals the KL divergence) underpins the conversion from Renyi divergence bounds to (epsilon, delta)-DP guarantees via the standard RDP-to-DP conversion.

  3. The Gaussian Renyi divergence formula (eq. 10) is used when analyzing the noise addition steps in the gradient clipping algorithm.

  4. The Renyi unlearning definition used in Chien et al. (2024) and related works is built directly on the Renyi divergence as defined here.
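The RDP-to-DP conversion mentioned in item 2 is not from this paper; a common form of it (the lemma from Mironov's RDP paper), combined with the equal-variance case of eq. (10), can be sketched as:

```python
import math

def rdp_to_dp(alpha, rdp_eps, delta):
    """Standard RDP-to-(epsilon, delta)-DP conversion (Mironov, 2017):
    an (alpha, rdp_eps)-RDP mechanism satisfies
    (rdp_eps + ln(1/delta) / (alpha - 1), delta)-DP."""
    assert alpha > 1 and 0 < delta < 1
    return rdp_eps + math.log(1 / delta) / (alpha - 1)

def gaussian_mechanism_rdp(alpha, sensitivity, sigma):
    """RDP of the Gaussian mechanism, from eq. (10) with equal variances:
    D_alpha(N(0, sigma^2) || N(s, sigma^2)) = alpha * s^2 / (2 * sigma^2)."""
    return alpha * sensitivity ** 2 / (2 * sigma ** 2)
```

In practice one minimizes rdp_to_dp(alpha, gaussian_mechanism_rdp(alpha, s, sigma), delta) over a grid of alpha values to get the tightest (epsilon, delta) guarantee.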
