Abstract
Renyi divergence is related to Renyi entropy much like Kullback-Leibler divergence is related to Shannon’s entropy, and comes up in many settings. It was introduced by Renyi as a measure of information that satisfies almost the same axioms as Kullback-Leibler divergence, and depends on a parameter that is called its order. In particular, the Renyi divergence of order 1 equals the Kullback-Leibler divergence. We review and extend the most important properties of Renyi divergence and Kullback-Leibler divergence, including convexity, continuity, limits of sigma-algebras and the relation of the special order 0 to the Gaussian dichotomy and contiguity. We also show how to generalize the Pythagorean inequality to orders different from 1, and we extend the known equivalence between channel capacity and minimax redundancy to continuous channel inputs (for all orders) and present several other minimax results.
Summary
This paper is the definitive reference on Renyi divergence, providing a unified and rigorous treatment of its properties for all orders alpha in [0, infinity] on general measurable spaces. It is the primary source for the Renyi divergence properties used throughout the machine unlearning literature, particularly in Koloskova et al. (2025) and Chien et al. (2024).
The paper begins by extending the definition of Renyi divergence from finite alphabets to arbitrary measurable spaces via two routes that turn out to be equivalent: the integral formula below and the supremum over all finite discretizations (partitions) of the space. For simple orders alpha in (0, infinity) \ {1}, the Renyi divergence of P from Q is:
D_alpha(P || Q) = (1 / (alpha - 1)) * ln(integral of p^alpha * q^(1-alpha) d mu)
where p and q are densities with respect to any common dominating measure mu. The extended orders 0, 1, and infinity are defined by limits: D_1(P || Q) = D(P || Q) (the KL divergence), D_0(P || Q) = -ln Q(p > 0), and D_infinity(P || Q) = ln ess sup (p/q).
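The finite-alphabet case of these definitions is easy to sketch in NumPy. This is an illustrative implementation, not code from the paper, and the function name is mine:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(P || Q) in nats for finite distributions.

    Covers the simple orders alpha in (0, 1) and (1, infinity) plus the
    extended orders 0, 1 (KL divergence), and infinity; assumes P << Q
    where the formulas require it.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if alpha == 0:
        return -np.log(q[mask].sum())                  # D_0 = -ln Q(p > 0)
    if alpha == 1:
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))  # KL divergence
    if alpha == np.inf:
        return np.log(np.max(p[mask] / q[mask]))       # ln ess sup (p/q)
    # simple orders: (1/(alpha-1)) * ln sum p^alpha q^(1-alpha)
    return np.log(np.sum(p[mask] ** alpha * q[mask] ** (1 - alpha))) / (alpha - 1)
```

A quick sanity check: for P = (1/2, 1/2) and Q = (1/4, 3/4), D_infinity = ln 2 and D_0 = 0 (P and Q share full support), while orders near 1 approach the KL divergence, consistent with the limits above.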
Key properties established include: Renyi divergence is nondecreasing in alpha (Theorem 3), nonnegative for all orders (Theorem 8), and satisfies the data processing inequality for all orders alpha in [0, infinity] (Theorems 1 and 9). The paper proves joint convexity of D_alpha in (P, Q) for alpha in [0, 1] (Theorem 11), convexity in Q for all alpha in [0, infinity] (Theorem 12), and joint quasi-convexity for all alpha (Theorem 13). The skew symmetry relation D_alpha(P || Q) = (alpha / (1 - alpha)) * D_{1-alpha}(Q || P) for alpha in (0, 1) (Proposition 2) connects different orders.
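Two of these properties, monotonicity in the order and skew symmetry, can be verified numerically on a small example. An illustrative NumPy sketch (not from the paper):

```python
import numpy as np

def renyi(p, q, alpha):
    # Simple-order formula; assumes 0 < alpha != 1 and full common support.
    return np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# Theorem 3: D_alpha(P || Q) is nondecreasing in alpha.
alphas = [0.25, 0.5, 0.75, 1.5, 2.0, 4.0]
values = [renyi(p, q, a) for a in alphas]
assert all(v1 <= v2 + 1e-12 for v1, v2 in zip(values, values[1:]))

# Proposition 2 (skew symmetry): D_a(P||Q) = (a/(1-a)) * D_{1-a}(Q||P).
a = 0.3
lhs = renyi(p, q, a)
rhs = (a / (1 - a)) * renyi(q, p, 1 - a)
assert abs(lhs - rhs) < 1e-12
```

Skew symmetry holds exactly here because both sides reduce to (1/(a-1)) * ln sum p^a q^(1-a).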
For Gaussian distributions, the Renyi divergence between N(mu_0, sigma_0^2) and N(mu_1, sigma_1^2) has the explicit formula: D_alpha = alpha * (mu_1 - mu_0)^2 / (2 * sigma_alpha^2) + (1 / (1 - alpha)) * ln(sigma_alpha / (sigma_0^{1-alpha} * sigma_1^alpha)), where sigma_alpha^2 = (1 - alpha) * sigma_0^2 + alpha * sigma_1^2 (eq. 10).
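This closed form can be sketched directly (illustrative NumPy code; function names are mine). Two sanity checks: letting alpha approach 1 recovers the Gaussian KL divergence, and for equal variances the formula collapses to alpha * (mu_1 - mu_0)^2 / (2 * sigma^2), the expression used throughout Gaussian-mechanism RDP analyses:

```python
import numpy as np

def renyi_gaussian(mu0, s0, mu1, s1, alpha):
    """Closed-form D_alpha(N(mu0, s0^2) || N(mu1, s1^2)) per eq. (10),
    valid when sigma_alpha^2 = (1 - alpha)*s0^2 + alpha*s1^2 > 0."""
    var_a = (1 - alpha) * s0 ** 2 + alpha * s1 ** 2
    assert var_a > 0, "formula requires sigma_alpha^2 > 0"
    sigma_a = np.sqrt(var_a)
    return (alpha * (mu1 - mu0) ** 2 / (2 * var_a)
            + np.log(sigma_a / (s0 ** (1 - alpha) * s1 ** alpha)) / (1 - alpha))

def kl_gaussian(mu0, s0, mu1, s1):
    # Standard Gaussian KL divergence, for the alpha -> 1 comparison.
    return np.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5

# Equal variances: reduces to alpha * (mu_1 - mu_0)^2 / (2 * sigma^2).
assert abs(renyi_gaussian(0, 1, 1, 1, 2.0) - 1.0) < 1e-12

# alpha near 1 approaches the KL divergence.
assert abs(renyi_gaussian(0, 1, 1, 2, 0.999) - kl_gaussian(0, 1, 1, 2)) < 1e-2
```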
The Pythagorean inequality is generalized from KL divergence to all alpha in (0, infinity) (Theorem 14): for any alpha-convex set of distributions, its alpha-information projection P* of Q, and any P in the set, D_alpha(P || Q) >= D_alpha(P || P*) + D_alpha(P* || Q).
The paper also generalizes Pinsker's inequality to all orders alpha in (0, 1]: (alpha / 2) * V^2(P, Q) <= D_alpha(P || Q), where V is the total variation distance, and relates Renyi divergence to Chernoff information (Theorems 30 and 32).
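The generalized Pinsker inequality can be checked numerically on a small discrete example. An illustrative sketch; V here is the total variation distance sum |p_i - q_i|, matching the paper's convention:

```python
import numpy as np

p = np.array([0.1, 0.6, 0.3])
q = np.array([0.3, 0.3, 0.4])

V = np.abs(p - q).sum()  # total variation distance V(P, Q)

# Check (alpha/2) * V^2 <= D_alpha(P || Q) for several orders in (0, 1].
for alpha in [0.1, 0.5, 1.0]:
    if alpha == 1.0:
        d = np.sum(p * np.log(p / q))  # D_1 = KL divergence
    else:
        d = np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1)
    assert (alpha / 2) * V ** 2 <= d + 1e-12
```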
Relevance to Koloskova et al. (2025)
This paper provides the theoretical foundations for the Renyi divergence tools used in Koloskova et al.’s unlearning analysis:
- Lemma A.7 (Koloskova) uses the data processing inequality for Renyi divergence (Theorem 9 here): D_alpha(P_Y || Q_Y) <= D_alpha(P_X || Q_X) when Y is obtained by processing X through a Markov kernel. This is applied to show that each noisy SGD iteration does not increase the Renyi divergence between adjacent model distributions.
- The relation between Renyi divergence and KL divergence (Theorem 5 here: D_1 equals the KL divergence) underpins the conversion from Renyi divergence bounds to (epsilon, delta)-DP guarantees via the standard RDP-to-DP conversion.
- The Gaussian Renyi divergence formula (eq. 10) is used when analyzing the noise addition steps in the gradient clipping algorithm.
- The Renyi unlearning definition used in Chien et al. (2024) and related works is built directly on the Renyi divergence as defined here.
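The data processing inequality invoked above can be illustrated on a two-state example by pushing adjacent distributions through the same Markov kernel. An illustrative NumPy sketch (the specific distributions and kernel are mine, not from Koloskova et al.):

```python
import numpy as np

def renyi(p, q, alpha):
    # Simple-order Renyi divergence for finite distributions.
    return np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1)

# Adjacent input distributions P_X, Q_X and a Markov kernel K,
# where row x of K is the conditional distribution P(Y | X = x).
p_x = np.array([0.7, 0.3])
q_x = np.array([0.5, 0.5])
K = np.array([[0.9, 0.1],
              [0.2, 0.8]])

p_y = p_x @ K  # output distribution of Y under P
q_y = q_x @ K  # output distribution of Y under Q

# Theorem 9: processing cannot increase the Renyi divergence.
alpha = 2.0
assert renyi(p_y, q_y, alpha) <= renyi(p_x, q_x, alpha) + 1e-12
```

This is the one-step version of the argument: composing many such kernels (one per noisy SGD iteration) keeps the divergence between adjacent model distributions bounded by the input divergence.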