The Renyi divergence is a one-parameter family of divergence measures between probability distributions, originally introduced by Alfred Renyi (1961) as a generalization of the Kullback-Leibler divergence. It plays a central role in Renyi differential privacy as introduced by MIRONOV2017, and is the divergence measure underlying the Renyi unlearning definition used in certified machine unlearning.
For two probability distributions P and Q defined over the same measurable space, with P absolutely continuous with respect to Q, the Renyi divergence of order alpha (for alpha in (0, 1) or (1, infinity)) is:
D_alpha(P || Q) = 1/(alpha - 1) * log E_{x ~ Q} [(P(x)/Q(x))^alpha]
where P(x) and Q(x) denote the densities of P and Q at x, and all logarithms are natural. The definition extends to the boundary cases by continuity. At alpha = 1, it reduces to the Kullback-Leibler divergence (relative entropy): D_1(P || Q) = E_{x ~ P}[log(P(x)/Q(x))]. At alpha = infinity, it becomes the max-divergence: D_inf(P || Q) = sup_{x in supp Q} log(P(x)/Q(x)).
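For discrete distributions, the definition and both limiting cases can be evaluated directly. The following is a minimal NumPy sketch; the function name and the example distributions p and q are illustrative, not from the source:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) for discrete distributions (natural log).

    Assumes supp(P) is contained in supp(Q); otherwise the divergence
    is infinite for alpha >= 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if alpha == 1.0:  # KL divergence: the alpha -> 1 limit
        m = p > 0
        return float(np.sum(p[m] * np.log(p[m] / q[m])))
    if np.isinf(alpha):  # max-divergence: the alpha -> infinity limit
        m = q > 0
        return float(np.log(np.max(p[m] / q[m])))
    # simple orders: 1/(alpha - 1) * log sum p^alpha * q^(1 - alpha)
    m = (p > 0) & (q > 0)
    return float(np.log(np.sum(p[m] ** alpha * q[m] ** (1 - alpha))) / (alpha - 1))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
```

Evaluating this function at increasing orders (e.g. 0.5, 1, 2, infinity) exhibits the monotonicity in alpha numerically.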
The Renyi divergence interpolates between these extremes, providing a spectrum of divergence measures that capture different aspects of the relationship between distributions. Lower orders alpha are more “average-case” (dominated by the bulk of the distributions), while higher orders emphasize the tails (worst-case density ratios). This makes the Renyi divergence especially well-suited for privacy analysis, where controlling all moments of the privacy loss distribution is important.
In certified machine unlearning, the Renyi divergence appears in two key ways: (1) as the measure of closeness between the unlearned model distribution and the retrained model distribution in the Renyi unlearning definition, and (2) in the privacy amplification by iteration analysis, where contraction of Renyi divergence across iterations establishes that each noisy SGD step brings the unlearned model closer to the retrained model.
Rigorous Definition on General Spaces (Van Erven & Harremos, 2014)
The definitive treatment of Renyi divergence on arbitrary measurable spaces is given by VANERVEN2014. The general definition for simple orders (alpha in (0, infinity), alpha != 1) is:
D_alpha(P || Q) = (1 / (alpha - 1)) * ln(integral of p^alpha * q^{1-alpha} d mu)
where p, q are densities with respect to any common dominating measure mu. The definition is independent of the choice of mu. The extended orders are defined by limits:
- D_0(P || Q) = -ln Q(p > 0) (related to absolute continuity)
- D_1(P || Q) = D(P || Q) = KL divergence (see Renyi divergence and KL relationship)
- D_infinity(P || Q) = ln ess sup (p/q)
An equivalent characterization via discretization (Theorem 2): D_alpha(P || Q) = sup_F D_alpha(P|_F || Q|_F), where the supremum is over all finite partitions F of the underlying space.
Key Properties
- Non-negativity (Theorem 8): D_alpha(P || Q) >= 0 for all alpha in [0, infinity], with equality (for alpha > 0) iff P = Q.
- Monotonicity in order (Theorem 3): D_alpha(P || Q) is nondecreasing in alpha on [0, infinity]. Higher orders give larger divergence.
- Data processing inequality (Theorems 1 and 9): D_alpha(P|_G || Q|_G) <= D_alpha(P || Q) for any sub-sigma-algebra G, for all alpha in [0, infinity]. See data processing inequality for Renyi divergence.
- Convexity: Jointly convex in (P, Q) for alpha in [0, 1] (Theorem 11). Convex in Q for all alpha in [0, infinity] (Theorem 12). Jointly quasi-convex for all alpha (Theorem 13).
- Skew symmetry (Proposition 2): D_alpha(P || Q) = (alpha / (1 - alpha)) * D_{1-alpha}(Q || P) for 0 < alpha < 1.
- Pinsker’s inequality (Theorem 31): (alpha / 2) * V^2(P, Q) <= D_alpha(P || Q) for alpha in (0, 1], where V is total variation distance.
- Additivity (Theorem 28): D_alpha(P1 x … x PN || Q1 x … x QN) = sum of D_alpha(Pi || Qi) for product distributions.
- Pythagorean inequality (Theorem 14): Generalized to all alpha in (0, infinity) for alpha-convex sets.
- Continuity (Theorem 7): D_alpha is continuous in alpha on the set where it is finite.
- Weak triangle inequality (Proposition 11 in Mironov): the order doubles under composition, which can degrade bounds for sequential operations.
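The additivity property in particular is easy to verify numerically: for product distributions, the divergence of the joint equals the sum over components. A small sketch for a simple order on discrete supports (the distributions below are illustrative):

```python
import numpy as np

def renyi(p, q, alpha):
    # Simple order alpha != 1, discrete support with q > 0 wherever p > 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1))

p1, q1 = [0.6, 0.4], [0.5, 0.5]
p2, q2 = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]

# Product distributions P1 x P2 and Q1 x Q2 as flattened outer products.
p12 = np.outer(p1, p2).ravel()
q12 = np.outer(q1, q2).ravel()

alpha = 3.0
joint = renyi(p12, q12, alpha)                        # D_alpha(P1 x P2 || Q1 x Q2)
summed = renyi(p1, q1, alpha) + renyi(p2, q2, alpha)  # D_alpha(P1||Q1) + D_alpha(P2||Q2)
```

The two quantities agree exactly, since the integral in the definition factorizes over independent components.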
Special Cases and Relations
- D_1 = KL divergence (see Renyi divergence and KL relationship)
- D_{1/2} = -2 * ln(1 - Hel^2/2), related to squared Hellinger distance
- D_2 = ln(1 + chi^2(P, Q)), related to chi-squared divergence
- D_infinity = ln(ess sup p/q), the worst-case log-likelihood ratio
- Ordering: Hel^2 <= D_{1/2} <= D_1 <= D_2 <= chi^2(P, Q)
- Relationship to hockey-stick divergence: can be converted via Markov’s inequality, connecting RDP to (epsilon, delta)-DP
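The chi-squared and Hellinger relations above are exact algebraic identities and can be confirmed numerically. A sketch with illustrative discrete distributions, using the convention Hel^2 = sum (sqrt(p) - sqrt(q))^2, which takes values in [0, 2]:

```python
import numpy as np

def renyi(p, q, alpha):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

chi2 = float(np.sum((p - q) ** 2 / q))                # chi-squared divergence
hel2 = float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))  # squared Hellinger distance

d_half = renyi(p, q, 0.5)  # should equal -2 * ln(1 - hel2 / 2)
d_two = renyi(p, q, 2.0)   # should equal ln(1 + chi2)
```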
Gaussian Case
For Gaussians N(mu_0, sigma_0^2) and N(mu_1, sigma_1^2) with positive variances (eq. 10 in Van Erven & Harremos):
D_alpha = alpha * (mu_1 - mu_0)^2 / (2 * sigma_alpha^2) + (1 / (1 - alpha)) * ln(sigma_alpha / (sigma_0^{1-alpha} * sigma_1^alpha))
where sigma_alpha^2 = (1 - alpha) * sigma_0^2 + alpha * sigma_1^2. For equal variances: D_alpha(N(mu_0, sigma^2) || N(mu_1, sigma^2)) = alpha * (mu_1 - mu_0)^2 / (2 * sigma^2).
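A sketch of this closed form, checked against the equal-variance special case and against the Gaussian KL divergence as alpha approaches 1 (the function names below are illustrative):

```python
import numpy as np

def renyi_gaussian(mu0, s0, mu1, s1, alpha):
    """D_alpha(N(mu0, s0^2) || N(mu1, s1^2)) in closed form.

    Valid when sigma_alpha^2 = (1 - alpha) * s0^2 + alpha * s1^2 > 0;
    otherwise the divergence is infinite."""
    var_a = (1 - alpha) * s0 ** 2 + alpha * s1 ** 2
    s_a = np.sqrt(var_a)
    return (alpha * (mu1 - mu0) ** 2 / (2 * var_a)
            + np.log(s_a / (s0 ** (1 - alpha) * s1 ** alpha)) / (1 - alpha))

def kl_gaussian(mu0, s0, mu1, s1):
    # KL(N(mu0, s0^2) || N(mu1, s1^2)), the alpha -> 1 limit.
    return np.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5

# Equal variances: D_alpha = alpha * (mu1 - mu0)^2 / (2 * sigma^2) = 3 * 4 / 2 = 6.
d_equal = renyi_gaussian(0.0, 1.0, 2.0, 1.0, alpha=3.0)
```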
Gaussian Dichotomy (Kakutani)
For infinite product distributions P = P_1 x P_2 x … and Q = Q_1 x Q_2 x …, Theorem 29 (Kakutani’s Dichotomy) states that for alpha in (0, 1), either P ~ Q (equivalent, sum D_alpha(Pn || Qn) < infinity) or P is perpendicular to Q (mutually singular, sum = infinity). There is no intermediate case.
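Combined with the equal-variance Gaussian formula, the criterion becomes concrete: for P_n = N(mu_n, 1) and Q_n = N(0, 1), D_alpha(P_n || Q_n) = alpha * mu_n^2 / 2, so P ~ Q exactly when sum mu_n^2 converges. A small illustrative check with partial sums (the mean sequences are chosen for illustration):

```python
import numpy as np

# For P_n = N(mu_n, 1) and Q_n = N(0, 1), the equal-variance formula gives
# D_alpha(P_n || Q_n) = alpha * mu_n^2 / 2, so Kakutani's criterion reduces
# to whether sum_n mu_n^2 converges.
alpha = 0.5
n = np.arange(1, 200001, dtype=float)

# mu_n = 1/n: sum mu_n^2 = pi^2/6 < infinity, so P ~ Q (equivalent).
equivalent_sum = float(np.sum(alpha * (1.0 / n) ** 2 / 2))

# mu_n = 1/sqrt(n): sum mu_n^2 is the harmonic series, so P and Q are
# mutually singular; the partial sums grow without bound.
singular_sum = float(np.sum(alpha * (1.0 / np.sqrt(n)) ** 2 / 2))
```

The first partial sum stabilizes near (alpha/2) * pi^2/6, while the second grows logarithmically in the number of factors, mirroring the two sides of the dichotomy.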