The hockey-stick divergence (also called E-divergence or epsilon-hockey-stick divergence) is a statistical divergence measure between probability distributions that plays a central role in the analysis of differential privacy and certified approximate unlearning. For epsilon >= 0 and two probability measures mu and nu defined over R^d, the hockey-stick divergence is defined as:
E_epsilon(mu || nu) := integral over R^d of [d mu - e^epsilon * d nu]_+
where [x]_+ := max{0, x}. Equivalently, it can be expressed as: E_epsilon(mu || nu) = sup over measurable sets A of (mu(A) - e^epsilon * nu(A)).
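For discrete distributions, the equivalence of the two formulations can be checked directly. The sketch below (illustrative, not from any paper; function names are mine) computes the pointwise form and verifies it against a brute-force search over all subsets A:

```python
import math

def hockey_stick(p, q, eps):
    """E_eps(p || q) = sum_i [p_i - e^eps * q_i]_+ for discrete distributions."""
    return sum(max(0.0, pi - math.exp(eps) * qi) for pi, qi in zip(p, q))

def hockey_stick_sup(p, q, eps):
    """sup over sets A of p(A) - e^eps * q(A), by brute force over all subsets."""
    n = len(p)
    best = 0.0
    for mask in range(1 << n):
        pA = sum(p[i] for i in range(n) if (mask >> i) & 1)
        qA = sum(q[i] for i in range(n) if (mask >> i) & 1)
        best = max(best, pA - math.exp(eps) * qA)
    return best

p, q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
print(hockey_stick(p, q, 0.1))  # matches hockey_stick_sup(p, q, 0.1)
```

The supremum is attained at the set A* = {i : p_i > e^eps * q_i}, which is exactly why the two formulations agree.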
The hockey-stick divergence directly characterizes (epsilon, delta)-differential privacy (and by extension, (epsilon, delta)-unlearning): a mechanism M satisfies (epsilon, delta)-DP if and only if E_epsilon(M(x) || M(x')) <= delta for all neighboring inputs x, x'. This makes it a natural tool for proving unlearning guarantees, as controlling E_epsilon directly translates to bounding the delta parameter.
In Certified Unlearning for Neural Networks, the hockey-stick divergence is used to analyze the model clipping for unlearning algorithm. The proof of Theorem 4.2 tracks how E_epsilon between the unlearned model distribution and the reference (retrain) distribution contracts across iterations of the clipping-plus-noise Markov kernel. A key property exploited is the data processing inequality for the hockey-stick divergence: E_epsilon(mu K || nu K) <= sup_{x1, x2} E_epsilon(K(x1) || K(x2)) * E_epsilon(mu || nu), which enables recursive bounding of the divergence across T iterations.
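Schematically, the recursion unrolls as follows. This is an illustrative sketch under the assumption of a constant per-step contraction factor; the paper's actual analysis also tracks the clipping radius and noise scale:

```python
def divergence_bound_after(d0, alpha, T):
    """If one application of the kernel K contracts the hockey-stick divergence
    by a factor alpha = sup_{x1,x2} E_eps(K(x1) || K(x2)) < 1, then unrolling
    the recursion d_t <= alpha * d_{t-1} over T iterations gives alpha**T * d0."""
    d = d0
    for _ in range(T):
        d *= alpha  # one application of the multiplicative DPI
    return d

print(divergence_bound_after(1.0, 0.5, 3))  # 0.125
```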
Key Details
- At epsilon = 0, E_0(mu || nu) reduces to the total variation distance: TV(mu, nu) = (1/2) * integral |d mu - d nu|.
- The hockey-stick divergence is related to Renyi divergence through conversion lemmas (e.g., Balle et al., 2020), allowing proofs that work with Renyi divergence to be converted to (epsilon, delta)-guarantees. The RDP to (epsilon, delta)-DP conversion from Mironov (2017) provides the standard bridge.
- Directly characterizes (epsilon, delta)-differential privacy: a mechanism M is (epsilon, delta)-DP iff E_epsilon(M(x) || M(x')) <= delta for all neighboring x, x' (see Dwork & Roth, 2014, Remark 3.2).
- For Gaussian distributions N(mu_1, sigma^2 I) and N(mu_2, sigma^2 I), the hockey-stick divergence has an explicit expression involving the Q-function (Gaussian tail probability): E_epsilon = Q(epsilon * sigma / ||mu_1 - mu_2|| - ||mu_1 - mu_2|| / (2 * sigma)) - e^epsilon * Q(epsilon * sigma / ||mu_1 - mu_2|| + ||mu_1 - mu_2|| / (2 * sigma)). See hockey-stick divergence for Gaussians for the full derivation.
- A simpler upper bound (Lemma A.4 in Koloskova et al.): E_epsilon(N(mu_1, sigma^2 I) || N(mu_2, sigma^2 I)) <= 1.25 * exp(-sigma^2 * epsilon^2 / (2 * ||mu_1 - mu_2||^2)).
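The two Gaussian facts above can be checked numerically. A minimal sketch (function names are mine) implementing the explicit Q-function expression and the simpler exponential bound:

```python
import math

def Q(x):
    # Gaussian tail probability: Q(x) = P[N(0,1) > x]
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gaussian_hockey_stick(delta_mu, sigma, eps):
    """Exact E_eps between N(mu_1, sigma^2 I) and N(mu_2, sigma^2 I),
    where delta_mu = ||mu_1 - mu_2||."""
    a = eps * sigma / delta_mu
    b = delta_mu / (2.0 * sigma)
    return Q(a - b) - math.exp(eps) * Q(a + b)

def gaussian_upper_bound(delta_mu, sigma, eps):
    # The simpler exponential bound stated in the bullet above.
    return 1.25 * math.exp(-(sigma ** 2) * eps ** 2 / (2.0 * delta_mu ** 2))

# At eps = 0 the exact expression reduces to the total variation distance.
print(gaussian_hockey_stick(1.0, 1.0, 1.0), gaussian_upper_bound(1.0, 1.0, 1.0))
```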
Contraction Under Markov Kernels
The hockey-stick divergence satisfies a strong data processing inequality with an explicit contraction coefficient. For any Markov operator K, the standard data processing inequality gives D_{e^epsilon}(mu K || nu K) <= D_{e^epsilon}(mu || nu). Balle et al. (2019) in Privacy Amplification by Mixing and Diffusion Mechanisms sharpen this to a multiplicative contraction:
D_{e^epsilon}(mu K || nu K) <= alpha_epsilon(K) * D_{e^epsilon}(mu || nu)
where alpha_epsilon(K) = sup_{x1, x2} D_{e^epsilon}(K(x1) || K(x2)) is the contraction coefficient (equivalently, the (gamma, epsilon)-Dobrushin coefficient with gamma = alpha_epsilon(K)). When alpha_epsilon(K) < 1, each application of K strictly reduces the hockey-stick divergence.
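On a finite state space the multiplicative contraction can be checked directly. The toy two-state kernel below is my own example, not one from the paper:

```python
import math

def hs(p, q, eps):
    # Discrete hockey-stick divergence E_eps(p || q).
    return sum(max(0.0, a - math.exp(eps) * b) for a, b in zip(p, q))

def push(mu, K):
    # Push a distribution through a row-stochastic kernel: (mu K)_j = sum_i mu_i K[i][j].
    return [sum(mu[i] * K[i][j] for i in range(len(mu))) for j in range(len(K[0]))]

def alpha(K, eps):
    # Contraction coefficient: worst-case divergence between rows of K.
    return max(hs(r1, r2, eps) for r1 in K for r2 in K)

eps = 0.1
K = [[0.7, 0.3], [0.4, 0.6]]          # a toy 2-state Markov kernel
mu, nu = [0.9, 0.1], [0.2, 0.8]
lhs = hs(push(mu, K), push(nu, K), eps)
rhs = alpha(K, eps) * hs(mu, nu, eps)
print(lhs <= rhs, alpha(K, eps) < 1)  # True True
```

Because both rows of K mix the two states, alpha_epsilon(K) < 1 and one application of K strictly shrinks the divergence.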
The hockey-stick divergence belongs to the family of Csiszar f-divergences with generator phi(u) = [u - e^epsilon]_+. All Csiszar divergences satisfy joint convexity: D((1-gamma) * mu_1 + gamma * mu_2 || (1-gamma) * nu_1 + gamma * nu_2) <= (1-gamma) * D(mu_1 || nu_1) + gamma * D(mu_2 || nu_2), as well as the standard data processing inequality D(mu K || nu K) <= D(mu || nu). The map epsilon -> D_{e^epsilon} is monotonically decreasing in epsilon, and D_1 = TV (total variation).
Asoodeh et al. (2020) proved that the contraction coefficient satisfies eta_gamma(K) = sup_{x1, x2} E_gamma(K(.|x1) || K(.|x2)), generalizing Dobrushin's classical formula for total variation. For Gaussian kernels on a bounded domain, this yields the explicit E_gamma contraction coefficient theta_gamma(diam(D)/sigma), which is strictly less than 1 whenever the domain is bounded and noise is present.
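Under the assumption that the kernel adds N(0, sigma^2 I) noise to means confined to a domain of diameter diam(D), the coefficient can be evaluated with the Gaussian Q-function expression from the Key Details above. The function name theta and the parameterization r = diam(D)/sigma are mine:

```python
import math

def Q(x):
    # Gaussian tail probability Q(x) = P[N(0,1) > x].
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def theta(gamma, r):
    """E_gamma contraction coefficient for a Gaussian kernel whose means are
    at most r = diam(D)/sigma standard deviations apart (gamma = e^eps >= 1)."""
    eps = math.log(gamma)
    return Q(eps / r - r / 2.0) - gamma * Q(eps / r + r / 2.0)

# More noise relative to the domain (smaller r) gives a smaller coefficient:
print(theta(math.e, 1.0), theta(math.e, 0.5))
```

Both values are strictly below 1, and shrinking r (stronger noise or a tighter domain) drives the coefficient toward 0, which is what makes the per-iteration privacy gain possible.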
These contraction results are used in two key ways in Certified Unlearning for Neural Networks (Koloskova et al., 2025): (1) the multiplicative DPI for hockey-stick divergence is the core of the model clipping for unlearning proof (Lemma A.2, Theorem 4.2), where the contraction coefficient of the clipping-plus-noise kernel determines the per-iteration privacy gain; (2) the connection between hockey-stick and Renyi divergences provides the bridge for converting the gradient clipping for unlearning Renyi-based analysis to (epsilon, delta)-guarantees.