Markov kernel contraction refers to the phenomenon whereby applying a Markov operator K to probability distributions causes statistical divergences between those distributions to decrease. This property is fundamental to the theory of privacy amplification by iteration and amplification by mixing, and provides the mathematical mechanism underpinning certified unlearning guarantees for iterative algorithms.
A Markov operator K : X → P(Y) maps points in X to probability distributions over Y. It acts on distributions by left multiplication: (mu K)(E) = integral K(x)(E) mu(dx), and on functions by right multiplication: (Kf)(x) = integral f(y) K(x, dy). The data processing inequality states that for any f-divergence D_f (including total variation, KL divergence, and hockey-stick divergence, and hence also Renyi divergence, which is a monotone transform of an f-divergence), D_f(mu K || nu K) <= D_f(mu || nu). That is, applying any Markov operator cannot increase divergence; this is the standard post-processing guarantee.
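The DPI can be checked numerically in the finite-state case, where a Markov kernel is just a row-stochastic matrix and the left action mu K is a vector-matrix product. A minimal sketch (the 4-state distributions and random kernel below are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence between finite distributions (natural log)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two distributions on 4 states and a random Markov kernel:
# K[x, y] = K(x)({y}), with each row a probability distribution.
mu = np.array([0.7, 0.1, 0.1, 0.1])
nu = np.array([0.25, 0.25, 0.25, 0.25])
K = rng.random((4, 4))
K /= K.sum(axis=1, keepdims=True)   # normalize rows to make K row-stochastic

# Left action on distributions: (mu K)(y) = sum_x mu(x) K(x, y).
# The DPI says pushing both distributions through K cannot increase KL.
assert kl(mu @ K, nu @ K) <= kl(mu, nu)
```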
The stronger property of strict contraction occurs when D_f(mu K || nu K) < D_f(mu || nu) for mu != nu. The quantitative version introduces the contraction coefficient alpha(K) such that D_f(mu K || nu K) <= alpha(K) * D_f(mu || nu). For the hockey-stick divergence D_{e^epsilon}, Balle et al. (2019) in Privacy Amplification by Mixing and Diffusion Mechanisms define the contraction coefficient as alpha_epsilon(K) = sup_{x1,x2} D_{e^epsilon}(K(x1) || K(x2)). The classical result of Cohen et al. (1993) and Del Moral et al. (2003) established that for the Dobrushin coefficient gamma = sup_{x,x'} TV(K(x), K(x')), all Csiszar f-divergences satisfy D(mu K || nu K) <= gamma * D(mu || nu).
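In the finite-state case the Dobrushin coefficient is just the worst-case total variation between two rows of the transition matrix, and the strong contraction bound can be verified directly. A minimal sketch (the 5-state kernel and Dirichlet-sampled distributions are illustrative assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def tv(p, q):
    """Total variation distance between finite distributions."""
    return 0.5 * float(np.abs(p - q).sum())

n = 5
K = rng.random((n, n)) + 0.5        # rows bounded away from 0, so gamma < 1
K /= K.sum(axis=1, keepdims=True)   # row-stochastic

# Dobrushin coefficient: gamma = sup_{x,x'} TV(K(x), K(x')),
# here the max over all pairs of rows.
gamma = max(tv(K[i], K[j]) for i, j in combinations(range(n), 2))

mu = rng.dirichlet(np.ones(n))
nu = rng.dirichlet(np.ones(n))

# Strong contraction for TV (itself an f-divergence):
# TV(mu K, nu K) <= gamma * TV(mu, nu).
assert gamma < 1
assert tv(mu @ K, nu @ K) <= gamma * tv(mu, nu) + 1e-12
```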
In the context of differential privacy and unlearning, each step of a noisy optimization algorithm (gradient computation + noise addition, possibly with clipping and projection) defines a Markov kernel. The contraction properties of this kernel determine how quickly privacy/unlearning guarantees improve with additional iterations. For the shifted Renyi divergence used in Feldman et al.’s analysis, the contraction manifests through the contraction lemma (Lemma 21): applying contractive (1-Lipschitz) maps cannot increase the shifted divergence. For the hockey-stick divergence used in Koloskova et al.’s model clipping analysis, the contraction manifests through the multiplicative data processing inequality.
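One such step can be sketched as a sampling procedure: the map from the current iterate to the distribution of the next iterate is the kernel K. The loss (a quadratic), learning rate, clipping threshold, and noise scale below are made-up illustrations, not parameters from any of the cited analyses:

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_step(w, grad_fn, lr=0.1, clip=1.0, sigma=0.5):
    """One noisy clipped gradient step: draws a sample from the kernel K(w)."""
    g = grad_fn(w)
    g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # gradient clipping
    return w - lr * g + sigma * rng.normal(size=w.shape)  # Gaussian noise

# Hypothetical quadratic loss 0.5 * ||w||^2 with gradient w. Ignoring clipping,
# the drift map w -> w - lr * w is (1 - lr)-Lipschitz, i.e. contractive, which
# is the property the divergence-contraction arguments above exploit.
w = np.array([3.0, -2.0])
for _ in range(5):
    w = noisy_step(w, lambda v: v)
```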
Key Details
- Data processing inequality (DPI): For any Markov kernel K and any f-divergence D_f: D_f(mu K || nu K) <= D_f(mu || nu). This is equivalent to the post-processing property of differential privacy.
- Strong Markov contraction (Del Moral et al., 2003): If gamma = sup_{x,x'} TV(K(x), K(x')) < 1, then for any Csiszar f-divergence D, D(mu K || nu K) <= gamma * D(mu || nu). This is the basis for the gamma-Dobrushin amplification result.
- Hockey-stick DPI with contraction (Balle et al., 2019): D_{e^epsilon}(mu K || nu K) <= alpha_epsilon(K) * D_{e^epsilon}(mu || nu), where alpha_epsilon(K) = sup_{x1,x2} D_{e^epsilon}(K(x1) || K(x2)). This is the key inequality in Koloskova et al.'s Lemma A.2 for model clipping for unlearning.
- Renyi divergence DPI via couplings (Balle et al., 2019, Theorem 2): R_alpha(mu K || nu K) <= R_alpha(omega || nu) + sup_{x in supp(nu)} R_alpha((H_pi K)(x) || K(x)), where omega is an auxiliary distribution and H_pi is the Markov kernel induced by a coupling pi of omega and mu. This generalizes Feldman et al.'s shift-reduction lemma to measure-theoretic settings.
- Iterated kernels: For a sequence of kernels K_1, ..., K_T, divergence compounds multiplicatively: D_{e^epsilon}(mu K_1...K_T || nu K_1...K_T) <= prod_{t=1}^{T} alpha_epsilon(K_t) * D_{e^epsilon}(mu || nu), giving exponential contraction when each alpha_epsilon(K_t) < 1.
- Gaussian kernel example: For K(x) = N(psi(x), sigma^2 I) with L-Lipschitz psi, the Renyi divergence satisfies R_alpha(K(x) || K(x')) = alpha * ||psi(x) - psi(x')||^2 / (2 * sigma^2) <= alpha * L^2 * ||x - x'||^2 / (2 * sigma^2), so the guarantee improves as L decreases or sigma increases.
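The Gaussian example admits a direct numerical check, since the Renyi divergence between equal-covariance Gaussians has the closed form alpha * ||m1 - m2||^2 / (2 * sigma^2). A minimal sketch, using an illustrative linear map psi(x) = A x with ||A v|| <= L ||v|| (the values of L, sigma, alpha, x1, x2 are assumptions for the demo):

```python
import numpy as np

def renyi_gauss(m1, m2, sigma, alpha):
    """Closed-form Renyi divergence between N(m1, sigma^2 I) and N(m2, sigma^2 I)."""
    return alpha * float(np.sum((m1 - m2) ** 2)) / (2 * sigma ** 2)

L, sigma, alpha = 0.9, 1.0, 2.0
A = L * np.eye(3)                   # ||A v|| <= L ||v||, so psi(x) = A x is L-Lipschitz
psi = lambda x: A @ x

x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([-1.0, 1.0, 0.0])

# R_alpha(K(x1) || K(x2)) = alpha * ||psi(x1) - psi(x2)||^2 / (2 sigma^2)
lhs = renyi_gauss(psi(x1), psi(x2), sigma, alpha)
# Lipschitz bound: <= alpha * L^2 * ||x1 - x2||^2 / (2 sigma^2)
rhs = alpha * L ** 2 * float(np.sum((x1 - x2) ** 2)) / (2 * sigma ** 2)
assert lhs <= rhs + 1e-12
```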