Summary

This paper provides a comprehensive treatment of Tweedie’s formula — the result that for mu ~ g(.) and z|mu ~ N(mu, sigma^2), the posterior expectation is E{mu|z} = z + sigma^2 * l’(z), where l’(z) = d/dz log f(z) is the score of the marginal density f(z). The formula is foundational for score-based generative modelling: the term l’(z) = nabla log f(z) is precisely the score function that diffusion models learn to estimate, and the formula shows that this score provides the optimal Bayesian denoising correction.
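As a quick sanity check, the formula can be verified in the conjugate normal case, where the marginal density and the exact posterior mean are both available in closed form. A minimal sketch (the prior N(0, tau^2) and the specific values of tau^2, sigma^2 are illustrative assumptions):

```python
import numpy as np

# Sketch: verify Tweedie's formula E{mu|z} = z + sigma^2 * l'(z) in the conjugate
# case mu ~ N(0, tau^2), z|mu ~ N(mu, sigma^2), where the marginal f is
# N(0, tau^2 + sigma^2) and the exact posterior mean is known in closed form.
tau2, sigma2 = 4.0, 1.0
z = np.linspace(-3.0, 3.0, 7)

score = -z / (tau2 + sigma2)          # l'(z) = d/dz log f(z) for the marginal
tweedie = z + sigma2 * score          # unbiased estimate + Bayes correction
exact = z * tau2 / (tau2 + sigma2)    # conjugate posterior mean

print(np.max(np.abs(tweedie - exact)))  # agreement up to float rounding
```

Here the score is linear in z, so the Bayes correction reduces to the familiar linear shrinkage factor tau^2/(tau^2 + sigma^2).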

Efron places Tweedie’s formula within the broader exponential family framework: for eta ~ g(.) and z|eta ~ f_eta(z) = exp(eta*z - psi(eta))*f_0(z), the posterior mean and variance are E{eta|z} = lambda’(z) and Var{eta|z} = lambda’’(z), where lambda(z) = log(f(z)/f_0(z)). For the normal translation family, this recovers Tweedie’s formula and additionally gives the posterior variance as sigma^2(1 + sigma^2 * l’’(z)), connecting the curvature of the log-marginal density to posterior uncertainty. The paper also extends the formula to the Poisson family — directly relevant to Poisson random bridges — and to gamma families with skewness corrections.
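The variance half of the result can be checked the same way. In the conjugate normal setting l’’(z) is constant, so the formula should reproduce the usual conjugate posterior variance (a minimal sketch, with tau^2 and sigma^2 chosen arbitrarily):

```python
import numpy as np

# Sketch: the posterior-variance companion, Var{mu|z} = sigma^2 * (1 + sigma^2 * l''(z)),
# in the conjugate setting mu ~ N(0, tau^2), z|mu ~ N(mu, sigma^2). The marginal is
# N(0, tau^2 + sigma^2), so l''(z) = -1/(tau^2 + sigma^2) at every z, and the formula
# should reproduce the conjugate posterior variance sigma^2 * tau^2 / (tau^2 + sigma^2).
tau2, sigma2 = 4.0, 1.0
l2 = -1.0 / (tau2 + sigma2)                # curvature of the log-marginal density

post_var = sigma2 * (1.0 + sigma2 * l2)    # Tweedie posterior variance
exact = sigma2 * tau2 / (tau2 + sigma2)    # conjugate posterior variance
print(post_var, exact)
```

Since l’’(z) < 0 here, the posterior variance is strictly below sigma^2, previewing the log-concavity finding later in the note.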

Key Contributions

  • Derives Tweedie’s formula as a special case of exponential family posterior moments, giving both mean (via l’(z)) and variance (via l’’(z))
  • Establishes the connection E{mu|z} = unbiased estimate + Bayes correction (eq. 2.9), the same decomposition underlying denoising in diffusion models
  • Extends the formula to Poisson data: E{mu|z} = (z+1)f(z+1)/f(z), relevant for discrete/counting processes
  • Introduces the concept of empirical Bayes information — quantifying how much each “other” observation contributes to estimating a particular mu_i
  • Shows near-equivalence between Tweedie’s empirical Bayes and James-Stein estimation for normal priors
  • Extends the formula to handle “relevance” (spatially varying priors), connecting to the covariate-dependent score estimation needed in conditional generation
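The Poisson version above is attractive because the marginal f can be estimated directly from raw frequencies. A small simulation sketch; the Gamma prior is an assumption made here so the exact posterior mean (z + a)/(beta + 1) is available for comparison:

```python
import numpy as np

# Sketch: Robbins' Poisson form of Tweedie's formula, E{mu|z} = (z+1) f(z+1)/f(z),
# with f-hat taken as empirical frequencies. Assumed prior: mu ~ Gamma(shape=a,
# rate=beta), so the exact conjugate posterior mean is (z + a)/(beta + 1).
rng = np.random.default_rng(0)
a, beta, n = 3.0, 1.0, 200_000

mu = rng.gamma(a, 1.0 / beta, size=n)   # latent Poisson means
z = rng.poisson(mu)                     # observed counts

freq = np.bincount(z) / n               # empirical marginal f-hat(z)
zs = np.arange(5)
robbins = (zs + 1) * freq[zs + 1] / freq[zs]   # plug-in Robbins estimate
exact = (zs + a) / (beta + 1)                  # conjugate posterior mean

print(np.max(np.abs(robbins - exact)))  # small Monte Carlo error
```

No density model is fit at all: the empirical Bayes estimate comes straight from the count table, which is what makes the discrete case so clean.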

Methodology

The paper derives Tweedie’s formula from exponential family theory. Given the model eta ~ g(.), z|eta ~ exp(eta*z - psi(eta))*f_0(z), Bayes’ rule yields a posterior that is itself an exponential family in eta, with cumulant generating function lambda(z) = log(f(z)/f_0(z)); differentiating lambda(z) yields the posterior cumulants (mean, variance, and higher). For practical implementation, Lindsey’s method estimates l(z) = log f(z) by fitting a Poisson GLM to binned data counts, yielding a smooth, differentiable estimate l-hat(z) whose derivative provides the empirical Bayes correction. The James-Stein estimator emerges as the special case J=2 of the polynomial model (eq. 3.1) with a normal prior.
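A minimal sketch of this pipeline, assuming a normal prior N(0, tau^2) so the ideal shrinkage factor tau^2/(tau^2 + sigma^2) = 0.8 is known; the Poisson GLM is hand-rolled as a few Newton steps rather than calling a GLM library:

```python
import numpy as np

# Sketch of Lindsey's method + Tweedie: bin the z-values, fit a Poisson regression
# of bin counts on a quadratic basis of the bin centers (J = 2), then differentiate
# the fitted log-density to get the empirical Bayes correction l-hat'(z).
rng = np.random.default_rng(1)
n, sigma2, tau2 = 50_000, 1.0, 4.0
z = rng.normal(0.0, np.sqrt(tau2), n) + rng.normal(0.0, np.sqrt(sigma2), n)

edges = np.linspace(z.min(), z.max(), 61)
counts, _ = np.histogram(z, bins=edges)
x = 0.5 * (edges[:-1] + edges[1:])               # bin centers

X = np.vander(x, 3, increasing=True)             # polynomial basis 1, x, x^2
beta = np.linalg.lstsq(X, np.log(counts + 0.5), rcond=None)[0]  # warm start
for _ in range(40):                              # Newton steps for the Poisson MLE
    m = np.exp(X @ beta)
    beta += np.linalg.solve(X.T @ (m[:, None] * X), X.T @ (counts - m))

# Fitted log-density slope l-hat'(z) = beta1 + 2*beta2*z is the Bayes correction
z0 = np.array([-2.0, 0.0, 2.0])
tweedie = z0 + sigma2 * (beta[1] + 2.0 * beta[2] * z0)
print(tweedie)  # close to the Bayes rule 0.8 * z0
```

With the quadratic log-density the correction is linear in z, which is exactly the sense in which Lindsey-plus-Tweedie with J=2 reproduces James-Stein shrinkage.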

Key Findings

  • The Bayes correction sigma^2 * l’(z) is negative in the right tail and positive in the left, so it always pulls extreme observations toward the center, correcting selection bias
  • Empirical Bayes information I(z_0) = 1/c(z_0) measures information per “other” observation, with regret ~ 1/(N*I(z_0))
  • The James-Stein estimator is approximately Tweedie’s formula with a 2-parameter log-density model
  • The formula extends to handle variable sigma^2 (Theorem 7.1), showing the posterior ratio g(mu|z_0)/g_0(mu|z_0) depends on the variance ratio lambda_mu = sigma_0/sigma_mu
  • Log-concavity of f(z) implies l’’(z) <= 0, so the posterior variance sigma^2(1 + sigma^2 * l’’(z)) is less than sigma^2, providing shrinkage
  • Connection to false discovery rates: -d/dz log(fdr(z)) = l’(z) - l_0’(z) = E{eta|z}
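The fdr identity in the last bullet can be checked numerically in a two-groups model with sigma^2 = 1 (so eta = mu); the mixture weights and the alternative mean of 2 below are illustrative assumptions:

```python
import numpy as np

# Sketch: check -d/dz log(fdr(z)) = l'(z) - l_0'(z) = E{eta|z} for the two-groups
# mixture f(z) = pi0*N(0,1) + pi1*N(2,1), where fdr(z) = pi0*f_0(z)/f(z) is the
# local false discovery rate. Weights and alternative mean are illustrative.
npdf = lambda t, m=0.0: np.exp(-0.5 * (t - m) ** 2) / np.sqrt(2.0 * np.pi)

pi0, pi1 = 0.9, 0.1
f = lambda t: pi0 * npdf(t) + pi1 * npdf(t, 2.0)    # marginal density
fdr = lambda t: pi0 * npdf(t) / f(t)                # local fdr

z, h = np.linspace(-2.0, 4.0, 121), 1e-5
lhs = -(np.log(fdr(z + h)) - np.log(fdr(z - h))) / (2 * h)  # -d/dz log fdr(z)
rhs = 2.0 * pi1 * npdf(z, 2.0) / f(z)                       # E{mu|z} by Bayes' rule

print(np.max(np.abs(lhs - rhs)))  # agrees up to finite-difference error
```

The right-hand side is just 2 * P(mu = 2 | z), so the slope of the log-fdr curve directly reads off the posterior mean of the effect size.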

Important References

  1. An Empirical Bayes Approach to Statistics — Robbins (1956), the origin of Tweedie’s formula and of the empirical Bayes framework
  2. Estimation of the Mean of a Multivariate Normal Distribution — Stein (1981), which develops the unbiased risk estimation underlying shrinkage estimators
  3. Controlling the False Discovery Rate — Benjamini & Hochberg (1995), the FDR procedure connected to Tweedie’s formula in Section 6

Atomic Notes


paper