Abstract
Many medical AI imaging applications rely on segmentation-classification pipelines. In practice, however, coarse or imprecise segmentation masks often admit spurious background cues, and the model learns associations between these cues and the regions of interest (organs, lesions, etc.). Such shortcuts can severely degrade accuracy on the downstream task. Prior art relies on fine-grained pixel-level annotations to separate background cues from pathology regions; unfortunately, high-quality annotations are expensive. We aim to ensure high performance on downstream tasks without relying on expensive annotations. The key novelty of our approach lies in using machine unlearning to mitigate shortcut learning. In contrast to heuristic, empirically validated methods, we propose a principled approach that, for the first time, provides guarantees for shortcut unlearning. We develop a definition of certified unlearning for segmentation tasks and formally connect it to its canonical definition. We then present a novel formal framework that, unlike related research, does not rely on unrealistic assumptions. Further, we introduce global information reduction as a stable, task-relevant metric to guide the removal of shortcut learning from pre-trained models. Finally, we translate our formal framework directly into a practical pipeline, which we showcase in two complex medical settings.
Summary
This paper establishes the first formal connection between segmentation mask refinement and certified machine unlearning. The core insight is that correcting a coarse (dilated) segmentation mask to a finer one is set-theoretically isomorphic to “forgetting” the spurious pixels introduced by dilation — the dilation artefacts that cause models to learn shortcuts from background features (e.g., surgical rulers, ink markings, gel bubbles in dermoscopy images).
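The mask-correction-as-forgetting correspondence can be illustrated with a toy example (variable names are mine, and pixel indices stand in for training samples): the forget set is exactly the set difference between the coarse (dilated) mask and the fine mask.

```python
# Toy illustration of the mask-refinement / unlearning correspondence.
# Masks are sets of flat foreground-pixel indices; names are illustrative,
# not taken from the paper.

coarse_mask = {3, 4, 5, 8, 9, 10, 13, 14, 15}   # dilated, over-inclusive mask
fine_mask = {4, 9, 14}                           # precise pathology pixels

# Dilation artefacts: pixels the coarse mask wrongly labels as foreground.
# These are the "samples" that certified unlearning must forget.
forget_set = coarse_mask - fine_mask

# Retain set: pixels correctly labelled under both masks.
retain_set = coarse_mask & fine_mask

# Sanity checks: refinement partitions the coarse mask, and nothing
# in the forget set survives into the fine mask.
assert forget_set | retain_set == coarse_mask
assert forget_set & fine_mask == set()
```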
Building on this isomorphism, the authors define Certified Pixel-Level Unlearning, which projects standard (epsilon, delta)-indistinguishability onto the conditional probability space of pixel-wise predictions. They introduce Global Spurious Mutual Information (S_global) as a rigorous metric that captures worst-case information leakage from spurious features into predictions, and prove that a certified unlearning operator strictly upper-bounds S_global up to additive certification error O(epsilon) + O(delta log(1/delta)).
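In the summary's notation, the underlying indistinguishability condition can be sketched as follows. This is a reconstruction from the standard certified-unlearning definition (Sekhari et al.-style), not the paper's exact statement:

```latex
% U: unlearning operator, A: learning algorithm, D: training set,
% D_f: forget set (dilation artefacts), S: any measurable event.
\Pr\!\left[\, U(A(D), D_f) \in S \,\right]
  \;\le\; e^{\varepsilon}\,\Pr\!\left[\, A(D \setminus D_f) \in S \,\right] + \delta,
\qquad \text{and symmetrically with the two sides swapped.}

% Claimed consequence, up to the additive certification error:
S_{\mathrm{global}} \;\le\; O(\varepsilon) + O\!\left(\delta \log \tfrac{1}{\delta}\right).
```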
Empirically, the framework is validated on a synthetic dataset and the ISIC 2018 melanoma detection benchmark, using gradient clipping and model clipping from Koloskova et al. (2025) alongside NegGrad-Seg. Results show that unlearning with only 10% fine-grained labels matches or exceeds retrain-from-scratch performance, while being far more sample-efficient.
Key Contributions
- Unlearning Isomorphism: Proves correcting dilated masks is set-theoretically isomorphic to forgetting specific training samples (the dilation artefacts), enabling direct application of certified unlearning theory to segmentation refinement
- Certified Pixel-Level Unlearning: New definition adapting (epsilon, delta)-indistinguishability to the conditional pixel-wise output space, preventing vacuous solutions common in standard definitions
- Global Spurious Mutual Information (S_global): Task-relevant metric quantifying worst-case shortcut reliance; proven to be strictly upper-bounded by certified unlearning
- Formal guarantees without unrealistic assumptions: Unlike Saab et al. (2022), does not require conditional independence of spurious features given the mask (shown to be violated in Appendix B)
- Sample efficiency: Achieves competitive debiasing with only 10% fine-grained annotations
Methodology
The pipeline operates in three stages:
- Pre-training: Train segmentation model on coarse (bounding box) annotations
- Unlearning: Apply certified unlearning operators to “forget” the dilation artefacts (pixels incorrectly labelled as foreground by coarse masks), using a small set of fine-grained masks to define the forget set D_f
- Evaluation: Measure downstream classification performance via average pixel pooling
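The evaluation step ("average pixel pooling") can be sketched as collapsing a pixel-wise foreground-probability map into a single image-level score; the function name and map values below are mine, not the paper's:

```python
def average_pixel_pooling(prob_map):
    """Reduce a pixel-wise foreground-probability map (list of rows)
    to one image-level classification score by averaging all pixels."""
    pixels = [p for row in prob_map for p in row]
    return sum(pixels) / len(pixels)

# A 2x3 probability map where the model is confident only on a small region;
# the pooled score feeds the downstream classification evaluation.
score = average_pixel_pooling([[0.9, 0.1, 0.0],
                               [0.8, 0.2, 0.0]])
```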
Two unlearning algorithms are adapted:
- gradient clipping and model clipping from Koloskova et al. (2025) — provide (epsilon, delta)-certified guarantees
- NegGrad-Seg — adapted from NegGrad+, not certified but preserves model utility
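The two operators differ mainly in the update they apply. Below is a minimal one-parameter sketch of a NegGrad-style step (descend on the retain loss, ascend on the forget loss) combined with gradient clipping of the kind the certified operators use; the toy quadratic losses, clipping threshold, and weighting are illustrative assumptions, not the paper's exact algorithm:

```python
def clip(g, c):
    """Clip a scalar gradient to magnitude c (gradient clipping is one
    ingredient of the certified operators; here reduced to 1-D)."""
    return max(-c, min(c, g))

def neggrad_step(w, grad_retain, grad_forget, lr=0.1, alpha=0.5, c=1.0):
    """One NegGrad-style unlearning step: move down the retain-loss
    gradient and up the forget-loss gradient, both clipped."""
    g = clip(grad_retain, c) - alpha * clip(grad_forget, c)
    return w - lr * g

# Toy quadratics: the retain loss pulls w toward 2.0 (task signal),
# the forget loss pulls toward -1.0 (shortcut signal to be unlearned).
w = 0.0
for _ in range(200):
    w = neggrad_step(w, grad_retain=2 * (w - 2.0), grad_forget=2 * (w + 1.0))
# w settles where the clipped retain gradient balances the scaled,
# clipped forget gradient, away from the shortcut optimum.
```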
Transfer learning (public frozen encoder + small trainable decoder) reduces the parameter space requiring certification.
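The practical effect of the frozen-encoder design is simply to shrink the set of parameters the certificate must cover. A schematic parameter count (layer shapes entirely made up for illustration):

```python
# Hypothetical conv-layer shapes (out, in, kh, kw). Only decoder
# parameters are trainable, so only they enter the certified analysis.
encoder_shapes = [(64, 3, 3, 3), (128, 64, 3, 3), (256, 128, 3, 3)]  # frozen
decoder_shapes = [(64, 256, 1, 1), (1, 64, 1, 1)]                    # trainable

def n_params(shapes):
    """Total number of parameters across a list of tensor shapes."""
    total = 0
    for shape in shapes:
        n = 1
        for d in shape:
            n *= d
        total += n
    return total

frozen = n_params(encoder_shapes)
trainable = n_params(decoder_shapes)
# Certification cost now scales with `trainable`, a small fraction
# of the full model's parameter count.
```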
Key Findings
- Models trained on coarse masks exhibit strong shortcut dependence (AUROC drops on images with spurious features)
- Unlearning with 10% fine-grained labels yields best improvements across most spurious feature categories
- Two-phase unlearning trajectory: initial performance degradation (penalising shortcut weights) followed by rapid recovery (relearning from retain set)
- Certified operators (Koloskova et al.) achieve stable improvements with low variance (std ≤ 0.022) vs. retrain-from-scratch (std = 0.042)
- NegGrad-Seg, while not certified, does not destroy model utility and induces shortcut unlearning
- Model collapse observed when trying to certify full-scale models (melanoma detection) — a key limitation motivating future work on robust certified unlearning for non-convex losses
Important References
- Certified Unlearning for Neural Networks — provides the (epsilon, delta)-certified unlearning algorithms (model clipping, gradient clipping) used in this work
- Reducing Reliance on Spurious Features in Medical Image Classification with Spatial Specificity — prior work on spatial specificity for shortcut reduction; this paper relaxes its conditional independence assumption
- Remember What You Want to Forget: Algorithms for Machine Unlearning — foundational definition of certified unlearning that this work extends to the pixel level
Atomic Notes
- certified pixel-level unlearning
- global spurious mutual information
- shortcut learning
- NegGrad-Seg
- unlearning isomorphism