Abstract
Data-driven models are now deployed in a plethora of real-world applications, including automated diagnosis, but models learned from data risk absorbing biases from that same data. When models learn the correct features, they are robust, generalizing to uncontrolled situations in the real world. Biases in the training set destroy that robustness, because models learn spurious correlations that will not be found (at least not reliably) in real-world situations. Deploying such models for critical tasks, such as medical decisions, can be catastrophic. In this work we address this issue for skin-lesion classification models, with two objectives: identifying which spurious correlations biased networks exploit, and debiasing the models by removing those spurious correlations from them. We perform a systematic, integrated analysis of 7 visual artefacts (possible sources of bias exploitable by networks), employ a state-of-the-art technique to prevent the models from learning spurious correlations, and propose datasets to test models for the presence of bias. We find that, despite interesting results pointing to promising future research, current debiasing methods are not ready to solve the bias issue for skin-lesion models.
Summary
This paper presents what was, at the time of its publication, the most comprehensive study of bias in skin lesion classification datasets. The authors systematically investigate two questions: (1) what spurious correlations do deep neural networks exploit in dermoscopy images, and (2) can existing debiasing methods effectively remove these biases? The work is motivated by prior findings from Bissoto et al. (2019) showing that networks can achieve expert-level performance on skin lesion classification even when up to 70% of the lesion is occluded by a bounding box, indicating heavy reliance on background features.
The authors manually annotate the presence of 7 visual artefacts — dark corners (vignetting), hair, gel borders, gel bubbles, rulers, ink markings/staining, and patches — across 2,594 images from ISIC 2018 Tasks 1 and 2 and 872 images from the Interactive Atlas of Dermoscopy. Correlation analysis reveals that individual artefact-to-label correlations are modest, suggesting that models combine multiple weak correlations into cumulative bias rather than relying on any single strong shortcut. Separate binary classifiers trained to detect each artefact achieve high AUC (80-98%), confirming that networks can easily identify these features even on heavily disturbed images.
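For binary annotations like these, artefact-to-label correlation can be measured with the phi (Matthews) coefficient. The sketch below is illustrative only, assuming binary presence/absence vectors; it is not the authors' exact correlation analysis:

```python
import numpy as np

def phi_coefficient(a, b):
    """Phi (Matthews) correlation between two binary vectors,
    e.g. artefact presence vs. a binarized diagnosis label."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    n11 = np.sum(a & b)    # artefact present, label positive
    n10 = np.sum(a & ~b)   # artefact present, label negative
    n01 = np.sum(~a & b)   # artefact absent, label positive
    n00 = np.sum(~a & ~b)  # artefact absent, label negative
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom
```

Values near 0 correspond to the weak individual correlations the paper reports, while the concern is that several weak signals can combine into a strong cumulative bias.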
For debiasing, the authors apply Learning Not To Learn (LNTL), the state-of-the-art bias removal method at the time. LNTL uses a feature extractor feeding both a main task classifier and bias classification heads, with reversed gradients from the bias heads to discourage the feature extractor from encoding bias-related information. The results are sobering: LNTL achieves only marginal improvements on artificially constructed “trap sets” (where artefact-label correlations are amplified and reversed between train and test), and fails to substantially improve generalization on cross-dataset evaluation (ISIC to Atlas). The authors conclude that artefacts are deeply entangled with diagnostic features in the learned representations, making gradient-reversal approaches insufficient.
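The gradient-reversal mechanism at the heart of LNTL can be sketched as a custom autograd function: the forward pass is the identity, while the backward pass negates and scales the gradient flowing from the bias heads into the feature extractor. This is a minimal PyTorch illustration, not the authors' implementation; the 0.3 factor mirrors the reversal factor reported in the Methodology section:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the incoming
    gradient by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed, scaled gradient for x; no gradient for lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=0.3):
    """Insert between the shared feature extractor and a bias head."""
    return GradReverse.apply(x, lam)
```

Placed before each of the 7 bias heads, this layer penalizes the feature extractor whenever its representation makes an artefact easy to predict, which is precisely the pressure LNTL applies.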
Key Contributions
- Systematic artefact annotation: Manual annotation of 7 visual artefact types across two major skin lesion datasets (ISIC 2018 and Atlas), providing a foundation for bias research in dermoscopy
- Normalized-background dataset: Proposes replacing background pixels with the pixel-average training image to isolate background influence; shows models trained on disturbed images (Bbox, Bbox70) are most background-dependent
- Trap sets: Constructs controlled benchmarks where artefact-label correlations are amplified and reversed between train/test splits, forcing biased models to fail
- Negative result on debiasing: Demonstrates that LNTL, the state-of-the-art debiasing method, is insufficient for skin lesion analysis, motivating development of stronger approaches
- Survey of debiasing strategies for skin lesion analysis: Comprehensive review of the landscape of bias detection and bias removal methods applicable to medical dermoscopy
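The trap-set idea from the contributions above can be sketched as an index-splitting routine: training examples are drawn from cases where the artefact agrees with the label (amplifying the spurious correlation), while test examples are drawn from cases where it disagrees (reversing it), so a shortcut-reliant model is actively misled. This is an illustrative simplification under binary labels, not the authors' exact selection procedure:

```python
import numpy as np

def trap_split(artefact, label, seed=0):
    """Return (train_idx, test_idx) such that the artefact-label
    correlation is positive on train and negative on test."""
    rng = np.random.default_rng(seed)
    artefact = np.asarray(artefact, bool)
    label = np.asarray(label, bool)
    agree = np.flatnonzero(artefact == label)     # artefact co-occurs with the label
    disagree = np.flatnonzero(artefact != label)  # artefact contradicts the label
    rng.shuffle(agree)
    rng.shuffle(disagree)
    # Train on agreeing samples: the shortcut "works" during training.
    # Test on disagreeing samples: the shortcut now points the wrong way.
    return agree, disagree
```

A model that ignores the artefact is unaffected by this split, while a biased model's test performance collapses, which is what makes trap sets useful as a bias probe.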
Methodology
- Datasets: ISIC 2018 Task 1 and 2 (2,594 dermoscopic images, 3 classes: melanoma, nevus, seborrheic keratosis) and Interactive Atlas of Dermoscopy (872 dermoscopic + 839 clinical images)
- 7 artefact types: Dark corners, hair, gel borders, gel bubbles, rulers, ink markings, patches
- Normalized background: Replace background with pixel-average training set image (using segmentation masks) to measure background dependence
- Debiasing via LNTL (Kim et al., 2019): Feature extractor (first two ResNet blocks) + main classifier + 7 bias classification heads with gradient reversal (factor 0.3)
- Architectures: InceptionV4 (for background experiments), ResNet18 and ResNet152 (for LNTL)
- Evaluation: AUC on standard splits, cross-dataset (ISIC to Atlas), trap sets, normalized-background variants
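The normalized-background step in the Methodology can be sketched as a masked pixel replacement: background pixels are swapped for the training-set pixel-average image while lesion pixels (per the segmentation mask) are kept. A minimal numpy sketch, assuming HxWx3 float images and a binary HxW mask with 1 marking the lesion:

```python
import numpy as np

def normalize_background(image, mask, mean_image):
    """Keep lesion pixels (mask == 1) and replace background pixels
    (mask == 0) with the pixel-average image of the training set."""
    mask3 = np.asarray(mask, bool)[..., None]  # broadcast over channels
    return np.where(mask3, image, mean_image)
```

Comparing a model's AUC on original versus normalized-background images isolates how much of its decision rides on the background, which is how the paper identifies the Bbox and Bbox70 variants as the most background-dependent.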
Key Findings
- Individual artefact-label correlations are weak, but models can combine multiple weak correlations into strong cumulative bias
- Networks detect artefacts with high accuracy (80-98% AUC) even on heavily occluded images (Bbox90)
- Background information is critical: Bbox and Bbox70 variants suffer 10-11% AUC drops compared to 4-5% for Traditional and Skin Only
- LNTL fails to effectively debias: on trap sets, best performance is 62.4% (Normalized/ResNet18) vs. 52.6% baseline (Unchanged/InceptionV4)
- Cross-dataset generalization (ISIC to Atlas clinical images) is the most challenging scenario, where LNTL shows its best relative improvement (70.1% vs. 63.4%)
- Artefacts are entangled with diagnostic features in the network’s learned representations, making gradient-reversal insufficient
- The authors suggest future work should focus on feature-space disentanglement and diverse multi-source datasets
Important References
- Reducing Reliance on Spurious Features in Medical Image Classification with Spatial Specificity — later work addressing the same problem via spatial specificity
- A Case for Reframing Automated Medical Image Classification as Segmentation — uses ISIC dataset with artefact biases characterised in this paper
- Towards Certified Shortcut Unlearning in Medical Imaging — uses ISIC 2018 benchmark with artefact annotations from this work