Summary

This paper presents what was, at the time of publication, the most comprehensive study of bias in skin lesion classification datasets. The authors systematically investigate two questions: (1) what spurious correlations do deep neural networks exploit in dermoscopy images, and (2) can existing debiasing methods effectively remove these biases? The work is motivated by prior findings from Bissoto et al. (2019) showing that networks can achieve expert-level performance on skin lesion classification even when up to 70% of the lesion is occluded by a bounding box, indicating heavy reliance on background features.

The authors manually annotate the presence of 7 visual artefacts — dark corners (vignetting), hair, gel borders, gel bubbles, rulers, ink markings/staining, and patches — across 2,594 images from ISIC 2018 Tasks 1 and 2 and 872 images from the Interactive Atlas of Dermoscopy. Correlation analysis reveals that individual artefact-to-label correlations are modest, suggesting that models combine multiple weak correlations into cumulative bias rather than relying on any single strong shortcut. Separate binary classifiers trained to detect each artefact achieve high AUC (80-98%), confirming that networks can easily identify these features even on heavily disturbed images.
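For binary artefact annotations and binary labels, the kind of artefact-to-label correlation analysis described above reduces to the phi coefficient (Pearson correlation for 0/1 variables). A minimal sketch, using made-up annotations rather than the paper's actual data:

```python
from math import sqrt

def phi_coefficient(artefact, label):
    """Phi coefficient between two binary vectors (equivalent to
    Pearson correlation when both variables are 0/1)."""
    n = len(artefact)
    n11 = sum(1 for a, y in zip(artefact, label) if a == 1 and y == 1)
    n10 = sum(1 for a, y in zip(artefact, label) if a == 1 and y == 0)
    n01 = sum(1 for a, y in zip(artefact, label) if a == 0 and y == 1)
    n00 = n - n11 - n10 - n01
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical annotations: 1 = artefact present / melanoma, 0 = absent / benign
dark_corners = [1, 0, 1, 1, 0, 0, 1, 0]
labels       = [1, 0, 1, 0, 0, 1, 1, 0]
print(round(phi_coefficient(dark_corners, labels), 3))  # → 0.5
```

A "modest" correlation in the paper's sense corresponds to values well below 1.0 for every individual artefact, which is why the cumulative-bias interpretation matters.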

For debiasing, the authors apply Learning Not To Learn (LNTL), the state-of-the-art bias removal method at the time. LNTL uses a feature extractor feeding both a main task classifier and bias classification heads, with reversed gradients from the bias heads to discourage the feature extractor from encoding bias-related information. The results are sobering: LNTL achieves only marginal improvements on artificially constructed “trap sets” (where artefact-label correlations are amplified and reversed between train and test), and fails to substantially improve generalization on cross-dataset evaluation (ISIC to Atlas). The authors conclude that artefacts are deeply entangled with diagnostic features in the learned representations, making gradient-reversal approaches insufficient.
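The core of the gradient-reversal mechanism can be sketched with scalar gradients: the reversal layer is the identity in the forward pass and multiplies incoming gradients by a negative factor in the backward pass. The numbers below are illustrative; only the reversal factor of 0.3 comes from the paper.

```python
# Minimal sketch of LNTL-style gradient reversal on scalar "features".
# Gradient values are made up; the paper applies this to a ResNet
# feature extractor with a reversal factor of 0.3.

LAMBDA = 0.3  # gradient-reversal scaling factor (from the paper)

def grad_reverse_backward(grad_from_bias_head, lam=LAMBDA):
    """Backward pass of a gradient-reversal layer: identity in the
    forward pass, multiply incoming gradients by -lam going backward."""
    return -lam * grad_from_bias_head

# Suppose backprop yields these gradients w.r.t. the shared features:
grad_main = 0.8   # from the diagnosis classifier
grad_bias = 0.5   # from one artefact (bias) head, before reversal

# The feature extractor is updated with the main-task gradient plus the
# reversed bias gradient: it stays predictive of the diagnosis while
# being pushed to become uninformative about the artefact.
total_grad = grad_main + grad_reverse_backward(grad_bias)
print(round(total_grad, 4))  # → 0.65
```

In a real implementation this is typically a custom autograd function so the reversal is transparent to the rest of the network; the paper attaches one such bias head per annotated artefact.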

Key Contributions

  • Systematic artefact annotation: Manual annotation of 7 visual artefact types across two major skin lesion datasets (ISIC 2018 and Atlas), providing a foundation for bias research in dermoscopy
  • Normalized-background dataset: Proposes replacing background pixels with the pixel-average training image to isolate background influence; shows models trained on disturbed images (Bbox, Bbox70) are most background-dependent
  • Trap sets: Constructs controlled benchmarks where artefact-label correlations are amplified and reversed between train/test splits, forcing biased models to fail
  • Negative result on debiasing: Demonstrates that LNTL, the state-of-the-art debiasing method, is insufficient for skin lesion analysis, motivating development of stronger approaches
  • Survey of debiasing strategies: Comprehensive analysis of the landscape of bias detection and removal for skin lesion analysis in dermoscopy
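The trap-set idea from the contributions above can be illustrated with a toy split: amplify the artefact-label correlation in train and reverse it in test, so a shortcut-reliant model is forced to fail. This is a simplified sketch, not the paper's exact split-construction procedure.

```python
import random

def make_trap_split(samples, seed=0):
    """Toy trap-set construction. `samples` is a list of
    (artefact_present, label) pairs with binary values. Train keeps
    images where artefact and label agree (amplified correlation);
    test keeps images where they disagree (reversed correlation)."""
    rng = random.Random(seed)
    train = [s for s in samples if s[0] == s[1]]  # artefact predicts label
    test = [s for s in samples if s[0] != s[1]]   # artefact anti-predicts label
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

data = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0), (1, 0), (0, 1)]
train, test = make_trap_split(data)
# In train the artefact perfectly predicts the label; in test it is
# perfectly anti-correlated, trapping any model that uses the shortcut.
print(all(a == y for a, y in train), all(a != y for a, y in test))
```

A model that learned "artefact present ⇒ melanoma" on such a train split would score below chance on the test split, which is exactly the failure mode the trap sets are designed to expose.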

Methodology

  • Datasets: ISIC 2018 Task 1 and 2 (2,594 dermoscopic images, 3 classes: melanoma, nevus, seborrheic keratosis) and Interactive Atlas of Dermoscopy (872 dermoscopic + 839 clinical images)
  • 7 artefact types: Dark corners, hair, gel borders, gel bubbles, rulers, ink markings, patches
  • Normalized background: Replace background with pixel-average training set image (using segmentation masks) to measure background dependence
  • Debiasing via LNTL (Kim et al., 2019): Feature extractor (first two ResNet blocks) + main classifier + 7 bias classification heads with gradient reversal (factor 0.3)
  • Architectures: InceptionV4 (for background experiments), ResNet18 and ResNet152 (for LNTL)
  • Evaluation: AUC on standard splits, cross-dataset (ISIC to Atlas), trap sets, normalized-background variants
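The normalized-background variant listed above can be sketched as a simple masked replacement: background pixels (mask == 0) are swapped for the per-pixel mean of the training images. The tiny 2x2 "images" below are illustrative; the real pipeline operates on full dermoscopic images with their segmentation masks.

```python
def normalize_background(images, masks):
    """Replace background pixels (mask == 0) with the per-pixel mean of
    the training images, isolating the lesion from its context.
    `images` and `masks` are same-shape 2-D lists; a sketch of the
    idea, not the paper's exact preprocessing code."""
    h, w = len(images[0]), len(images[0][0])
    n = len(images)
    # Per-pixel average over the training set.
    mean = [[sum(img[i][j] for img in images) / n for j in range(w)]
            for i in range(h)]
    return [[[img[i][j] if mask[i][j] else mean[i][j]
              for j in range(w)] for i in range(h)]
            for img, mask in zip(images, masks)]

imgs = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]
masks = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
norm = normalize_background(imgs, masks)
print(norm[0])  # lesion pixels kept, background pixels set to the mean
```

Because every image then shares an identical background, any remaining performance gap between dataset variants can be attributed to lesion (or artefact) content rather than background context.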

Key Findings

  • Individual artefact-label correlations are weak, but models can combine multiple weak correlations into strong cumulative bias
  • Networks detect artefacts with high accuracy (80-98% AUC) even on heavily occluded images (Bbox90)
  • Background information is critical: Bbox and Bbox70 variants suffer 10-11% AUC drops compared to 4-5% for Traditional and Skin Only
  • LNTL fails to effectively debias: on trap sets, best performance is 62.4% (Normalized/ResNet18) vs. 52.6% baseline (Unchanged/InceptionV4)
  • Cross-dataset generalization (ISIC to Atlas clinical images) is the most challenging scenario, where LNTL shows its best relative improvement (70.1% vs. 63.4%)
  • Artefacts are entangled with diagnostic features in the network’s learned representations, making gradient-reversal insufficient
  • The authors suggest future work should focus on feature-space disentanglement and diverse multi-source datasets

Important References

Atomic Notes


paper