Robust AUROC is an evaluation metric used in Saab et al. (2022) and Hooper et al. (2023) to measure a model’s performance specifically on subgroups where a known spurious correlation is absent. Standard AUROC is computed over all samples, so it can be inflated when a model exploits a spurious correlation that holds for most of them. Robust AUROC instead isolates the clinically critical subgroup, revealing whether the model has actually learned the target pathology.
Key Details
- For pneumothorax classification (CANDID dataset): robust AUROC is computed on the subgroup where the spurious correlation (pneumothorax co-occurring with a chest tube) does not hold, that is, patients with pneumothorax but no chest tube and patients without pneumothorax who have a chest tube
- For melanoma classification (ISIC dataset): robust AUROC evaluates performance on images where known artefacts (rulers, ink markings, dark corners) are absent
- A model exploiting the chest tube shortcut will tend to have a high standard AUROC but a low robust AUROC
- Hooper et al. (2023) reported that segmentation-for-classification achieves a robust AUROC of 0.84 vs. 0.58 for standard classification on the CANDID no-chest-tube subgroup, a relative improvement of 44.8% ((0.84 − 0.58) / 0.58)
- Robust AUROC is related to the broader concept of worst-group performance used in distributionally robust optimisation
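
The pneumothorax case above can be illustrated with a minimal sketch. It is not the evaluation code of Saab et al. or Hooper et al.: the function names `auroc` and `robust_auroc`, the `has_tube` flag, and the toy data are all hypothetical, and AUROC is computed with a simple pairwise (Mann-Whitney) count rather than a library routine. The toy "model" scores images purely by chest tube presence, so it mimics a classifier that has learned the shortcut.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) AUROC: fraction of positive/negative pairs
    in which the positive sample gets the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def robust_auroc(labels, scores, has_tube):
    """AUROC restricted to samples where the shortcut does not hold:
    pneumothorax without a tube, or no pneumothorax with a tube.
    (Assumed subgroup definition, following the CANDID bullet above.)"""
    keep = [i for i, (y, t) in enumerate(zip(labels, has_tube)) if y != t]
    return auroc([labels[i] for i in keep], [scores[i] for i in keep])


# Toy data (hypothetical): 1 = pneumothorax / chest tube present.
labels   = [1, 1, 1, 0, 0, 0, 1, 0]
has_tube = [1, 1, 1, 0, 0, 0, 0, 1]
# A shortcut "model" that scores high whenever a chest tube is visible.
scores   = [0.9 if t else 0.1 for t in has_tube]

standard = auroc(labels, scores)
robust = robust_auroc(labels, scores, has_tube)
print(f"standard AUROC = {standard:.2f}")  # 0.75
print(f"robust AUROC   = {robust:.2f}")    # 0.00
```

On this toy data the shortcut model looks reasonable under standard AUROC but ranks every sample in the robust subgroup the wrong way round, which is the gap the metric is meant to expose. In practice a library routine such as scikit-learn's `roc_auc_score` would likely replace the pairwise loop, applied to the same subgroup mask.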