Skull-stripping induces shortcut learning in MRI-based Alzheimer’s disease classification

Abstract

Objectives

High classification accuracy of Alzheimer’s disease (AD) from structural MRI has been achieved using deep neural networks, yet the specific image features contributing to these decisions remain unclear. In this study, the contributions of T1-weighted (T1w) gray-white matter texture, volumetric information, and preprocessing—particularly skull-stripping—were systematically assessed.

Materials and methods

A dataset of 990 matched T1w MRIs from AD patients and cognitively normal controls from the ADNI database was used. Preprocessing was varied through skull-stripping and intensity binarization to isolate texture and shape contributions. A 3D convolutional neural network was trained on each configuration, and classification performance was compared using exact McNemar tests with discrete Bonferroni-Holm correction. Feature relevance was analyzed using Layer-wise Relevance Propagation, image similarity metrics, and spectral clustering of relevance maps.

Results

Despite substantial differences in image content, classification accuracy, sensitivity, and specificity remained stable across preprocessing conditions. Models trained on binarized images preserved performance, indicating minimal reliance on gray-white matter texture. Instead, volumetric features—particularly brain contours introduced through skull-stripping—were consistently used by the models.

Conclusion

This behavior reflects a shortcut learning phenomenon, where preprocessing artifacts act as potentially unintended cues. The resulting Clever Hans effect emphasizes the critical importance of interpretability tools to reveal hidden biases and to ensure robust and trustworthy deep learning in medical imaging.

Critical relevance statement

We investigated the mechanisms underlying deep learning-based disease classification using a widely utilized Alzheimer’s disease dataset, and our findings reveal a reliance on features induced through skull-stripping, highlighting the need for careful preprocessing to ensure clinically relevant and interpretable models.

Key Points

Shortcut learning is induced by skull-stripping applied to T1-weighted MRIs.

Explainable deep learning and spectral clustering estimate the bias.

Highlights the importance of understanding the dataset, image preprocessing and deep learning model, for interpretation and validation.

Full text

Translate

Turn on search term navigation

Introduction

Alzheimer’s disease (AD) is the most common form of dementia, accounting for 60–70% of cases [1], and with over 55 million people worldwide living with some form of dementia, it poses a substantial burden on healthcare systems, caregivers, and families [2]. However, in vivo diagnosis remains challenging due to the overlap of clinical symptoms with other conditions, resulting in relatively low diagnostic accuracy (71–87% sensitivity and 44–71% specificity) [3].

In addition to clinical and neuropsychological assessments, medical imaging is employed to improve diagnostic accuracy. Positron emission tomography (PET) imaging with amyloid and tau protein ligands, combined with magnetic resonance imaging (MRI), has become a valuable tool in AD diagnosis [4]. Yet, AD is characterized by a prolonged prodromal and asymptomatic inflammatory phase, during which PET imaging is unsuitable for predicting disease onset in healthy populations. Since pathological changes in AD begin decades before clinical symptoms appear, MRI holds promise for identifying early biomarkers in a broad population. Currently, brain volumetry [5], particularly hippocampal atrophy [6], is widely used as an imaging marker for differential diagnosis and in interventional studies.

In recent years, convolutional neural networks (CNNs) have emerged as the state-of-the-art method for AD classification using structural T1-weighted (T1w) MRI scans [7]. These networks learn image features directly during the training process, eliminating the need for manual feature selection. Despite their advantages, CNN models and the features they extract are often difficult for humans to interpret, earning them the reputation of being “black boxes” [8].

To address this issue, interpretability methods like heatmapping have been introduced [9]. One notable technique is layer-wise relevance propagation (LRP) [10], which highlights input features driving model decisions. The importance of such tools is illustrated by cases like [11], where a model identified horses based on watermarks rather than the animals—an instance of the Clever Hans effect [12, 13]. This term, drawn from a horse once thought to perform arithmetic but later shown to respond to subtle cues [14], exemplifies shortcut learning: the exploitation of spurious correlations over meaningful patterns [15].

In AD classification, interpretability research has shown that preprocessing steps can shape both performance and learned features [16]. We hypothesize that skull-stripping, a common step, may introduce unintended cues and lead models to overlook more relevant AD-specific markers, such as structural atrophy and gray-white matter contrast changes [17]. In this study, we combined deep learning and heatmapping explainability techniques to evaluate the performance and learned features of CNNs trained on different input configurations. Specifically, we trained identical CNN architectures on full T1w images from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, their skull-stripped counterparts, and three differently binarized versions of these two preprocessing approaches, creating a total of eight model configurations. We first analyzed performance metrics for significant differences, assessed structural similarities between LRP-based heatmaps, and investigated the spatial distribution of heatmap relevance to examine whether preprocessing introduced unintended cues [18].

Materials and methods

Imaging data

Using MR image metadata from the ADNI database (https://adni.loni.usc.edu), we created a subset of images with clinically available and consistent properties. The final search criteria were: Phase = ADNI 2, Acquisition Plane = SAGITTAL, Field Strength = 3.0 Tesla, Pixel Spacing XY = [1.0 mm, 1.1 mm], Slice Thickness = 1.2 mm, and Weighting = T1. These criteria resulted in 1042 images for the AD group and 2227 images for the normal control (NC) group. This ensured that patients and controls were scanned using a consistent MRI protocol at 3 Tesla across multiple scanner vendors and sites. Supplementary Tables S1.1 and S1.2 in Supplementary Material 1 provide an overview of the distribution of images across sites, vendors, imaging protocols, and research groups.

Research groups

We retrospectively selected 990 MR images from 159 patients with AD and 990 MR images from 201 NCs that were propensity-logit-matched using age and sex as covariates from the preselected image data subset [19]. Supplementary Fig. 1.1 details effect sizes before and after matching. Table 1 shows the demographics of the selection. Data were split into training, validation, and test sets (ratio 70:15:15), ensuring that all scans from a given individual were included in the same set. To maintain consistent class distribution, the final sets were created by combining data from both cohorts.

Table 1. Summary of subject demographics at baseline for ADNI

	Subjects	Images	Age	Gender	MMSE	CDR	APOE	Education
NC	201	990	75.1 ± 7.1 [56.3, 95.8]	102 M/99 F	28.9 ± 1.2 [24.0, 30.0] n/a: 16	0.0: 171; 0.5: 13; n/a: 17	ε2/ε2: 1; ε2/ε3: 20; ε2/ε4: 1; ε3/ε3: 112; ε3/ε4: 49; ε4/ε4: 5; n/a: 13	16.7 ± 2.5 [12.0, 20.0] n/a: 0
AD	159	990	75.3 ± 7.9 [55.7, 91.5]	91 M/68 F	22.0 ± 3.8 [4.0, 30.0] n/a: 48	0.5: 35; 1.0: 67; 2.0: 9; n/a: 48	ε2/ε2: 1; ε2/ε3: 5; ε2/ε4: 2; ε3/ε3: 41; ε3/ε4: 68; ε4/ε4: 29; n/a: 13	15.7 ± 2.7 [9.0, 20.0] n/a: 0

Subjects

Images

Age

Gender

MMSE

CDR

APOE

Education

201

990

75.1 ± 7.1 [56.3, 95.8]

102 M/99 F

28.9 ± 1.2 [24.0, 30.0] n/a: 16

0.0: 171; 0.5: 13; n/a: 17

ε2/ε2: 1;

ε2/ε3: 20;

ε2/ε4: 1;

ε3/ε3: 112;

ε3/ε4: 49;

ε4/ε4: 5;

n/a: 13

16.7 ± 2.5 [12.0, 20.0] n/a: 0

159

990

75.3 ± 7.9 [55.7, 91.5]

91 M/68 F

22.0 ± 3.8 [4.0, 30.0] n/a: 48

0.5: 35; 1.0: 67; 2.0: 9; n/a: 48

ε2/ε2: 1;

ε2/ε3: 5;

ε2/ε4: 2;

ε3/ε3: 41;

ε3/ε4: 68;

ε4/ε4: 29;

n/a: 13

15.7 ± 2.7 [9.0, 20.0] n/a: 0

Values are presented as mean ± SD [range]

Education in years, n/a: no value available

M male, F female, MMSE mini-mental state examination, CDR global clinical dementia rating, APOE apolipoprotein E status

Additionally, this study used data acquired in local studies [20, 21] approved by the ethics committee of the Medical University of Graz (IRB00002556), and signed informed consent was obtained from all study participants or their caregivers. The trial protocol for this prospective study was registered at the National Library of Medicine (trial identification number: NCT02752750). All methods were performed in accordance with the relevant guidelines and regulations.

Preprocessing

Raw T1w images were reoriented to standard space using FSL-REORIENT2STD [22], cropped to a 160 × 240 × 256 matrix size, bias field corrected using N4 [23], and non-linearly registered to the MNI152 template via FSL-FNIRT [22]. Intensity values were normalized to the white matter peak of the brain tissue histogram (196 bins). The outputs of this preprocessing pipeline are referred to as “aligned” images. Individual brain masks were generated in native image space using SIENAX from FSL [24] and warped to the aligned images to create the “skull-stripped” images. Binary masks preserving shape information were derived using manually selected thresholds of 13.75%, 27.5%, and 41.25% of the white matter peak of the brain tissue histogram and the aligned images. These binary masks were also combined with skull-stripped preprocessing, resulting in eight total setups, as illustrated in Fig. 1. Supplementary Fig. S1.2 illustrates that thresholds were selected to preserve meaningful atrophy patterns by comparing residual voxels with individual brain masks.

[See PDF for image]

Fig. 1

Input image setups: Left column with (A1) aligned T1w MRI, identical binarized T1w images with the manually selected threshold levels (B1) 13.75%, (C1) 27.50% and (D1) 41.25%, and in the right column the corresponding skull-stripped versions (A2, B2, C2, D2)

Standard classification network

We utilized a conventional 3D subject-level classifier network as described in [7]. However, because the number of trainable parameters (42 million) relative to the dataset size (1980 images) is high, the network is prone to overfitting. To address this, we reduced the number and size of the convolutional and fully connected layers until the network no longer overfit the training data and the validation accuracy ceased to improve. See Supplementary Material 2 for loss and performance curves. Batch normalization layers did not influence the network’s performance and were therefore omitted. Additionally, we replaced max pooling layers with convolutional layers using striding, as tested in [25]. This modification improves the interpretability of the network [26]. Dropout was not applied in the network. To further enhance interpretability, all biases in the classifier were constrained to be non-positive, which helped sparsify the network activations [26].

The final 3D classifier network, as shown in Fig. 2, consists of a single convolutional layer (kernel size: 3 × 3 × 3, 8 channels) combined with a down-convolutional layer (kernel size: 3 × 3 × 3, 8 channels, striding: 2) as its primary building block. The network stacks four of these main building blocks, followed by two fully connected layers (with 16 and 2 units, respectively), resulting in a total of 0.3 million trainable parameters. Each layer is followed by a Rectified Linear Unit activation function, except for the output layer, which employs a Softmax activation.

[See PDF for image]

Fig. 2

Structure of the 3D classifier network. Dimensionalities between layers are the tensor sizes

Training

Models were trained on aligned images, skull-stripped images, and their binarized counterparts. Training was performed using the Adam optimizer [27] for 30 epochs with a batch size of 20 using three independent network weights initializations [28], minimizing the binary cross-entropy loss. Each data sampling was trained with all three initializations.

Heatmapping and relevance-weighted heatmap presentation

Heatmaps were created using the LRP method with α = 1.0 and β = 0.0, as described in [10]. Each voxel is attributed a relevance score (R). To analyze relevance heatmaps, we qualitatively examined individual maps and calculated mean heatmaps for each configuration. For each mean heatmap, we generated a histogram of relevance values. Starting from the bin with the highest relevance, bin contents were iteratively summed until 40% of the total relevance was included. The lower bound of the final bin was used as the lower threshold for windowing the mean heatmap. All heatmaps presented in this study are overlaid on the MNI152 1 mm template and windowed to display the top 40% of relevance.

Random sampling

For each data sampling (10 samplings) and each network weight initialization (3 initializations), we retrained the network, resulting in 30 training sessions for each input image configuration [29]. Non-converging training sessions were excluded from further analysis. From the remaining sessions, the best-performing run based on validation accuracy was selected to create the mean heatmaps.

Statistical analysis

We calculated performance metrics, including accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC), for all model configurations. Results are reported as mean values, standard deviations, and 95% confidence intervals.

The model trained on skull-stripped T1w images (see A2 in Fig. 1) was designated as the reference model for statistical comparisons. Exact McNemar tests were performed to compare accuracy, sensitivity, and specificity between the reference model and each alternative model for each session run (combination of data sampling and weight initialization), provided both runs were available [30]. To account for multiple comparisons, we applied the discrete Bonferroni-Holm correction, a statistical method suitable for adjusting comparisons involving overlapping data splits [31].

Heatmap similarity analysis

To evaluate heatmap similarities, we compared the best-performing run of the reference model with the corresponding runs (same data sampling and weight initialization) of each alternative model using structural image similarity measures. Before comparison, heatmaps were normalized to min-max. Voxel-wise similarity was assessed with the root mean square error (RMSE), while global trends and overall similarity were evaluated using the Pearson correlation coefficient. Both global and localized patterns were analyzed with the mean structural similarity index measure (MSSIM) and Earth Mover’s Distance (EMD). Additionally, binarized heatmaps highlighting the top 40% and top 10% relevance values were compared using the Intersection over Union (IoU) metric.

Results

The following sections present quantitative performance results across setups, followed by an analysis of CNN-extracted features using LRP and image similarity metrics.

Model performances

Table 2 provides the results of the accuracy, sensitivity, specificity, and AUC for all configurations in the random sampling setup. Performance metrics were evaluated across non-excluded training sessions for each model configuration.

Table 2. Summary of performance metrics of all configurations

Input images	Id	Binarizer	Accuracy	Sensitivity	Specificity	AUC
Aligned T1w	A1	None	71.12 ± 5.01% [61.34%, 82.52%]	67.47 ± 9.90% [51.66%, 85.94%]	74.76 ± 7.07% [62.03%, 85.45%]	0.71 ± 0.05 [0.62, 0.83]
	B1	13.75%	62.51 ± 5.45% [53.39%, 72.09%]	62.26 ± 9.48% [47.70%, 84.93%]	62.79 ± 8.35% [46.56%, 74.42%]	0.63 ± 0.054 [0.53, 0.72]
	C1	27.50%	72.74 ± 5.49% [61.41%, 82.11%]	71.37 ± 9.10% [56.86%, 88.87%]	74.15 ± 9.94% [51.67%, 88.37%]	0.73 ± 0.055 [0.61, 0.82]
	D1	41.25%	77.95 ± 4.57% [70.90%, 86.34%]	76.74 ± 9.41% [60.10%, 94.56%]	79.15 ± 6.80% [64.54%, 89.56%]	0.78 ± 0.045 [0.71, 0.86]
Skull-stripped T1w	A2	None	81.63 ± 3.77% [74.36%, 88.01%]	81.22 ± 6.94% [69.59%, 93.15%]	82.11 ± 7.92% [65.50%, 93.60%]	0.82 ± 0.037 [0.74, 0.88]
	B2	13.75%	78.12 ± 4.63% [70.79%, 85.79%]	76.83 ± 7.03% [62.65%, 87.05%]	79.40 ± 6.76% [65.53%, 89.91%]	0.78 ± 0.046 [0.71, 0.86]
	C2	27.50%	79.57 ± 3.92% [73.46%, 86.45%]	78.32 ± 7.74% [66.79%, 93.87%]	80.92 ± 7.71% [67.53%, 91.95%]	0.80 ± 0.039 [0.74, 0.86]
	D2	41.25%	81.56 ± 4.63% [72.31%, 88.67%]	79.69 ± 9.42% [62.59%, 96.48%]	83.50 ± 6.77% [72.48%, 96.29%]	0.82 ± 0.046 [0.72, 0.89]

Column Id refers to the preprocessing defined in Fig. 1

Values between brackets show the 95% confidence interval

AUC area under receiver operating characteristic curve

Values in bold highlight best result for metric

Using an alpha level of 0.05 and discrete Bonferroni–Holm correction for multiple exact McNemar tests, comparisons between the reference model (A2, skull-stripped, no binarization) and alternative configurations—including binarized skull-stripped images at thresholds of 13.75% (B2, 8 significant differences), 27.50% (C2, 1 significant difference), and 41.25% (D2, 4 significant differences) as well as aligned images binarized at 41.25% (D1, 7 significant differences)—revealed little to no evidence for significant differences in accuracy, sensitivity, and specificity. 120 of 600 total comparisons remained significant after correction for multiple testing. See Supplementary Table S3.1 for significant differences in model comparisons.

Aligned images with binarization thresholds 13.75% and 27.50% (B1, C1) performed comparably to aligned T1w images without binarization (A1). Similarly, skull-stripped images with binarization (B2, C2, D2) and aligned images binarized at 41.25% (D1) exhibited comparable performance to skull-stripped images without binarization (A2) while outperforming other aligned image configurations.

Additionally, model performances were tested on our local, non-public datasets. See Supplementary Material 4 for results.

Feature similarities

Figure 3 shows mean heatmaps for classification decisions on test images. Skull-stripping enhances classification accuracy, while mean heatmaps from binarized inputs closely resemble those from non-binarized inputs.

[See PDF for image]

Fig. 3

Mean heatmaps from test images: Left column with (A1) aligned T1w MRI, identical binarized T1w image with threshold levels (B1) 13.75%, (C1) 27.50%, and (D1) 41.25%, and right column with corresponding skull-stripped versions (A2, B2, C2, D2). The mean accuracies of the models are shown in yellow

Heatmaps from the skull-stripped model (A2) serve as the reference for structural heatmap comparisons in Table 3. Skull-stripped binarization (B2, C2, D2) shows lower voxel-wise dissimilarity (RMSE), higher global similarity (Pearson Correlation), and improved localized similarity (MSSIM, EMD) compared to the aligned versions (A1, B1, C1, D1). Among the binarized models, skull-stripped-binarization-13.75% (B2) demonstrates the highest overall structural similarity (RMSE, Pearson Correlation, MSSIM, EMD) with the reference (A2), while skull-stripped-binarization-27.50% exhibits the strongest regional overlap with the reference. These results emphasize the dominant role of volumetric features over T1w contrast variations. Features similarities and model misclassification analysis were furthermore done using spectral clustering [11, 32] and t-distributed stochastic neighbor embedding [33]. See Supplementary Material 5 for an introduction and results.

Table 3. Summary of heatmap similarity metrics for all configurations compared to the reference model (A2)

Input images	Id	Binarizer %	RMSE	Pearson correlation	MSSIM	EMD	IoU Top 40% R	IoU Top 10% R
Aligned T1w	A1	None	10.78 ± 3.76 [5.98, 21.47]	0.13 ± 0.04 [0.06, 0.21]	0.25 ± 0.08 [0.12, 0.42]	8.44 ± 3.25 [3.99, 17.48]	0.06 ± 0.02 [0.02, 0.11]	0.01 ± 0.01 [0.00, 0.03]
	B1	13.75	3.85 ± 0.87 [2.25, 5.48]	0.08 ± 0.03 [0.03, 0.15]	0.87 ± 0.04 [0.78, 0.94]	0.50 ± 0.29 [0.14, 1.29]	0.02 ± 0.01 [0.01, 0.04]	0.00 ± 0.00 [0.00, 0.01]
	C1	27.50	4.77 ± 0.96 [3.07, 6.64]	0.18 ± 0.05 [0.08, 0.26]	0.83 ± 0.04 [0.74, 0.89]	0.71 ± 0.45 [0.11, 1.71]	0.07 ± 0.02 [0.03, 0.10]	0.01 ± 0.01 [0.00, 0.02]
	D1	41.25	6.17 ± 1.15 [3.98, 8.34]	0.23 ± 0.05 [0.13, 0.30]	0.72 ± 0.06 [0.62, 0.83]	1.61 ± 0.62 [0.54, 2.75]	0.10 ± 0.02 [0.05, 0.14]	0.01 ± 0.01 [0.00, 0.03]
Skull-stripped T1w	A2	None	0.00 ± 0.00	1.00 ± 0.00	1.00 ± 0.00	0.00 ± 0.00	1.00 ± 0.00	1.00 ± 0.00
	B2	13.75	2.79 ± 0.49 [1.97, 3.79]	0.58 ± 0.07 [0.44, 0.69]	0.91 ± 0.03 [0.82, 0.96]	0.39 ± 0.25 [0.12, 1.05]	0.24 ± 0.06 [0.13, 0.35]	0.15 ± 0.03 [0.08, 0.21]
	C2	27.50	3.54 ± 0.61 [2.45, 4.85]	0.56 ± 0.07 [0.39, 0.67]	0.86 ± 0.04 [0.78, 0.92]	0.73 ± 0.44 [0.13, 1.67]	0.31 ± 0.05 [0.20, 0.37]	0.17 ± 0.03 [0.10, 0.24]
	D2	41.25	3.72 ± 0.59 [2.69, 4.96]	0.51 ± 0.08 [0.33, 0.65]	0.80 ± 0.05 [0.68, 0.89]	1.01 ± 0.50 [0.23, 2.12]	0.26 ± 0.03 [0.18, 0.31]	0.16 ± 0.04 [0.09, 0.22]

Values between brackets show the 95% confidence interval

RMSE root mean square error, MSSIM mean structural similarity index measure, EMD Earth Mover’s distance, IoU intersection over union, R relevance

Values in bold highlight best result for metric

Discussion

Previous studies using T1w MRI for AD classification have reported strong CNN performance but often failed to clarify which image features—such as volumetric patterns, signal textures, or preprocessing artifacts—drive model decisions [16, 34, 35]. This lack of interpretability raises concerns about spurious correlations and shortcut learning. We address this gap through a systematic analysis using 1980 T1w MR images from a widely used AD dataset. Applying explainable deep learning, McNemar-based model comparisons, and spectral clustering of LRP heatmaps, we disentangled the contributions of intensity and anatomical information. Texture removal via image binarization further isolated structural cues. Our analysis reveals a bias: CNNs rely heavily on volume-based features rather than biologically specific microstructural changes. These findings highlight the need for rigorous validation of AI tools and a deeper understanding of how dataset properties, preprocessing, and model behavior interact—an integrative perspective still lacking in the field.

To enable robust and unbiased evaluation, we curated a dataset with high inter-class image similarity and balanced class proportions [7]. We then applied one-to-one matching by sex and age to reduce covariate-driven influences, aiming to minimize confounders and ensure models focused on AD-related structural and contrast changes [17, 35]. All models used non-linear registration to align MRIs to MNI152 space, accounting for individual anatomical variation. However, this does not fully correct for atrophy, which varies regionally in AD, especially in the hippocampus and cortical gray matter [16]. Skull-stripping, our reference preprocessing step, is widely used in AD classification and supported by heatmap-based studies showing improved accuracy in AD [34], dementia [36], multiple sclerosis [37], and brain age prediction [38, 39–40]. Heatmaps generated in this study corroborate these findings, with relevance predominantly concentrated at the tissue boundary. To further probe the contribution of volumetric features, we removed tissue contrast entirely through binarization, isolating atrophic and structural features for analysis.

We investigated three binarization levels, each aligned with the white matter intensity peak of the image, as depicted in Fig. 1. These levels, although chosen arbitrarily, retained differing proportions of anatomical structures, capturing distinct aspects of atrophy, including ventricular enlargement, hippocampal shrinking and cerebellum morphology. As CNNs seem to focus on high-contrast regions [37, 41], these binarization levels allowed us to systematically dissect how different volumetric and structural features influenced model predictions.

Performance metrics, summarized in Table 2, reveal that removing gray-white matter contrast while retaining skull-stripping has little to no effect on model performance. Statistical analyses using exact McNemar tests, adjusted with discrete Bonferroni-Holm correction (α = 0.05), revealed minimal evidence of significant differences in accuracy, sensitivity, and specificity across configurations when compared to the reference model. Specifically, skull-stripped and binarized models at thresholds of 13.75%, 27.50%, and 41.25%, as well as aligned images binarized at 41.25%, showed fewer than 10% of comparisons with significant differences. This suggests that volumetric information is sufficient for achieving high classification accuracy in CNN-based AD classification, with minimal contribution from gray-white texture variations.

Given the consistent model performances across configurations, we examined structural image similarities using similarity metrics applied on heatmaps, shown in Table 3. Surprisingly, the model trained with binarization at 13.75% (B2)—which retains the most tissue within the brain mask—exhibited the highest similarity to the reference model (A2) across global (RMSE, Pearson correlation) and localized (MSSIM, EMD) similarity metrics. This indicates that the key volumetric and morphological features driving classification are predominantly encoded in the brain mask’s volume and shape.

Furthermore, starting at binarization-27.50% (C1, C2), models began to incorporate additional regions such as the ventricles, with binarization at 41.25% (D1, D2) capturing also hippocampal features. Notably, the highest overlaps between reference and binarized models were observed at 27.50% (C2), as indicated by intersection-over-union metrics. This suggests that contrast in ventricular regions in the reference model provides sufficient signal for the model to identify disease-relevant patterns.

Overall is the classification performance of the models driven by high contrast variations, either given by brain structures like the ventricular system and the hippocampi, or by being introduced artificially through image preprocessing. Multi-center studies have repeatedly shown that scanner vendor, acquisition protocol, and preprocessing pipeline systematically alter radiomic and image features, and consequently classifier behavior [42, 43]. Effects can persist even after harmonization if applied improperly [44, 45].

Drawing parallels to the Clever Hans effect—where unintended cues in the experimental setup were inadvertently learned—we observed a similar phenomenon in deep learning-based AD classification. Preprocessing, particularly skull-stripping in T1w imaging, is crucial for achieving state-of-the-art performance [7, 34, 36, 40], as demonstrated by the inferior results of model A1 (aligned, no binarization) compared to A2 (skull-stripped, no binarization). However, when combined with a CNN, this preprocessing acts as an interviewer effect, steering the model toward artificially introduced yet well-established features like volumetry. This underscores the need for careful control of preprocessing artifacts [16] and suggests that quantitative MRI parameter maps [46, 47–48] or model regularization [16] could offer a more robust alternative by minimizing reliance on such artificial cues.

Limitations

This study has its limitations. First, while the dataset is large, representative, and carefully crafted, it is derived solely from the ADNI database. However, the final models were tested on our local, non-public datasets. Second, our analysis primarily examines T1w images, preprocessing strategies, their impact on classification performance, and the features extracted using heatmaps. Third, the CNN architecture was intentionally simplified to control overfitting and was optimized for the reference model (skull-stripped, no binarization) by systematically reducing trainable parameters and complexity, alongside hyperparameter tuning. Although the final architecture achieved performance metrics comparable to existing literature, applying the same setup across experiments aimed to minimize bias but may not eliminate it entirely. It remains to be investigated whether similar effects would arise in more complex 3D architectures. However, we expect that such models would also be susceptible to the bias estimated in this study. Lastly, while heatmaps and similarity metrics were effective for feature interpretation, they may not fully capture the intricate interactions between features learned by the models.

Conclusion

Our findings uncover a shortcut learning effect in deep learning-driven AD classification, demonstrating that models predominantly rely on volumetric features rather than microstructural changes in gray and white matter. This highlights the critical need to evaluate data selection and preprocessing workflows to distinguish between artifacts and true disease-specific patterns, ensuring clinical relevance.

The implications of this work extend beyond AD classification, urging the adoption of robust strategies for disentangling artifacts from meaningful features in deep learning workflows. Model validation pipelines should routinely test model sensitivity to preprocessing variations and data handling choices before clinical deployment. Future studies should incorporate quantitative MRI parameters, such as T1, R2*, or QSM, to provide insights into disease pathology and enhance model interpretability and generalizability. By addressing these challenges, the field can advance toward more reliable and clinically actionable applications of AI in neuroimaging.

Acknowledgements

We thank Lukas Pirpamer for the feedback provided during the creation of this manuscript. Data collection and sharing for this project were funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California. It is available as a preprint on arXiv (https://arxiv.org/abs/2501.15831). Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (https://adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: https://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Author contributions

Conceptualization: C.T., C.L., Data curation: C.T., C.L., R.Sc. Formal analysis: C.T., M.S., C.L., Funding acquisition: C.L., Investigation: C.T., M.S., C.L., Methodology: C.T., M.S., C.L., Project administration: C.L., Resources: C.T., C.L., Software: C.T., Supervision: R.St., S.R., C.L., Validation: C.T., M.S., R.St., S.R., C.L., Visualization: C.T., Writing—original draft: C.T., C.L., Writing—review & editing: C.T., M.S., R.St., R.Sc., S.R., C.L.

Funding

This study was funded by the Austrian Science Fund (FWF grant numbers: P30134, P35887).

Data availability

The MR images from our local datasets are not publicly available. Formal data sharing requests to the corresponding authors will be considered.

Code availability

The code and the image IDs (ADNI images) used in this study are available under https://github.com/christiantinauer/binADNI.

Declarations

Ethics approval and consent to participate

Data used in this study were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (https://adni.loni.usc.edu). The ADNI study was conducted in accordance with the ethical standards of each participating institution’s Institutional Review Board and with the 1964 Helsinki declaration and its later amendments. Written informed consent was obtained from all participants or their authorized representatives at the time of enrollment. The current study involved only secondary analyses of fully de-identified data provided by ADNI, and no additional ethical approval was required. All analyses were conducted in accordance with the ADNI Data Use Agreement. Additionally, this study used data acquired in local studies approved by the ethics committee of the Medical University of Graz (IRB00002556), and signed informed consent was obtained from all study participants or their caregivers. The trial protocol for this prospective study was registered at the National Library of Medicine (trial identification number: NCT02752750). All methods were performed in accordance with the relevant guidelines and regulations.

Consent for publication

Signed informed consent was obtained from all study participants or their caregivers.

Competing interests

The authors declare no competing interests.

Abbreviations

Alzheimer’s disease

ADNI

Alzheimer’s Disease Neuroimaging Initiative

AUC

Area under receiver operating characteristic curve

CNN

Convolutional neural network

EMD

Earth mover’s distance

IoU

Intersection over union

LRP

Layer-wise relevance propagation

MRI

Magnetic resonance imaging

MSSIM

Mean structural similarity index measure

Normal control

PET

Positron emission tomography

Relevance score

RMSE

Root mean squared error

T1w

T1-weighted

Supplementary information

The online version contains supplementary material available at https://doi.org/10.1186/s13244-025-02158-4.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Scheltens, P; De Strooper, B; Kivipelto, M et al. Alzheimer’s disease. Lancet; 2021; 397, pp. 1577-1590.1:CAS:528:DC%2BB3MXhsVegsL7F [DOI: https://dx.doi.org/10.1016/S0140-6736(20)32205-4] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33667416][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354300]

2. Alzheimer’s Association (2024) 2024 Alzheimer’s disease facts and figures. Alzheimers Dement 20:3708–3821. https://doi.org/10.1002/alz.13809

3. Beach, TG; Monsell, SE; Phillips, LE; Kukull, W. Accuracy of the clinical diagnosis of Alzheimer disease at National Institute on Aging Alzheimer Disease Centers, 2005–2010. J Neuropathol Exp Neurol; 2012; 71, pp. 266-273. [DOI: https://dx.doi.org/10.1097/NEN.0b013e31824b211b] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22437338]

4. Dubois, B; Villain, N; Frisoni, GB et al. Clinical diagnosis of Alzheimer’s disease: recommendations of the International Working Group. Lancet Neurol; 2021; 20, pp. 484-496.1:CAS:528:DC%2BB3MXhtVOlsb3E [DOI: https://dx.doi.org/10.1016/S1474-4422(21)00066-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33933186][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8339877]

5. Sluimer, JD; Vrenken, H; Blankenstein, MA et al. Whole-brain atrophy rate in Alzheimer disease: identifying fast progressors. Neurology; 2008; 70, pp. 1836-1841.1:STN:280:DC%2BD1c3ps1emtw%3D%3D [DOI: https://dx.doi.org/10.1212/01.wnl.0000311446.61861.e3] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/18458218]

6. Henneman, WJP; Sluimer, JD; Barnes, J et al. Hippocampal atrophy rates in Alzheimer disease: added value over whole brain volume measures. Neurology; 2009; 72, pp. 999-1007.1:STN:280:DC%2BD1M3ht1agug%3D%3D [DOI: https://dx.doi.org/10.1212/01.wnl.0000344568.09360.31] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/19289740][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2821835]

7. Wen, J; Thibeau-Sutre, E; Diaz-Melo, M et al. Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. Med Image Anal; 2020; 63, [DOI: https://dx.doi.org/10.1016/j.media.2020.101694] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32417716]101694.

8. Davatzikos, C. Machine learning in neuroimaging: progress and challenges. Neuroimage; 2019; 197, pp. 652-656. [DOI: https://dx.doi.org/10.1016/j.neuroimage.2018.10.003] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30296563]

9. Simonyan K, Vedaldi A, Zisserman A (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. In: Bengio Y, LeCun Y (eds) Proceedings of the 2nd international conference on learning representations (ICLR 2014), Banff, 14–16 April 2014

10. Bach, S; Binder, A; Montavon, G et al. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One; 2015; 10, e0130140.1:CAS:528:DC%2BC2MXhsVemu7zM [DOI: https://dx.doi.org/10.1371/journal.pone.0130140] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26161953][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4498753]

11. Lapuschkin, S; Wäldchen, S; Binder, A et al. Unmasking Clever Hans predictors and assessing what machines really learn. Nat Commun; 2019; 10, 1:CAS:528:DC%2BC1MXnvFyrt7s%3D [DOI: https://dx.doi.org/10.1038/s41467-019-08987-4] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30858366][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6411769]1096.

12. Bottani, S; Burgos, N; Maire, A et al. Evaluation of MRI-based machine learning approaches for computer-aided diagnosis of dementia in a clinical data warehouse. Med Image Anal; 2023; 89, [DOI: https://dx.doi.org/10.1016/j.media.2023.102903] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37523918]102903.

13. Wallis, D; Buvat, I. Clever Hans effect found in a widely used brain tumour MRI dataset. Med Image Anal; 2022; 77, [DOI: https://dx.doi.org/10.1016/j.media.2022.102368] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35063892]102368.

14. Pfungst Oskar, Pfungst O, Rahn CL (1911) Clever Hans (the horse of Mr. Von Osten) a contribution to experimental animal and human psychology. H. Holt and Company, New York

15. Geirhos, R; Jacobsen, J-H; Michaelis, C et al. Shortcut learning in deep neural networks. Nat Mach Intell; 2020; 2, pp. 665-673. [DOI: https://dx.doi.org/10.1038/s42256-020-00257-z]

16. Tinauer, C; Heber, S; Pirpamer, L et al. Interpretable brain disease classification and relevance-guided deep learning. Sci Rep; 2022; 12, 1:CAS:528:DC%2BB38XjtVelsb%2FK [DOI: https://dx.doi.org/10.1038/s41598-022-24541-7] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36424437][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9691637]20254.

17. Canu, E; McLaren, DG; Fitzgerald, ME et al. Mapping the structural brain changes in Alzheimer’s disease: the independent contribution of two imaging modalities. J Alzheimers Dis; 2011; 26, pp. 263-274. [DOI: https://dx.doi.org/10.3233/JAD-2011-0040] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21971466][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3267543]

18. Vásquez-Venegas C, Wu C, Sundar S et al (2024) Detecting and mitigating the Clever Hans effect in medical imaging: a scoping review. J Imaging Inform Med. https://doi.org/10.1007/s10278-024-01335-z

19. Kline, A; Luo, Y. PsmPy: a package for retrospective cohort matching in Python. Annu Int Conf IEEE Eng Med Biol Soc; 2022; 2022, pp. 1354-1357. [DOI: https://dx.doi.org/10.1109/EMBC48229.2022.9871333] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36086543]

20. Schmidt, R; Enzinger, C; Ropele, S et al. Progression of cerebral white matter lesions: 6-year results of the Austrian Stroke Prevention Study. Lancet; 2003; 361, pp. 2046-2048. [DOI: https://dx.doi.org/10.1016/s0140-6736(03)13616-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/12814718]

21. Damulina, A; Pirpamer, L; Soellradl, M et al. Cross-sectional and longitudinal assessment of brain iron level in Alzheimer disease using 3-T MRI. Radiology; 2020; 296, pp. 619-626. [DOI: https://dx.doi.org/10.1148/radiol.2020192541] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32602825]

22. Jenkinson, M; Beckmann, CF; Behrens, TEJ et al. FSL. Neuroimage; 2012; 62, pp. 782-790. [DOI: https://dx.doi.org/10.1016/j.neuroimage.2011.09.015] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21979382]

23. Tustison, NJ; Avants, BB; Cook, PA et al. N4ITK: improved N3 bias correction. IEEE Trans Med Imaging; 2010; 29, pp. 1310-1320. [DOI: https://dx.doi.org/10.1109/TMI.2010.2046908] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20378467][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3071855]

24. Smith, SM; Zhang, Y; Jenkinson, M et al. Accurate, robust, and automated longitudinal and cross-sectional brain change analysis. Neuroimage; 2002; 17, pp. 479-489. [DOI: https://dx.doi.org/10.1006/nimg.2002.1040] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/12482100]

25. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller MA (2015) Striving for simplicity: the all convolutional net. In: Bengio Y, LeCun Y (eds) Proceedings of the 3rd international conference on learning representations (ICLR 2015), San Diego, 7–9 May 2015

26. Montavon, G; Samek, W; Müller, K-R. Methods for interpreting and understanding deep neural networks. Digital Signal Process; 2018; 73, pp. 1-15. [DOI: https://dx.doi.org/10.1016/j.dsp.2017.10.011]

27. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) Proceedings of the 3rd international conference on learning representations (ICLR 2015), San Diego, 7–9 May 2015

28. Bouthillier X, Delaunay P, Bronzi M et al (2021) Accounting for variance in machine learning benchmarks. In: Smola A, Dimakis A, Stoica I (eds) Proceedings of machine learning and systems. pp 747–769

29. Bradshaw, TJ; Huemann, Z; Hu, J; Rahmim, A. A guide to cross-validation for artificial intelligence in medical imaging. Radiol Artif Intell; 2023; 5, [DOI: https://dx.doi.org/10.1148/ryai.220232] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37529208][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10388213]e220232.

30. Dietterich, TG. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput; 1998; 10, pp. 1895-1923.1:STN:280:DC%2BC2sjotVartA%3D%3D [DOI: https://dx.doi.org/10.1162/089976698300017197] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/9744903]

31. Westfall, PH; Troendle, JF; Pennello, G. Multiple McNemar tests. Biometrics; 2010; 66, pp. 1185-1191. [DOI: https://dx.doi.org/10.1111/j.1541-0420.2010.01408.x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20345498][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2902578]

32. von Luxburg, U. A tutorial on spectral clustering. Stat Comput; 2007; 17, pp. 395-416. [DOI: https://dx.doi.org/10.1007/s11222-007-9033-z]

33. van der, ML; Hinton, G. Visualizing data using t-SNE. J Mach Learn Res; 2008; 9, pp. 2579-2605.

34. Böhle, M; Eitel, F; Weygandt, M; Ritter, K. Layer-wise relevance propagation for explaining deep neural network decisions in MRI-based Alzheimer’s disease classification. Front Aging Neurosci; 2019; 11, 194. [DOI: https://dx.doi.org/10.3389/fnagi.2019.00194] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31417397][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6685087]

35. Serra, L; Cercignani, M; Lenzi, D et al. Grey and white matter changes at different stages of Alzheimer’s disease. J Alzheimers Dis; 2010; 19, pp. 147-159. [DOI: https://dx.doi.org/10.3233/JAD-2010-1223] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20061634]

36. Leonardsen, EH; Persson, K; Grødem, E et al. Constructing personalized characterizations of structural brain aberrations in patients with dementia using explainable artificial intelligence. NPJ Digit Med; 2024; 7, 110. [DOI: https://dx.doi.org/10.1038/s41746-024-01123-7] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38698139][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11066104]

37. Eitel, F; Soehler, E; Bellmann-Strobl, J et al. Uncovering convolutional neural network decisions for diagnosing multiple sclerosis on conventional MRI using layer-wise relevance propagation. Neuroimage Clin; 2019; 24, [DOI: https://dx.doi.org/10.1016/j.nicl.2019.102003] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31634822][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6807560]102003.

38. Hofmann, SM; Beyer, F; Lapuschkin, S et al. Towards the interpretability of deep learning models for multi-modal neuroimaging: finding structural changes of the ageing brain. Neuroimage; 2022; 261, [DOI: https://dx.doi.org/10.1016/j.neuroimage.2022.119504] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35882272]119504.

39. Hofmann, SM; Goltermann, O; Scherf, N et al. The utility of explainable AI for MRI analysis: relating model predictions to neuroimaging features of the aging brain. Imaging Neurosci; 2025; 3, [DOI: https://dx.doi.org/10.1162/imag_a_00497] imag_a_00497.

40. Dinsdale, NK; Bluemke, E; Smith, SM et al. Learning patterns of the ageing brain in MRI using deep convolutional networks. Neuroimage; 2021; 224, [DOI: https://dx.doi.org/10.1016/j.neuroimage.2020.117401] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32979523]117401.

41. Mattia, GM; Villain, E; Nemmi, F et al. Investigating the discrimination ability of 3D convolutional neural networks applied to altered brain MRI parametric maps. Artif Intell Med; 2024; 153, [DOI: https://dx.doi.org/10.1016/j.artmed.2024.102897] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38810471]102897.

42. Kushol, R; Parnianpour, P; Wilman, AH et al. Effects of MRI scanner manufacturers in classification tasks with deep learning models. Sci Rep; 2023; 13, 1:CAS:528:DC%2BB3sXitVyltb7P [DOI: https://dx.doi.org/10.1038/s41598-023-43715-5] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37798392][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10556074]16791.

43. Bhagwat, N; Barry, A; Dickie, EW et al. Understanding the impact of preprocessing pipelines on neuroimaging cortical surface analyses. Gigascience; 2021; 10, [DOI: https://dx.doi.org/10.1093/gigascience/giaa155] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33481004][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7821710]giaa155.

44. Lu, Y-C; Zuo, L; Chou, Y-Y et al. An evaluation of image-based and statistical techniques for harmonizing brain volume measurements. Imaging Neurosci; 2025; 3, [DOI: https://dx.doi.org/10.1162/IMAG.a.73] IMAG.a.73.

45. Marzi, C; Giannelli, M; Barucci, A et al. Efficacy of MRI data harmonization in the age of machine learning: a multicenter study across 36 datasets. Sci Data; 2024; 11, [DOI: https://dx.doi.org/10.1038/s41597-023-02421-7] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38263181][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10805868]115.

46. Tinauer C, Damulina A, Sackl M et al (2024) Explainable concept mappings of MRI: revealing the mechanisms underlying deep learning-based brain disease classification. In: Longo L, Lapuschkin S, Seifert C (eds) Explainable artificial intelligence. Springer Nature, Cham, pp 202–216

47. Tinauer C, Sackl M, Ropele S, Langkammer C (2025) Identifying Alzheimer’s disease prediction strategies of convolutional neural network classifiers using R2* maps and spectral clustering. In: Proceedings of 33rd European signal processing conference (EUSIPCO 2025). pp 1497–1501

48. Malhi, BS; Lo, J; Toto-Brocchi, M et al. Quantitative magnetic resonance imaging in Alzheimer’s disease: a narrative review. Quant Imaging Med Surg; 2025; 15, pp. 3641-3664. [DOI: https://dx.doi.org/10.21037/qims-24-1602] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/40235823][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11994541]

Word count: 5710

Show less

© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Skull-stripping induces shortcut learning in MRI-based Alzheimer’s disease classification

Content area

Abstract

Full text