1 Introduction
Machine learning techniques, particularly deep learning (DL) models, have achieved remarkable results in various fields of computer vision, including image recognition, autonomous vehicles, robotics, and motion recognition [1–4]. These models have also revolutionized medical image analysis, leading to significant advancements in disease diagnosis, medical imaging interpretation, and personalized treatment planning [5–8].
However, deploying these models in real-world applications presents several challenges, including confounding factors [9–11], multiple salient features, and a lack of interpretability. For instance, Sagawa et al. [12] found that models trained on the Waterbirds dataset tend to associate waterbirds with water-containing backgrounds, while those trained on the CelebA dataset associate males with dark hair. Medical images are particularly challenging due to their complexity, often containing multiple critical features within a single image. For example, a clinical image for skin cancer diagnosis might display various features such as atypical pigmentation, irregular streaks, asymmetry, border irregularities, and color variations [13, 14]. Accurate diagnosis and classification of melanoma require considering these interconnected features simultaneously.
In high-stakes domains like medical image analysis and autonomous vehicles, explanations are essential rather than optional [15–20]. Systems in these fields must provide explanations alongside their predictions to ensure reliability, fairness, and transparency while addressing latent defects [21–26]. Thus, techniques to interpret and understand DL models are crucial for real-world applications. However, most existing techniques focus on quantifying the contribution of low-level input features, such as raw pixel values, to specific decisions [27–33]. These methods, including perturbation-based [29] and gradient-based [27, 30, 34, 35] techniques, typically produce saliency maps that show the relationship between individual pixels and model output. However, such explanations are instance-level and do not provide insights into the global behavior of the model. Kim et al. found that these explanations do not enhance users’ understanding or trust in the model, and similar visual explanations can lead to different interpretations by different users [36].
Domain experts often seek to understand the model's underlying reasoning process rather than just its final decision, and the types of explanations described above frequently fail to align with human thinking and reasoning at the concept level. Recently introduced concept-based explanation methods aim to bridge this gap by providing more intuitive, human-aligned explanations focused on higher-level concepts [36–42]. Additionally, these methods offer class-wise explanations while characterizing the model's global behavior and reasoning process. However, they may provide irrelevant explanations and lack a user-guided explanation generation mechanism. Most methods either quantify the importance of concepts for predictions [36] or do not fully harness concepts to explain model inference.
A recent study [43] proposed an explainable deep learning model to address these issues, but it relies on a pre-trained image segmentation model (trained on the ImageNet dataset) to remove irrelevant image parts, limiting its applicability in domains such as medicine and agriculture that are not covered by ImageNet. Additionally, the model keeps the concepts for each class distinct, posing challenges in fine-grained classification and medical domains, where different classes may share similar concepts. This limitation also affects its scalability in large-scale applications with many classes and concepts.
To address these challenges, we propose a two-stream explainable deep learning model inspired by the dual-process theory of human cognition [44, 45]. This theory categorizes human thinking into two systems: System-I, a fast, implicit, intuitive, and non-transparent process, and System-II, a slow, rational, controllable, and analytical process. We represent System-I with a Convolutional Neural Network (CNN) for rapid, implicit pattern recognition and System-II with a cross-attention concept memory network for transparent, controllable, and logical reasoning.
We employ an unsupervised segmentation approach [46] to effectively disentangle objects from the background, addressing the challenge of limited training data [47] and overcoming the constraint that object segmentation models are mainly available for ImageNet classes [43]. The global features (concepts) are extracted from the segmented objects using Kim et al.'s technique [36]. To address the issues identified by Zia et al. [43], we apply a two-step concept filtering process: a) manual selection and b) similarity screening. Manual selection in CA-SoftNet allows users to control and interpret the concept learning process, distinguishing it from many existing methods [48–53]. Similarity screening reduces the number of concepts by selecting the most salient ones for each class. In fine-grained classification, where different classes may share similar features, our model allows similar concepts to be shared, thus inducing human-like cognition and reasoning. The model uses the resulting salient concepts for training and for generating high-level, human-understandable explanations. This approach bridges the gap between global and local explanations, providing transparent, comprehensible reasoning while requiring less abstraction and generalizing from smaller datasets. The distinctive two-stream architecture of our model enables explicit control over the trade-off between interpretability and efficacy.
Our key contributions can be summarized as follows:
* Novel Concept Extraction Method: We present a refined concept extraction technique that ensures the selection of relevant concepts, thereby improving the accuracy of the generated explanations.
* User-Guided Explanation Generation: Our approach allows users to guide the model in generating local explanations using terms that are significant to them, making the explanations more meaningful.
* CA-SoftNet Model: By focusing on salient concepts, our method reduces the number of concepts needed for class representation, which enhances scalability for large datasets. Additionally, it facilitates the sharing of similar concepts across different classes, promoting human-like cognition in the model’s reasoning and inference processes.
* Experimental Validation: Our experiments on the CUB 200-2011, Stanford Cars, ISIC 2016, and ISIC 2017 datasets demonstrate that our model outperforms reported interpretable models and performs comparably to non-interpretable models.
These contributions collectively advance the field of interpretable deep learning models by providing insights into concept-based explanations and their application across various domains. The paper is organized as follows: Section 2 reviews related work on interpretable deep learning models, concept-based interpretable models, and fine-grained classification models. Section 3 describes the architecture of the proposed model and the experimental setup. Section 4 analyzes and discusses the experimental results. Section 5 summarizes the key contributions of our work and highlights the implications and potential applications of CA-SoftNet.
2 Related work
Techniques devised to provide visual explanations of deep learning models can be broadly categorized into post-hoc techniques and ante-hoc (self-explanatory) models. The following subsections briefly discuss each category.
2.1 Post-hoc attribution methods
In the post-hoc setting, several techniques produce saliency maps as visual explanations. These saliency maps highlight the attribution of each input-image pixel, thus providing local explanations for each decision. One such class of methods includes gradient-based approaches such as CAM [54], Grad-CAM [30], guided Grad-CAM [30], and augmented Grad-CAM++ [55], which generate saliency maps by propagating the gradients of a target class into the final convolutional layer. Other popular approaches include back-propagation [56], layer-wise relevance propagation [57], activation maximization [58–60], and deconvolution [33]. LIME [61] and B-LIME [62] explain any classifier's predictions by learning a local interpretable model around the prediction. Shapley-value-based explanations (SHAP) [63] identify the most (and least) influential parts of input images. Although post-hoc explanations do not require retraining or changes to the model, they can produce fragile and unfaithful explanations [38].
A recently developed model leverages an adapted Convolutional Neural Network (CNN) architecture and is trained using clinical images of skin cancer [64]. This multi-output incremental diagnostics model employs the Inception V3 network to extract features, which are then passed to an incremental diagnosis block instead of conventional classification layers. This approach enables the model to learn the taxonomy of skin lesions effectively. Each diagnostic level in the model receives predictions from the preceding block and the features extracted by the Inception model at the previous stage. Consequently, Levels 2 and 3 of the diagnostic block acquire detailed information regarding lesion origin and malignancy, respectively. The model combines Cross Entropy (CE) and Taxonomy Cross Entropy (TCE) loss functions to account for individual and combined stage loss. A Class Activation Mapping (CAM) based approach is employed to elucidate the model’s reasoning process, which highlights the salient regions of the input image that are most influential in the model’s predictions.
A novel architecture, the Interpretability-based Multimodal Convolutional Neural Network (IM-CNN), has been recently introduced [65]. This architecture features a three-branched design. Branch 1 processes patient metadata and integrates this information with features extracted from the segmented lesion in Branch 2. These combined features, representing domain knowledge, are analyzed using SHAP (SHapley Additive exPlanations) to determine the importance score of each contributing feature. In Branch 3, EfficientNet-B5 is utilized as the backbone to compute the lesion’s local and global features. These extracted features are then processed by Grad-CAM (Gradient-weighted Class Activation Mapping), which generates a corresponding heatmap highlighting the regions in the input image responsible for a particular decision. This architecture effectively leverages multimodal input, employing Grad-CAM for image-based learning and SHAP for metadata-based learning, thereby enabling comprehensive multimodal visual explanation analysis.
All these post-hoc visualization methods share some common issues: the provided explanations are not explicit and operate at the level of individual pixels, so humans must themselves map the relationship between the explanation and the underlying reasoning process of the model. Human perception is liable to error and often draws contradictory conclusions from similar explanations [36]. Moreover, humans are used to thinking, reasoning abstractly, and classifying objects based on high-level, human-understandable concepts rather than such low-level (pixel) details. The dependency of these post-hoc methods on low-level features also hinders their application to high-dimensional datasets, e.g., medical images. Their explanations provide little insight into reasoning, as they cannot tell which features contributed to a prediction. Most importantly, these methods cannot provide explanations in terms of concepts that are vital to the user (i.e., they lack customizability and controllability).
2.2 Concept attribution methods
Recently, much work has been done in quantifying the model’s outcome based on understandable human “concepts” [36, 38, 66, 67]. These concepts, known as Concept Activation Vectors (CAVs), represent objects as whole or semantic parts of them. One of the earliest works done in this field is by [36], which uses directional derivatives of target concepts over the images to quantify their importance for classifying a particular class, and [68] dissects the input image into concepts. Both these methods require labeled concepts, thus obstructing their application on a large scale and in fields where labeled concepts are hard to produce. Ghorbani et al. [66] extracted these concepts automatically by grouping similar low-level features into concepts that appear coherently and contribute to the correct prediction of a particular class. Much of the work on concept-based explanation methods produces global explanations, but humans require plausible local explanations.
Although explanations generated by post-hoc techniques are easy to employ because they do not require changes to a model's architecture, many researchers have questioned their faithfulness. Because these explanations are produced by separate models, they may fail to reveal the underlying reasoning mechanism of the model being explained, which is essential for developing human trust. Such concerns have led many researchers, e.g., Rudin et al. [69], to advocate self-explanatory deep learning models with built-in reasoning and classification functionality.
Recently, a novel dual branch-based Concept-Attention Whitening (CAW) framework has been introduced to classify skin melanoma [70]. The CAW framework divides this task into two branches: the disease diagnosis branch and the concept alignment branch. The disease diagnosis branch employs a CNN-based classification network, incorporating a novel CAW layer for concept discovery. This branch facilitates the creation of a concept dataset by extracting multiple disentangled salient concepts from individual images. The concept alignment branch then computes the concept-attention score, highlighting each input concept’s contribution by aligning it with the concepts discovered in the disease diagnosis branch using a weakly supervised concept mask. This intrinsic concept-based explainable network provides both visual concept-based explanations and textual explanations based on activation values. Although the Concept-Attention Whitening framework offers visual and textual explanations, it may face challenges in handling complex cases where concepts are not easily disentangled. The weakly supervised concept mask used for alignment might also introduce inaccuracies if the alignment is not precise. Additionally, the effectiveness of the CAW framework is highly dependent on the quality of the concept extraction process, which may not always capture all relevant features, especially in varied or noisy data.
Concept bottleneck models have recently gained prominence for their ability to provide interpretable deep-learning models for image classification. Typically, an expert is required to build a correlation between the concept and the input image [71] in such models, which is often challenging or prohibitively expensive in the medical field.
A recent study used these models and introduced a self-supervision method for discovering concepts, employing slot attention to highlight relevant regions [72]. The classifier then correlates these concepts with the input image, enhancing interpretability. The model is regularized, and new concepts are discovered, using a combination of reconstruction and contrastive losses, which optimizes concepts for specific tasks. Another work combined a Large Language Model (LLM) with concept bottleneck models to provide an interpretable solution for the medical domain [73]. First, text-based concepts are extracted from an LLM, specifically GPT-4. These textual concepts are then aligned with medical concepts using a vision-language model (VLM). The resulting heatmap, derived from the cosine distance, measures the similarity between the concepts and the input image. Although this model offers both global and local explanations, these explanations are neither controlled by the users nor expressed in terms that are significant to them.
2.3 Attention-based interpretable models
Our work closely relates to attention-based interpretable CNN models for fine-grained visual classification (FGVC) [74–77]. FGVC architectures focus on separating visually hard-to-differentiate object classes. This task is challenging due to small inter-class and large intra-class variances caused by factors such as viewpoint, occlusion, and pose. Most existing attention-based FGVC methods work by identifying and quantifying the attribution of discriminative parts to the classification task. In a recent work [78], a region-based part discovery network is integrated with a regional attention module to provide interpretability. The whole task can be divided into three subtasks: (i) part segmentation and regularization, (ii) part-wise feature extraction and attribution, and (iii) attention-based classification. The model uses class-level image labels and a novel part-occurrence regularization for part discovery and segmentation. The discovered object parts are then fed into a subnetwork with an integrated attention module to determine the attribution of each part. The model leverages the region-based part features and regional attention for the final prediction, outputting a) a part assignment map, b) an attention map, and c) a predicted label corresponding to each subtask mentioned earlier. Another related work is [48], where a novel CNN-based multi-attention multi-class model is trained to regulate multiple object parts. Attention-based features of multiple object parts are extracted using a squeeze-and-excitation module. The metric learning framework imposes a multi-attention multi-class (MAMC) loss constraint to ensure coherency among the object parts of similar classes. The learned features of the extracted parts, along with the class labels, are used for the final classification task. Like our proposed model, these methods are trained without using bounding boxes or part annotations of input images. Although these models can show how much each part of an input image contributes to a decision, they cannot present prototypical examples resembling those parts. In contrast, our proposed model identifies the discriminative parts of the input image and showcases similar prototypical cases of user-defined concepts.
A recent development in fault diagnosis for industrial activities involves introducing a CNN-based attention model [79]. This method transforms multivariate time-series data into an image-like format through a sliding window processing technique, leveraging CNN for feature extraction. To enhance fault feature extraction, the model incorporates prior knowledge of category attribution using an attention mechanism within the CNN. Visualization techniques confirm the efficacy of integrating prior knowledge for improved model performance and interpretability. However, a limitation is identified in the reliance on label information for defining prior knowledge, restricting its application in domains where labels are not easily accessible.
The most serious problem with these attention-based models is that their explanations are based on low-level features, which are not intuitive for humans and require further introspection to infer the model's reasoning.
2.4 Case-based reasoning deep learning models
Our work is closely related to Case-Based Reasoning (CBR) deep learning models, which use related instances from datasets alongside the model’s predictions to explain the underlying inference process. Many of these models are based on the k-Nearest Neighbour (KNN) algorithm, which enhances interpretability by providing exemplar input points (prototypes) that resemble the predicted data points [49]. Another significant contribution to this field is the work by Li et al. [80], where prototypes are learned within the network using a prototype layer as encoded input. These learned prototypes can be visualized using a dedicated decoder module. However, the provided prototypes are often unrealistic, representing the latent space of training images. Another issue with such models is that they explain outputs in terms of whole input images. For fine-grained visual categorization (FGVC) tasks, however, we require part-based prototypes to provide detailed local explanations.
ProtoPNet [50], PIP-Net [81], ProtoPShare [51], and other models based on them [52, 53] offer fine-grained classification along with plausible, part-wise explanations. In these models, prototypical parts of training images from each class are learned and compared to crucial parts of input images, much as human inference works. The classification layer then predicts based on the similarity score computed between the image activation map and the prototypical part representations, much like our proposed methodology.
However, a major issue with this class of interpretable models is that the learned prototypes are not controllable (e.g., user-defined), and some learned concepts correspond to irrelevant backgrounds [51]. These issues are addressed by recent work from Zia et al. [43], in which the concepts are controllable and used to produce an explainable classification model. Nonetheless, this approach requires a pre-trained Mask-RCNN for object segmentation (background removal), which can only be used for classes present in ImageNet. Additionally, the proposed model is neither scalable nor robust enough, as it computes the output for each concept of every class individually, resulting in a time-consuming process during both training and testing. Another significant problem with the works of Kim [36], Ghorbani [66], and Zia [43] is that they do not consider that, in fine-grained classification, some classes might share common concepts (e.g., black and white stripes are common to both zebras and zebra crossings). If not properly addressed, this issue could mislead or confuse the classifier; addressing it, however, allows the model to exploit similarities among classes, helping it reason more like human cognition.
2.5 Dual process-based models
Dual-process theory is widely accepted as the dominant account of human cognition, decision-making, and intuitive judgment. Kahneman et al. [44] argue that two types of thought processes co-exist: the first is a fast, automatic, experiential, implicit, associative, intuitive process (System-I), and the other is a slow, controlled, analytical, explicit, rational reasoning process (System-II). System-I, associated with fast thinking and impulsive actions, corresponds to tasks of cognitive ease, e.g., driving, brushing teeth, eating pizza, and telling someone our name. It performs such tasks with minimal effort and attention but requires consistent practice to achieve automaticity. System-I is intuition-based; it works on heuristics that may be wrong and normally does not offer explanations. System-I performs well in familiar situations, such as automatic pattern recognition or heuristic evaluation, and can be represented by some deep learning classification models. System-II, corresponding to the slow thinking of dual-process theory, is conscious, deliberate, and controlled, thus providing analytical reasoning. System-II is critical for tasks that require more memory, attention, and effort, e.g., stock market analysis or solving complex mathematical equations. Its distinctive capability is to uncover hidden and complex relationships among data using logic, optimization, and prior knowledge while providing logical reasoning for a decision through introspection. It must logically work through a task before concluding a decision and closely resembles memory-based attention systems. The judgments made by these two systems may contradict one another, but their interplay is often required to reach a better judgment. Kahneman et al. [82] argue that overall intuitive judgment depends on the interplay of these competing systems. Recently, many researchers have drawn on dual-process theory when proposing deep learning architectures that mimic human reasoning and decision processes.
Researchers inspired by dual-process theory have devised a variety of architectures to address different problems. Chen et al. [83] introduced DRNets, which couple prior knowledge with the problem structure by combining logical reasoning with neural network optimization. DRNets encode the input data in a latent space to exploit the structure and prior knowledge among the data points. The encoded input is decoded using a generative decoder to achieve the desired output. For the final evaluation, the output from the decoder (fast thinking) is combined with that of the reasoning module (slow thinking) and optimized using constraint-aware stochastic gradient descent. This algorithm has been applied to complex problems such as Multi-MNIST-Sudoku, Boolean satisfiability (SAT), and crystal-structure phase mapping. Although this architecture offers an interpretable solution, it cannot provide controllable explanations like our proposed architecture.
Although cross-attention models are accurate for image retrieval, they are slow and thus not suited to large-scale applications. Miech et al. [84] introduced a text-based search of images and videos using a vision transformer. The architecture is inspired by the fast-and-slow thinking theory and uses Dual Encoders (DE) as the fast module and a cross-attention (CA) based vision transformer as the slow module. DE maps text and images into a joint latent space in which similarity scores provide fast but less accurate approximate nearest neighbors. The resulting candidates are further scrutinized by the multimodal, indexable CA module. The overall efficacy of this architecture is improved by first distilling knowledge from the slow but accurate cross-attention module into the fast but less accurate DE module. At test time, the retrieved results are re-ranked by the CA module, which compares all image segments with each word. The model is trained using a bidirectional captioning loss, improving results for multimodal image retrieval. Contrary to our proposed architecture, this model is not interpretable and is therefore unsuitable for critical domains where explanations are needed.
Anthony et al. [85] introduced a novel reinforcement learning architecture (ExIt) to address planning and generalization tasks separately. Unlike traditional reinforcement learning (RL), which uses a neural network for both the discovery and generalization of plans, ExIt uses a tree-search algorithm (slow and deliberate, analogous to System-II) to search for expert steps while using a neural network (fast and unconscious, analogous to System-I) for strong and fast intuition learning that guides the search. In this architecture, System-I imitates System-II initially and, with consistent learning, suggests steps to improve the expert policy at each iteration. This arrangement allows fast intuition learning in System-I and state-of-the-art performance on challenging tasks, such as playing Hex. The algorithm has proved itself among state-of-the-art heuristic search methods by beating MoHex 1.0 but fails to provide the interpretability required in critical tasks.
3 Materials and methods
In this section, we describe the proposed deep learning model architecture, designed to achieve four key objectives: 1) provide concept-based local explanations for the model's decisions, 2) enable human control over the selection of salient concepts, 3) share similar concepts across different classes, and 4) offer user-guided local explanations. The architecture is inspired by dual-process theory, which distinguishes between fast, non-interpretable low-level processes and slow, interpretable high-level processes.
To align with this theory, we decompose shallow learning—focused on low-level representations—from deep learning, which involves reasoning with high-level concepts. We build a two-stream model architecture to address these distinct processes, as illustrated in Fig 1.
[Figure omitted. See PDF.]
The shallow Convolutional Neural Network acts as System-I and the Concept Memory Network as System-II of the dual-process theory of human cognition.
For shallow learning, we utilize a shallow convolutional neural network (sCNN), which is essential for exploiting inductive biases present in the data. We employ a cross-attentional memory network for high-level learning, allowing the model to reason with high-level concepts.
To enhance controllability and offer user-guided explanations, we enable users to define the high-level concepts that the model uses during training and predictions. To support this, we propose a methodology for extracting human-understandable concepts from a given dataset of images. This concept extraction method is detailed in subsection 3.1, while the overall model architecture is described in subsection 3.2.
3.1 Concept mining
The model employs an automatic concept extraction mechanism inspired by Ghorbani, Amirata, et al. [66]. To tackle the issue of background inclusion within these concepts, the model integrates a U-Net architecture that is trained in an unsupervised manner. This U-Net effectively learns to disentangle object foregrounds from backgrounds by being alternately trained with a layered Generative Adversarial Network (GAN) [46]. The GAN helps refine the segmentation by generating high-quality foreground and background masks, ensuring that the extracted concepts are focused on the object of interest rather than irrelevant background details.
For further segmentation of the extracted object images, the model employs Simple Linear Iterative Clustering (SLIC) [86], which partitions each image into superpixels. These superpixels are then grouped into clusters, representing distinct concepts based on semantic similarity. This similarity is computed using feature embeddings from a pre-trained ResNet34 model [87]. ResNet34 captures rich hierarchical features from the image, which are used to measure the closeness of the segments in feature space. The clustered segments form the basis of the concepts used within the model, and each concept is defined as a subset of an image I, allowing for a structured and semantically meaningful representation of the image content:

C_i = {c_ij | j = 1, …, |C_i|}, with c_ij ⊂ I and i = 1, …, m, (1)

where c_ij represents the jth segment of the ith concept, |C_i| is the number of segments that belong to the ith concept, and m denotes the number of extracted concepts. A two-stage concept screening approach is then applied to choose the most relevant concepts.
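The following is a minimal sketch of this concept-mining step, assuming a foreground mask from the unsupervised U-Net is already available; superpixels are produced with SLIC [86] and segment embeddings with a pre-trained ResNet34 [87]. Function and parameter names (e.g., extract_segment_embeddings, num_segments) are illustrative, not the exact settings used in the paper.

```python
import numpy as np
import torch
import torchvision.transforms as T
from torchvision.models import resnet34
from skimage.segmentation import slic

def extract_segment_embeddings(image, foreground_mask, num_segments=15):
    """Split the masked object into superpixels and embed each one (image: uint8 HxWx3)."""
    masked = image * foreground_mask[..., None]          # remove background
    labels = slic(masked, n_segments=num_segments, compactness=10)

    backbone = resnet34(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()                    # 512-d feature vectors
    backbone.eval()
    preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

    segments, embeddings = [], []
    for seg_id in np.unique(labels):
        patch = masked.copy()
        patch[labels != seg_id] = 0                      # keep a single superpixel
        segments.append(patch)
        with torch.no_grad():
            emb = backbone(preprocess(patch.astype(np.uint8)).unsqueeze(0))
        embeddings.append(emb.squeeze(0))
    return segments, torch.stack(embeddings)             # candidate concepts + vectors
```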
Manual Screening: To ensure controllability, users filter these concepts manually to choose the most relevant concepts among all available concepts.
K-Means Clustering and Screening: The purpose of this step is twofold: first, to reduce the number of concepts needed to represent a class; second, to share symmetrical and common concepts among various classes. Each segment retained after the manual screening step is passed through a pre-trained CNN (e.g., VGG19 pre-trained on ImageNet) and mapped to a fixed-length 512-dimensional vector v_c in the activation space. In this space, the Euclidean distance measures the similarity among segments, and similar segments are grouped into the same concept using K-Means clustering. The concept nearest to each cluster center is chosen as that cluster's representative. This step represents each class with an even smaller number of salient concepts and helps the network scale to larger datasets with many classes. A minimal sketch of this screening step is given below; the remaining methodology is discussed in the next section.
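The sketch below illustrates the similarity-screening step, assuming the 512-dimensional segment embeddings computed above are stored as a NumPy array. The number of clusters (ten per class) follows the description in subsection 3.8; names such as screen_concepts and per_class are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def screen_concepts(embeddings, num_classes, per_class=10):
    """Cluster concept embeddings and keep one representative per cluster."""
    m = num_classes * per_class                      # number of clusters
    kmeans = KMeans(n_clusters=m, n_init=10, random_state=0).fit(embeddings)

    representatives = []
    for k, centroid in enumerate(kmeans.cluster_centers_):
        members = np.where(kmeans.labels_ == k)[0]
        # pick the member embedding closest to the centroid (Euclidean distance)
        dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
        representatives.append(members[np.argmin(dists)])
    return representatives                           # indices of the salient concepts
```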
3.2 Proposed architecture
The core idea behind the proposed dual-branch architecture is to segregate low-level feature processing from high-level feature processing, thus replicating the dual thought processes of human cognition mentioned earlier (section 2.5). Each stream takes an input image X of dimension H_input × W_input × D_input and produces a vector of dimension D_output; the two vectors are concatenated to form the final fully connected layer of the architecture (Fig 1). We call these streams System-I and System-II, referring to the two systems of dual-process theory.
System-I is a simple shallow Convolutional Neural Network (sCNN) comprising four (04) convolution layers and one (01) dense layer, and it produces a vector v_cnn as the output of its stream. System-II is a cross-attention-based Concept Memory Network, which provides appropriate high-level, human-understandable reasoning for the decision of the whole model. System-II pairs the segments (x_0, x_1, …, x_n) of the input image X with the predefined concepts stored in the memory network, extracted using the mechanism elaborated in section 3.1. Each concept represents a subpart of some input image with size H_concept × W_concept, where H_concept ≤ H_input and W_concept ≤ W_input. For ease of computation, both the input-image segments and the concepts are represented as fixed-length vectors of the same size L (512), denoted v_i and v_c, respectively. These vectors are the outputs of the fully connected layers of pre-trained ResNet34 [87] and VGG19 [88].
In the context of our problem, we implement a soft attention mechanism on the input image by leveraging selected concepts from the memory network. Mathematically,

Attention(Q, K, V) = softmax(QK^T / √d_k) V, (2)

where Q (Query) contains the vectors of the input-image segments and both K (Key) and V (Value) contain the vectors representing concepts from the memory network.
The attention mechanism proceeds as follows (a minimal code sketch is given after the list):
* Dot product: Similarity scores between all selected memory concepts and all input-image segments are computed using the dot product, similarity(q_i, k_j) = q_i ⋅ k_j.
* Scaling: The resulting similarity scores are scaled by the square root of the Key dimension (√d_k) to stabilize the gradients, thus avoiding vanishing or exploding gradients.
* Attention score: SoftMax is applied to the scaled scores to compute attention scores, which depict the importance of each Key (selected concept from the memory network, and hence Value) relative to the Query (input-image segments): attention_weights(i, j) = softmax_j(q_i ⋅ k_j / √d_k).
* Weighted sum: As a final step, each V vector (the concepts from the memory network) is multiplied by the corresponding attention score, and the results are summed to produce the final output of System-II, the vector v_mem, where v_mem(i) = ∑_j attention_weights(i, j) ⋅ v_j.
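A minimal sketch of these four steps, assuming `segments` holds the 512-dimensional vectors of the input-image segments (queries) and `memory` holds the vectors of the user-selected concepts (keys and values); function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def concept_cross_attention(segments: torch.Tensor, memory: torch.Tensor):
    """segments: (n, 512) query vectors; memory: (m, 512) concept vectors."""
    d_k = memory.size(-1)
    scores = segments @ memory.T / d_k ** 0.5        # dot product + scaling
    weights = F.softmax(scores, dim=-1)              # attention scores per segment
    v_mem = weights @ memory                         # weighted sum over concepts
    return v_mem, weights                            # (n, 512), (n, m)
```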
The vectors from the two streams (v_cnn from System-I and v_mem from System-II) are fused to yield a single combined vector v_out, on which softmax is applied to finalize the classification, as sketched below. The input concepts and their attention scores, together with the low-level features computed by the counterpart stream, thus yield an interpretable fine-grained image classifier.
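The following is a minimal sketch of the full two-stream forward pass, reusing the concept_cross_attention function above. Layer widths, the mean-pooling of the per-segment attention outputs into a single v_mem vector, and all names are assumptions made for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CASoftNet(nn.Module):
    def __init__(self, concept_bank: torch.Tensor, num_classes: int, d: int = 512):
        super().__init__()
        # System-I: shallow CNN (four conv layers + one dense layer)
        self.scnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, d),
        )
        # System-II: fixed bank of user-selected concept vectors, shape (m, d)
        self.register_buffer("memory", concept_bank)
        self.classifier = nn.Linear(2 * d, num_classes)

    def forward(self, images, segment_vectors):
        """images: (B, 3, H, W); segment_vectors: (B, n, d) input-segment vectors."""
        v_cnn = self.scnn(images)                                  # System-I output
        v_mem, _ = concept_cross_attention(segment_vectors, self.memory)
        v_out = torch.cat([v_cnn, v_mem.mean(dim=1)], dim=-1)      # fuse both streams
        return self.classifier(v_out)                              # logits for softmax
```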
3.3 Unsupervised object segmentation model training
For unsupervised object segmentation, a layered GAN model [46] is trained on the CUB [89], Cars [90], ISIC 2016 [91], and ISIC 2017 [92] datasets separately, without any ground-truth labels for the training images. The U-Net model, trained alternately with the layered GAN using the loss in Eq 3, regularizes the layered GAN model for each dataset separately.(3)
3.4 Main model training
For the training of the combined model, we optimize the logarithmic loss function

L_LL(x_i) = − ∑_{i=1}^{N} y_i log(p_i), (4)

where N is the number of classes, y_i is the ground-truth label, and p_i is the softmax probability of the ith class. The penalty imposed by this function is logarithmic: it is large when the predicted class is far from the actual one and small when it is close. This loss function only optimizes System-I, since the choice of concepts for System-II is fixed by the user (concept controllability). The choice of concepts strongly affects the efficacy of System-II, so it should be made with great care; only the most relevant concepts should be chosen to represent each class.
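A minimal training sketch under these assumptions: only the System-I (sCNN) and fusion-classifier parameters are optimized with the logarithmic loss of Eq 4, while the System-II concept memory stays fixed by the user's selection. The CASoftNet sketch from subsection 3.2 is reused; concept_bank, the number of classes, the learning rate, and train_loader are illustrative assumptions.

```python
import torch

model = CASoftNet(concept_bank, num_classes=200)      # e.g., CUB has 200 classes
criterion = torch.nn.CrossEntropyLoss()               # logarithmic loss of Eq 4
optimizer = torch.optim.Adam(
    list(model.scnn.parameters()) + list(model.classifier.parameters()), lr=1e-4
)

for images, segment_vectors, labels in train_loader:
    logits = model(images, segment_vectors)            # fuses v_cnn and v_mem
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```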
3.5 Baseline models
We have implemented SoftNet [43] and ProtoPNet [50] as our baseline models to compare their results with our proposed model. Both models are intrinsically interpretable and provide high-level explanations along with their decisions.
3.6 Datasets and experimental settings
We utilize four (04) datasets, CUB 200-2011 [89], Stanford Cars [90], ISIC 2016 [91], and ISIC 2017 [92], to evaluate our proposed model.
CUB is widely used for fine-grained classification and contains 200 bird classes with 11,788 images, divided into 5,994 training and 5,794 test images. The Stanford Cars dataset contains 16,185 images across 196 classes, roughly divided into 8,144 training and 8,041 testing images. In both datasets, several classes share common attributes, such as the same color or texture of different object parts (e.g., a bird's beak, feathers, head, or tail), so they are hard to distinguish even for human beings; for this reason, several researchers have used these datasets to evaluate interpretable fine-grained classification models. It is even more challenging to extract and utilize distinct global concepts within an interpretable model.
Furthermore, we use the ISIC 2016 and ISIC 2017 skin cancer datasets in this paper. The ISIC 2016 dataset, introduced in December 2015, consists of 1,279 dermoscopic images selected from the ISIC Archive. After excluding 273 images from the 1,552 initially selected, the remaining 1,279 images are divided into benign and malignant classes and randomly split into a training set of 900 images and a test set of 379 images. The ISIC 2017 dataset comprises 2,750 skin cancer images, including 2,000 training images, 150 validation images, and 600 test images. The dataset provides ground-truth labels and patient metadata for lesion classification into three classes: melanoma (374), nevus (1,372), and seborrheic keratosis (254). The images range in size from 540 × 722 to 4,499 × 6,748 pixels, offering high resolution. These two small datasets are chosen to show that our proposed model, which employs a shallow architecture, can achieve better efficacy than related models.
As mentioned earlier in section 3, our proposed model is a two-stream system. System-I is a four-layered (04) CNN model with one (01) fully connected layer. The convolution layers are initialized with the weights of VGG and ResNet layers pre-trained on ImageNet, thus corresponding to low-level features without high-level feature abstraction (as desired by our architecture). System-II is a cross-attention concept-based memory network. System-I and System-II produce vectors of the same size, which are concatenated to form the final layer of the model. For the ablation study, System-II is turned on and off to evaluate its impact on the overall performance of the whole system. Both systems take images as input: System-I performs low-level processing, whereas System-II acts as a high-level reasoning module. We also test our proposed model on images with and without background. The code for this paper is publicly available at https://github.com/mirzaahsan1986/CASoftNet and the minimal dataset at https://figshare.com/projects/InterpretableConvolutionNeuralNetworks/211759.
3.7 Concept extraction
For concept extraction, we extended the technique proposed by Ghorbani et al. [67]. The concepts extracted from the CUB dataset using this technique are shown in Fig 2. The extracted concepts include irrelevant (vague and repetitive) concepts, mainly background portions.
[Figure omitted. See PDF.]
Some images from the CUB dataset and their corresponding concepts were extracted without removing irrelevant backgrounds.
To address this issue and disentangle objects from the background, we employed the U-Net model trained in parallel with layered-GAN in an unsupervised manner [46], as discussed in section 3.3. Using this trained U-Net model, we extract objects as shown in Figs 3–5 for CUB, Cars, ISIC 2016 and ISIC 2017 datasets respectively.
[Figure omitted. See PDF.]
The first and third column shows original images, whereas the second and fourth column shows their corresponding images with the background removed using U-Net [46].
[Figure omitted. See PDF.]
The first and third column shows original images, whereas the second and fourth column shows their corresponding images with the background removed using U-Net [46].
[Figure omitted. See PDF.]
Examples of Images from ISIC 2016 and ISIC 2017 datasets along with the extracted objects.
These extracted objects are segmented for concept extraction before being passed to a pre-trained CNN classifier. For image segmentation, we employ several techniques, namely SLIC [86], Quickshift [93], Entropy Rate Superpixel (ERS) [94], and Felzenszwalb [95], to compare which technique aids in more relevant clustering, providing each method with images both with and without background. As is apparent from the results shown in Table 1, SLIC performs better than the other segmentation techniques; Quickshift produces better segments than the remaining methods but less meaningful ones than SLIC.
[Figure omitted. See PDF.]
The results of this experiment are shown in Table 1 for the CUB and Cars datasets, respectively.
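For reference, a minimal sketch of such a comparison using the scikit-image implementations of SLIC, Quickshift, and Felzenszwalb (ERS is not part of scikit-image and is omitted here); the parameter values are illustrative rather than those used in the paper.

```python
from skimage.segmentation import slic, quickshift, felzenszwalb

def compare_superpixel_methods(image):
    """Return label maps from three superpixel algorithms for visual comparison."""
    return {
        "slic": slic(image, n_segments=15, compactness=10),
        "quickshift": quickshift(image, kernel_size=5, max_dist=10),
        "felzenszwalb": felzenszwalb(image, scale=200, sigma=0.8, min_size=50),
    }
```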
3.8 Concept representation
To compare the efficacy of different pre-trained models for feature representation of the extracted concepts, we evaluate several state-of-the-art CNN models, including VGG19 and ResNet34, as shown in Table 2. These models accept fixed-size images as input, but the concepts extracted using the method stated above are of arbitrary size, so we resize the concepts without preserving their aspect ratio. It is evident from this study that the vector representation obtained using ResNet34 is better than that of the other models, as this model is both deep and dense (Table 2). We further combined these two models for feature representation, and this combination performed even better, as evident from the results in Table 2. We use this combined representation, discussed later, in our experiments.
[Figure omitted. See PDF.]
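A minimal sketch of such a combined representation, assuming each concept patch is resized to a fixed input size (aspect ratio not preserved) and embedded with both VGG19 and ResNet34, with the two vectors concatenated. The concatenation, the choice of penultimate layers, and the resulting dimensionality are assumptions for illustration; mapping to the fixed 512-dimensional vectors mentioned earlier would require an additional projection not shown here.

```python
import torch
import torchvision.transforms as T
from torchvision.models import vgg19, resnet34

# Backbones pre-trained on ImageNet; final classification layers removed so
# that each network yields a feature vector for a concept patch.
vgg = vgg19(weights="IMAGENET1K_V1").eval()
vgg.classifier = vgg.classifier[:-1]          # 4096-d penultimate features
res = resnet34(weights="IMAGENET1K_V1").eval()
res.fc = torch.nn.Identity()                  # 512-d features

to_input = T.Compose([T.Resize((224, 224)), T.ToTensor()])  # ignores aspect ratio

@torch.no_grad()
def embed_concept(pil_patch):
    """Return the concatenated VGG19 + ResNet34 representation of a concept patch."""
    x = to_input(pil_patch).unsqueeze(0)
    return torch.cat([vgg(x).squeeze(0), res(x).squeeze(0)])
```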
All the concepts considered vital for a particular class in each dataset are manually chosen and later reduced using K-Means clustering. Following the concept extraction phase, K-Means clustering serves two purposes: first, it aids in selecting the prominent concepts; second, it helps the model leverage the similarity, i.e., common concepts, among distinct classes. This step is vital for incorporating the human-like thought process of recognizing similarities between images to predict their respective classes. Each potential concept is encoded as a 512-dimensional vector using a pre-trained VGG16 encoder W(·). These vectors are subsequently organized into m clusters, where m = number of classes × 10. Through this clustering process, we obtain m cluster centroids {μ_1, …, μ_m}, and for each cluster the concept nearest to its centroid under the K-Means objective, C_c = argmin_c ||W(c) − μ_k||, is designated as that cluster's representative, where C_c denotes the concept nearest to centroid μ_k.
The images and the extracted concepts after background removal are shown in Figs 6–8 for CUB, Cars and ISIC datasets respectively.
[Figure omitted. See PDF.]
The first column from the left shows examples of images from the CUB dataset, whereas the first four columns from the right side show corresponding extracted concepts.
[Figure omitted. See PDF.]
The first column from the left shows examples of images from the Cars dataset, whereas the first four columns from the right side show corresponding extracted concepts.
[Figure omitted. See PDF.]
(a) The first column from the left shows examples of images from the ISIC 2017 dataset, whereas the next three columns show corresponding extracted concepts. (b) The first column from the left shows examples of images from the ISIC 2016 dataset, whereas the next three columns show corresponding extracted concepts.
3.9 Qualitative experiments
We conduct comprehensive qualitative human studies to enable expert and non-expert users to assess the performance of our proposed model and understand the internal reasoning mechanism.
Explanation Satisfaction: For an explanation to be human-understandable, it must possess desirable characteristics such as meaningfulness, coherence, and importance [66]. We rate these metrics for the explanations generated by the system on a Likert scale of 1 to 5 (1 = Strongly Disagree, 2 = Somewhat Disagree, 3 = Neutral, 4 = Somewhat Agree, and 5 = Strongly Agree). We categorize participants into two groups based on their machine learning expertise: people with a machine learning background and familiarity with its training and testing process form the expert group, whereas the others form the non-expert group. In the initial phase, the users are introduced to the reasoning and training process of the model, and input images along with the explanations generated by the model are shared with them. When the model is in the testing phase and evaluated against the accuracy metric, the input test images and resulting explanations are again shown to the users, who are asked to rate each explanation using the metrics above. In the second phase, only the test images are shown to both groups, without sharing any further information about the model's training and reasoning process.
Before commencing the survey, we informed the participants about the study and publication of the inferred results. We obtained a signed consent form from participants who willingly agreed to participate. Additionally, all individuals involved in the study are 20 or older. The study period extended from October 9th, 2023, to October 13th, 2023, encompassing a five-day duration.
4 Results and discussion
In our first experiment, we examine why we employed a two-stream system, beyond the motivation from dual-process theory. As the cross-attentional Concept Memory Network (System-II) is the core of our proposed architecture, we switch the sCNN (System-I) on and off to show the resulting change in performance. For concept mining, we use images after removing their backgrounds using U-Net, and a ResNet34-inspired shallow CNN is used to extract feature vectors in System-I. From the details of this experiment shown in Table 3, it is evident that the two-stream system has a remarkable advantage over a single-stream system.
[Figure omitted. See PDF.]
We compare the accuracy of our proposed model (CA-SoftNet) with a non-interpretable baseline (a basic CNN model with eight (08) convolutional layers), ProtoPNet [50], ProtoPShare [51], and SoftNet [43]. Table 4 shows the performance comparison of these methods with our proposed model. For a fair comparison, all models are trained on these datasets using images without background. The results in Table 4 show that our proposed model outperforms these models on CUB, ISIC 2016, and ISIC 2017, whereas ProtoPShare has the highest accuracy on the Cars dataset. We attribute the performance gain of our proposed model to its dual-process-inspired architecture, which processes low-level features along with high-level concepts, as shown in Table 4. Figs 9 and 10 give a concept-level illustration of the results of ProtoPNet, SoftNet, and our proposed model for an example input image. Although concept-based and controllable, the explanations provided by SoftNet [43] can sometimes lack relevance due to the separation of concepts for each class. In contrast, the conceptual explanations provided by ProtoPNet and ProtoPShare are not controllable, since the model learns them internally during training; furthermore, these explanations frequently include irrelevant concepts, limiting their usefulness.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
The first column shows the input features attended by each model, whereas the second and third columns depict the top 2 most relevant concepts considered by our model and SoftNet. The last column shows the input query image on the left and the corresponding learned features on the right.
[Figure omitted. See PDF.]
The first column shows the attended input features by each model, whereas the second and third columns depict the top 2 most relevant concepts considered by our model and SoftNet. The last column indicates the input query image on the left side, denoting corresponding learned features on the right.
4.1 Comparison with related models
As our model is an attention-based interpretable model, we compare it with object-level, part-level, and concept-level attention models, as well as other attention models such as CAP [96], MMAL-NET [97], and FFVT [98]. The results of this comparison are shown in Table 5, where "bbox" indicates that the model was trained by cropping the input image using the given bounding box, "anno" indicates that the model was trained using keypoint annotations, "bbox+anno" means both of the preceding, and "full" means that the model was trained and tested on full images (without cropping or keypoint annotations). These results indicate that our proposed model achieves accuracy comparable to non-interpretable and some part-level attention models while surpassing most part-level attention models and all object-level attention models.
[Figure omitted. See PDF.]
4.2 Concept clustering
The concept clustering results are illustrated in Fig 11, providing empirical evidence of the cognitive and deep learning processes within the proposed model. The top section of the figure displays the clusters and their constituent concepts, while the bottom section highlights the selected centroids of each cluster. The K-means clustering algorithm effectively groups similar concepts based on visual similarity, regardless of whether they belong to the same image class or different classes. For instance, Cluster 149 comprises concepts like the beak and head of a brown pelican, all from the same ‘brown pelican’ class, demonstrating strong semantic similarity. On the other hand, Clusters 49, 78, and 110 contain concepts from different classes that are semantically similar. For example, in fine-grained classification or in identifying types of skin cancer, different classes may share characteristics like texture or color.
[Figure omitted. See PDF.]
This figure depicts the clustering of similar concepts among various classes, thus inducing human-like cognition in the proposed model’s reasoning process.
Each cluster contains visually similar concepts. We compute the similarity distance from the centroid for each concept and select the one with the least distance as the representative centroid. This procedure identifies the most salient concept within each cluster, particularly useful when clusters contain concepts from multiple classes. For example, Cluster 78 captures the red-orange body color pattern shared between the painted bunting and vermilion flycatcher classes, aligning with how humans recognize similarities. Similarly, Cluster 110 highlights the symmetry in black head and beak attributes between the brewer blackbird and pigeon guillemot classes.
This clustering technique allows the model to emulate human-like cognitive processes, enhancing the accuracy of image classification tasks. This capability is crucial for bridging the gap between the model’s inferences and human intuition, facilitating the provision of high-level interpretable explanations.
4.3 Justified trust
We evaluated the justified trust in our proposed model with both expert and non-expert groups, determining whether a person can reliably anticipate the model's output for a given input image. The results of these experiments are shown in Table 6 and indicate that our proposed model scores better than some of the latest and most renowned interpretable models.
[Figure omitted. See PDF.]
4.4 Explanation satisfaction
For explanation satisfaction, we used the same expert and non-expert groups described in subsection 3.9 and asked them to rate the generated explanations on a Likert scale from 1 to 5 using the metrics defined there. We showed the evaluators the input image and the generated explanations and asked them to rate the latter. The experiments were conducted on the CUB 200-2011 dataset, with half of the samples used without removing backgrounds (full images) and the other half with backgrounds removed (objects extracted using the mechanism defined in section 3.3). Table 7 shows the results of these experiments.
[Figure omitted. See PDF.]
5 Conclusion
Our proposed model, the Cross-Attention Slow/Fast Thinking Network (CA-SoftNet), is a user-controlled, inherently interpretable classification model inspired by the dual-process theory of human cognition. This model adopts a two-stream approach for image classification. System-I processes low-level features, while System-II handles high-level features, enabling conceptual-level reasoning based on selected high-level concepts.
The model introduces a novel concept extraction process that automatically identifies significant concepts while eliminating irrelevant backgrounds through unsupervised object segmentation. This capability is particularly beneficial in domains like medical imaging, where labeled concepts or pre-trained models are often unavailable. Moreover, our two-step concept selection process allows user control, emphasizing the most salient concepts for each class. This approach also leverages concept-level similarities among visually similar classes, reducing the number of concepts needed to represent each class. As a result, the model’s scalability is enhanced for larger datasets.
Compared to other concept-based interpretable deep learning models, which mainly provide global explanations or quantify concept importance for overall predictions, CA-SoftNet uses these concepts for model prediction. The model assigns attention scores to input image concepts based on their relevance to the selected high-level concepts and utilizes these scores for predictions. Additionally, it offers local high-level explanations based on the prior selected concepts. The primary advantage of our method over post-hoc approaches is that it provides explanations in a manner that is both understandable and desirable to users. In contrast, post-hoc methods often require human introspection to interpret the model’s underlying inference mechanism.
However, despite these advantages, our model has certain limitations. The reliance on human intervention for concept selection necessitates a certain level of expertise, particularly in fields like medicine, which could limit its practical usability. Additionally, the sophisticated mechanisms for concept extraction and cross-attentional processing may result in longer training times than more straightforward models.
We evaluated the proposed model on the CUB 200-2011, Stanford Cars, ISIC 2016, and ISIC 2017 datasets to demonstrate its diversity and superiority over several state-of-the-art interpretable methods reported in the literature. The model achieved results comparable to many non-interpretable methods, with accuracies of 85.6%, 83.7%, 93.6%, and 90.3% on the CUB 200-2011, Stanford Cars, ISIC 2016, and ISIC 2017 datasets, respectively. Furthermore, the model’s ability to provide high-level local explanations based on concepts presents wide-ranging opportunities to enhance transparency, accountability, and trust in various domains. These include decision support systems, medical image analysis, and autonomous vehicles, where deploying deep learning models is both inevitable and crucial.
References
1. Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., et al. Imagenet large scale visual recognition challenge. International Journal Of Computer Vision. 115 pp. 211–252 (2015).
2. Dollár, P., Wojek, C., Schiele, B. & Perona, P. Pedestrian detection: A benchmark. 2009 IEEE Conference On Computer Vision And Pattern Recognition. pp. 304–311 (2009).
3. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., et al. The cityscapes dataset for semantic urban scene understanding. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 3213–3223 (2016).
4. Li P., Fei Q., Chen Z. & Liu X. Interpretable Multi-Channel Capsule Network for Human Motion Recognition. Electronics. 12, 4313 (2023).
* View Article
* Google Scholar
5. 5. Litjens G., Kooi T., Bejnordi B., Setio A., Ciompi F., Ghafoorian M., et al. A survey on deep learning in medical image analysis. Medical Image Analysis. 42 pp. 60–88 (2017). pmid:28778026
* View Article
* PubMed/NCBI
* Google Scholar
6. 6. Esteva A., Kuprel B., Novoa R., Ko J., Swetter S., Blau H., et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 542, 115–118 (2017). pmid:28117445
* View Article
* PubMed/NCBI
* Google Scholar
7. 7. Pintelas E. & Livieris I. XSC—An eXplainable Image Segmentation and Classification Framework: A Case Study on Skin Cancer. Electronics. 12, 3551 (2023).
* View Article
* Google Scholar
8. 8. Guleria P., Naga Srinivasu P., Ahmed S., Almusallam N. & Alarfaj F. XAI framework for cardiovascular disease prediction using classification techniques. Electronics. 11, 4086 (2022).
* View Article
* Google Scholar
9. 9. Zech, John R and Badgeley, Marcus A and Liu, Manway and Costa, Anthony B and Titano, Joseph J and Oermann, Eric K. “Confounding variables can degrade generalization performance of radiological deep learning models.” arXiv preprint arXiv:1807.00431 (2018).
10. 10. Santa Cruz, Beatriz Garcia and Bossa Matías Nicolás and Sölter Jan and Husch Andreas Dominik. “Public covid-19 x-ray datasets and their impact on model bias–a systematic review of a significant problem.” Medical image analysis 74 (2021): 102225.
* View Article
* Google Scholar
11. 11. Koh, Pang Wei and Sagawa, Shiori and Marklund, Henrik and Xie, Sang Michael and Zhang, Marvin and Balsubramani, Akshay, et al. “Wilds: A benchmark of in-the-wild distribution shifts.” International conference on machine learning. PMLR, 2021.
12. 12. Sagawa, Shiori and Raghunathan, Aditi and Koh, Pang Wei and Liang, Percy. “An investigation of why overparameterization exacerbates spurious correlations.” International Conference on Machine Learning. PMLR, 2020.
13. 13. Kawahara Jeremy, et al. “Seven-point checklist and skin lesion classification using multitask multimodal neural nets.” IEEE journal of biomedical and health informatics 23.2 (2018): 538–546.
* View Article
* Google Scholar
14. 14. Naveed Asim and Naqvi Syed S and Khan Tariq M and Razzak Imran. “PCA: Progressive class-wise attention for skin lesions diagnosis.” Engineering Applications of Artificial Intelligence 127 (2024): 107417.
* View Article
* Google Scholar
15. 15. Razzak, M., Naz, S. & Zaib, A. Deep learning for medical image processing: Overview, challenges and the future. Classification In BioApps: Automation Of Decision Making. pp. 323–350 (2018).
16. 16. He K., Gan C., Li Z., Rekik I., Yin Z., Ji W., et al. Transformers in medical image analysis. Intelligent Medicine. 3, 59–78 (2023).
* View Article
* Google Scholar
17. 17. Tian, Y., Pei, K., Jana, S. & Ray, B. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. Proceedings Of The 40th International Conference On Software Engineering. pp. 303–314 (2018).
18. 18. Kraus M. & Feuerriegel S. Decision support from financial disclosures with deep neural networks and transfer learning. Decision Support Systems. 104 pp. 38–48 (2017), https://www.sciencedirect.com/science/article/pii/S0167923617301793.
* View Article
* Google Scholar
19. 19. Antoniadi A., Du Y., Guendouz Y., Wei L., Mazo C., Becker B., et al. Current challenges and future opportunities for XAI in machine learning-based clinical decision support systems: a systematic review. Applied Sciences. 11, 5088 (2021).
* View Article
* Google Scholar
20. 20. Kotsiantis S. Use of machine learning techniques for educational proposes: a decision support system for forecasting students’ grades. Artificial Intelligence Review. 37 pp. 331–344 (2012).
* View Article
* Google Scholar
21. 21. Tjoa E. & Guan C. A survey on explainable artificial intelligence (xai): Toward medical xai. IEEE Transactions On Neural Networks And Learning Systems. 32, 4793–4813 (2020).
* View Article
* Google Scholar
22. 22. Goodman B. & Flaxman S. European Union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine. 38, 50–57 (2017).
* View Article
* Google Scholar
23. 23. Hendricks, L., Burns, K., Saenko, K., Darrell, T. & Rohrbach, A. Women also snowboard: Overcoming bias in captioning models. Proceedings Of The European Conference On Computer Vision (ECCV). pp. 771–787 (2018).
24. 24. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. & Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. ArXiv Preprint ArXiv:1811.12231. (2018).
25. 25. Wu, W., Xu, H., Zhong, S., Lyu, M. & King, I. Deep validation: Toward detecting real-world corner cases for deep neural networks. 2019 49th Annual IEEE/IFIP International Conference On Dependable Systems And Networks (DSN). pp. 125–137 (2019).
26. 26. Mahmoudi S., Amel O., Stassin S., Liagre M., Benkedadra M. & Mancas M. A Review and Comparative Study of Explainable Deep Learning Models Applied on Action Recognition in Real Time. Electronics. 12, 2027 (2023).
* View Article
* Google Scholar
27. 27. Chattopadhay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. 2018 IEEE Winter Conference On Applications Of Computer Vision (WACV). pp. 839–847 (2018).
28. 28. Mahendran, A. & Vedaldi, A. Understanding deep image representations by inverting them. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 5188–5196 (2015).
29. 29. Ribeiro, M., Singh, S. & Guestrin, C. Model-agnostic interpretability of machine learning. ArXiv Preprint ArXiv:1606.05386. (2016).
30. 30. Selvaraju, R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. & Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings Of The IEEE International Conference On Computer Vision. pp. 618–626 (2017).
31. 31. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A., et al. Attention is all you need. Advances In Neural Information Processing Systems. 30 (2017).
* View Article
* Google Scholar
32. 32. Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., et al. Residual attention network for image classification. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 3156–3164 (2017).
33. 33. Zeiler, M. & Fergus, R. Visualizing and understanding convolutional networks. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13. pp. 818–833 (2014).
34. 34. Jaworek-Korjakowska J., Brodzicki A., Cassidy B., Kendrick C. & Yap M. Interpretability of a deep learning based approach for the classification of skin lesions into main anatomic body sites. Cancers. 13, 6048 (2021). pmid:34885158
* View Article
* PubMed/NCBI
* Google Scholar
35. 35. Boumaraf S., Liu X., Wan Y., Zheng Z., Ferkous C., Ma X., et al. Conventional machine learning versus deep learning for magnification dependent histopathological breast cancer image classification: A comparative study with visual explanation. Diagnostics. 11, 528 (2021). pmid:33809611
* View Article
* PubMed/NCBI
* Google Scholar
36. 36. Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). International Conference On Machine Learning. pp. 2668–2677 (2018).
37. 37. Gimenez, J., Ghorbani, A. & Zou, J. Knockoffs for the mass: new feature importance statistics with false discovery guarantees. The 22nd International Conference On Artificial Intelligence And Statistics. pp. 2125–2133 (2019).
38. 38. Adebayo J., Gilmer J., Muelly M., Goodfellow I., Hardt M. & Kim B. Sanity checks for saliency maps. Advances In Neural Information Processing Systems. 31 (2018).
* View Article
* Google Scholar
39. 39. Liu, W., Rabinovich, A. & Berg, A. Parsenet: Looking wider to see better. ArXiv Preprint ArXiv:1506.04579. (2015).
40. 40. Wang F. & Rudin C. Falling rule lists. Artificial Intelligence And Statistics. pp. 1013–1022 (2015).
* View Article
* Google Scholar
41. 41. Wei X., Yang Q., Gong Y., Ahuja N. & Yang M. Superpixel hierarchy. IEEE Transactions On Image Processing. 27, 4838–4849 (2018). pmid:29969395
* View Article
* PubMed/NCBI
* Google Scholar
42. 42. Zhang, R., Isola, P., Efros, A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 586–595 (2018).
43. 43. Zia T., Bashir N., Ullah M. & Murtaza S. SoFTNet: A concept-controlled deep learning architecture for interpretable image classification. Knowledge-Based Systems. 240 pp. 108066 (2022).
* View Article
* Google Scholar
44. 44. Daniel, K. Thinking, fast and slow. (2017).
45. 45. Bengio Y. From system 1 deep learning to system 2 deep learning. Neural Information Processing Systems. (2019).
* View Article
* Google Scholar
46. 46. Yang, Y., Bilen, H., Zou, Q., Cheung, W. & Ji, X. Learning foreground-background segmentation from improved layered GANs. Proceedings Of The IEEE/CVF Winter Conference On Applications Of Computer Vision. pp. 2524–2533 (2022).
47. 47. Fang, Z., Kuang, K., Lin, Y., Wu, F. & Yao, Y. Concept-based explanation for fine-grained images and its application in infectious keratitis classification. Proceedings Of The 28th ACM International Conference On Multimedia. pp. 700–708 (2020).
48. 48. Sun, M., Yuan, Y., Zhou, F. & Ding, E. Multi-attention multi-class constraint for fine-grained image recognition. Proceedings Of The European Conference On Computer Vision (ECCV). pp. 805–821 (2018).
49. 49. Papernot, N. & McDaniel, P. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. ArXiv Preprint ArXiv:1803.04765. (2018).
50. 50. Chen C., Li O., Tao D., Barnett A., Rudin C. & Su J. This looks like that: deep learning for interpretable image recognition. Advances In Neural Information Processing Systems. 32 (2019).
* View Article
* Google Scholar
51. 51. Rymarczyk, D., Struski, Ł., Tabor, J. & Zieliński, B. Protopshare: Prototypical parts sharing for similarity discovery in interpretable image classification. Proceedings Of The 27th ACM SIGKDD Conference On Knowledge Discovery & Data Mining. pp. 1420–1430 (2021).
52. 52. Nauta, M., Van Bree, R. & Seifert, C. Neural prototype trees for interpretable fine-grained image recognition. Proceedings Of The IEEE/CVF Conference On Computer Vision And Pattern Recognition. pp. 14933–14943 (2021).
53. 53. Rymarczyk, D., Struski, Ł., Górszczak, M., Lewandowska, K., Tabor, J. & Zieliński, B. Interpretable image classification with differentiable prototypes assignment. Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. pp. 351–368 (2022).
54. 54. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 2921–2929 (2016).
55. 55. Gao Y., Liu J., Li W., Hou M., Li Y. & Zhao H. Augmented Grad-CAM++: Super-Resolution Saliency Maps for Visual Interpretation of Deep Neural Network. Electronics. 12, 4846 (2023).
* View Article
* Google Scholar
56. 56. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. ArXiv Preprint ArXiv:1312.6034. (2013).
57. 57. Bach S., Binder A., Montavon G., Klauschen F., Müller K. & Samek W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One. 10, e0130140 (2015). pmid:26161953
* View Article
* PubMed/NCBI
* Google Scholar
58. 58. Lee, H., Grosse, R., Ranganath, R. & Ng, A. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings Of The 26th Annual International Conference On Machine Learning. pp. 609–616 (2009).
59. 59. Nguyen A., Dosovitskiy A., Yosinski J., Brox T. & Clune J. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances In Neural Information Processing Systems. 29 (2016).
* View Article
* Google Scholar
60. 60. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T. & Lipson, H. Understanding neural networks through deep visualization. ArXiv Preprint ArXiv:1506.06579. (2015).
61. 61. Ribeiro, M., Singh, S. & Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. Proceedings Of The 22nd ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 1135–1144 (2016).
62. 62. Abdullah T., Zahid M., Ali W. & Hassan S. B-LIME: An Improvement of LIME for Interpretable Deep Learning Classification of Cardiac Arrhythmia from ECG Signals. Processes. 11, 595 (2023).
* View Article
* Google Scholar
63. 63. Lundberg S. & Lee S. A unified approach to interpreting model predictions. Advances In Neural Information Processing Systems. 30 (2017).
* View Article
* Google Scholar
64. 64. Rezk Eman, Eltorki Mohamed, and El-Dakhakhni Wael. “Interpretable skin cancer classification based on incremental domain knowledge learning.” Journal of Healthcare Informatics Research 7.1 (2023): 59–83. pmid:36910915
* View Article
* PubMed/NCBI
* Google Scholar
65. 65. Wang Sutong, et al. “Interpretability-based multimodal convolutional neural networks for skin lesion diagnosis.” IEEE transactions on cybernetics 52.12 (2021): 12623–12637.
* View Article
* Google Scholar
66. 66. Ghorbani A., Wexler J., Zou J. & Kim B. Towards automatic concept-based explanations. Advances In Neural Information Processing Systems. 32 (2019).
* View Article
* Google Scholar
67. 67. Chen Z., Bei Y. & Rudin C. Concept whitening for interpretable image recognition. Nature Machine Intelligence. 2, 772–782 (2020).
* View Article
* Google Scholar
68. 68. Zhou, Bolei and Sun, Yiyou and Bau, David and Torralba, Antonio. “Interpretable basis decomposition for visual explanation.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
69. 69. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence. 1, 206–215 (2019). pmid:35603010
* View Article
* PubMed/NCBI
* Google Scholar
70. 70. Hou, Junlin, Jilan Xu, and Hao Chen. “Concept-Attention Whitening for Interpretable Skin Lesion Diagnosis.” arXiv preprint arXiv:2404.05997 (2024).
71. 71. Koh, Pang Wei and Nguyen, Thao and Tang, Yew Siang and Mussmann, Stephen and Pierson, Emma and Kim, Been and Liang, Percy. “Concept bottleneck models.” International conference on machine learning. PMLR, 2020.
72. 72. Wang, Bowen and Li, Liangzhi and Nakashima, Yuta and Nagahara, Hajime. “Learning bottleneck concepts in image classification.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
73. 73. Yan, An and Wang, Yu and Zhong, Yiwu and He, Zexue and Karypis, Petros and Wang, Zihan, et al. “Robust and interpretable medical image classifiers via concept bottleneck models.” arXiv preprint arXiv:2310.03182 (2023).
74. 74. Dubey, A., Gupta, O., Guo, P., Raskar, R., Farrell, R. & Naik, N. Pairwise confusion for fine-grained visual classification. Proceedings Of The European Conference On Computer Vision (ECCV). pp. 70–86 (2018).
75. 75. Hanselmann, H. & Ney, H. Elope: Fine-grained visual classification with efficient localization, pooling and embedding. Proceedings Of The IEEE/CVF Winter Conference On Applications Of Computer Vision. pp. 1247–1256 (2020).
76. 76. Maji, S., Rahtu, E., Kannala, J., Blaschko, M. & Vedaldi, A. Fine-grained visual classification of aircraft. ArXiv Preprint ArXiv:1306.5151. (2013).
77. 77. Du R., Xie J., Ma Z., Chang D., Song Y. & Guo J. Progressive learning of category-consistent multi-granularity features for fine-grained visual classification. IEEE Transactions On Pattern Analysis And Machine Intelligence. 44, 9521–9535 (2021). pmid:34752385
* View Article
* PubMed/NCBI
* Google Scholar
78. 78. Huang, Z. & Li, Y. Interpretable and accurate fine-grained recognition via region grouping. Proceedings Of The IEEE/CVF Conference On Computer Vision And Pattern Recognition. pp. 8662–8672 (2020).
79. 79. Huang Y., Zhang J., Liu R. & Zhao S. Improving Accuracy and Interpretability of CNN-Based Fault Diagnosis through an Attention Mechanism. Processes. 11, 3233 (2023).
* View Article
* Google Scholar
80. 80. Li, O., Liu, H., Chen, C. & Rudin, C. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. Proceedings Of The AAAI Conference On Artificial Intelligence. 32 (2018).
81. 81. Nauta, Meike and Schlötterer, Jörg and Van Keulen, Maurice and Seifert, Christin. “Pip-net: Patch-based intuitive prototypes for interpretable image classification.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
82. 82. Kahneman D., Frederick S. & Others Representativeness revisited: Attribute substitution in intuitive judgment. Heuristics And Biases: The Psychology Of Intuitive Judgment. 49, 74 (2002).
* View Article
* Google Scholar
83. 83. Chen, D., Bai, Y., Zhao, W., Ament, S., Gregoire, J. & Gomes, C. Deep reasoning networks for unsupervised pattern de-mixing with constraint reasoning. International Conference On Machine Learning. pp. 1500-1509 (2020).
84. 84. Miech, A., Alayrac, J., Laptev, I., Sivic, J. & Zisserman, A. Thinking fast and slow: Efficient text-to-visual retrieval with transformers. Proceedings Of The IEEE/CVF Conference On Computer Vision And Pattern Recognition. pp. 9826–9836 (2021).
85. 85. Anthony T., Tian Z. & Barber D. Thinking fast and slow with deep learning and tree search. Advances In Neural Information Processing Systems. 30 (2017).
* View Article
* Google Scholar
86. 86. Achanta R., Shaji A., Smith K., Lucchi A., Fua P. & Süsstrunk S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions On Pattern Analysis And Machine Intelligence. 34, 2274–2282 (2012). pmid:22641706
* View Article
* PubMed/NCBI
* Google Scholar
87. 87. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 770–778 (2016).
88. 88. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. ArXiv Preprint ArXiv:1409.1556. (2014).
89. 89. Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. The caltech-ucsd birds-200-2011 dataset. (California Institute of Technology,2011).
90. 90. Krause, J., Stark, M., Deng, J. & Fei-Fei, L. 3d object representations for fine-grained categorization. Proceedings Of The IEEE International Conference On Computer Vision Workshops. pp. 554–561 (2013).
91. 91. Gutman, D., Codella, N., Celebi, E., Helba, B., Marchetti, M., Mishra, N., et al. Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC). ArXiv Preprint ArXiv:1605.01397. (2016).
92. 92. Codella, N., Gutman, D., Celebi, M., Helba, B., Marchetti, M., Dusza, S., et al., Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). 2018 IEEE 15th International Symposium On Biomedical Imaging (ISBI 2018). pp. 168–172 (2018).
93. 93. Ibrahim A., Salem M. & Ali H. Automatic quick-shift segmentation for color images. International Journal Of Computer Science Issues (IJCSI). 11, 122 (2014).
* View Article
* Google Scholar
94. 94. Liu, M., Tuzel, O., Ramalingam, S. & Chellappa, R. Entropy rate superpixel segmentation. CVPR 2011. pp. 2097–2104 (2011).
95. 95. Felzenszwalb P. & Huttenlocher D. Efficient graph-based image segmentation. International Journal Of Computer Vision. 59 pp. 167–181 (2004).
* View Article
* Google Scholar
96. 96. Behera, A., Wharton, Z., Hewage, P. & Bera, A. Context-aware attentional pooling (cap) for fine-grained visual classification. Proceedings Of The AAAI Conference On Artificial Intelligence. 35, 929–937 (2021).
97. 97. Zhang, F., Li, M., Zhai, G. & Liu, Y. Multi-branch and multi-scale attention learning for fine-grained visual categorization. MultiMedia Modeling: 27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part I 27. pp. 136–147 (2021).
98. 98. Wang, J., Yu, X. & Gao, Y. Feature fusion vision transformer for fine-grained visual categorization. ArXiv Preprint ArXiv:2107.02341. (2021).
99. 99. Lin, T., RoyChowdhury, A. & Maji, S. Bilinear CNN models for fine-grained visual recognition. Proceedings Of The IEEE International Conference On Computer Vision. pp. 1449–1457 (2015).
100. 100. Zhang, N., Donahue, J., Girshick, R. & Darrell, T. Part-based R-CNNs for fine-grained category detection. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13. pp. 834–849 (2014).
101. 101. Huang, S., Xu, Z., Tao, D. & Zhang, Y. Part-stacked CNN for fine-grained visual categorization. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 1173–1182 (2016).
102. 102. Branson, S., Van Horn, G., Belongie, S. & Perona, P. Bird species categorization using pose normalized deep convolutional nets. ArXiv Preprint ArXiv:1406.2952. (2014).
103. 103. Zhang, H., Xu, T., Elhoseiny, M., Huang, X., Zhang, S., Elgammal, A., et al. Spda-cnn: Unifying semantic part detection and abstraction for fine-grained recognition. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 1143–1152 (2016).
104. 104. Krause, J., Jin, H., Yang, J. & Fei-Fei, L. Fine-grained recognition without part annotations. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 5546–5555 (2015).
105. 105. Wang, D., Shen, Z., Shao, J., Zhang, W., Xue, X. & Zhang, Z. Multiple granularity descriptors for fine-grained categorization. Proceedings Of The IEEE International Conference On Computer Vision. pp. 2399–2406 (2015).
106. 106. Jaderberg M., Simonyan K., Zisserman A. & Others Spatial transformer networks. Advances In Neural Information Processing Systems. 28 (2015).
* View Article
* Google Scholar
107. 107. Xiao, T., Xu, Y., Yang, K., Zhang, J., Peng, Y. & Zhang, Z. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 842–850 (2015).
108. 108. Liu, X., Xia, T., Wang, J., Yang, Y., Zhou, F. & Lin, Y. Fully convolutional attention networks for fine-grained recognition. ArXiv Preprint ArXiv:1603.06765. (2016).
109. 109. Simon, M. & Rodner, E. Neural activation constellations: Unsupervised part model discovery with convolutional networks. Proceedings Of The IEEE International Conference On Computer Vision. pp. 1143–1151 (2015).
110. 110. Zheng, H., Fu, J., Mei, T. & Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. Proceedings Of The IEEE International Conference On Computer Vision. pp. 5209–5217 (2017).
111. 111. Fu, J., Zheng, H. & Mei, T. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 4438–4446 (2017).
Citation: Ullah MA, Zia T, Kim J, Kadry S (2024) An inherently interpretable deep learning model for local explanations using visual concepts. PLoS ONE 19(10): e0311879. https://doi.org/10.1371/journal.pone.0311879
About the Authors:
Mirza Ahsan Ullah
Roles: Conceptualization, Data curation, Formal analysis, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
E-mail: [email protected] (MAU); [email protected] (JK)
Affiliations: Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan, Department of Software Engineering, University of Gujrat, Gujrat, Pakistan
ORCID: https://orcid.org/0000-0003-2467-0503
Tehseen Zia
Roles: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Supervision, Validation, Visualization, Writing – review & editing
Affiliation: Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
Jungeun Kim
Roles: Data curation, Funding acquisition, Investigation, Resources, Visualization, Writing – review & editing
E-mail: [email protected] (MAU); [email protected] (JK)
Affiliation: Department of Computer Engineering, Inha University, Incheon, Republic of Korea
Seifedine Kadry
Roles: Formal analysis, Investigation, Project administration, Validation, Writing – review & editing
Affiliations: Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon, Department of Applied Data Science, Noroff University College, Kristiansand, Norway
© 2024 Ullah et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
Over the past decade, deep learning has become the leading approach for various computer vision tasks and decision support systems. However, the opaque nature of deep learning models raises significant concerns about their fairness, reliability, and the underlying inferences they make. Many existing methods attempt to approximate the relationship between low-level input features and outcomes. However, humans tend to understand and reason based on high-level concepts rather than low-level input features. To bridge this gap, several concept-based interpretable methods have been developed. Most of these methods compute the importance of each discovered concept for a specific class. However, they often fail to provide local explanations. Additionally, these approaches typically rely on labeled concepts or learn directly from datasets, leading to the extraction of irrelevant concepts. They also tend to overlook the potential of these concepts to interpret model predictions effectively. This research proposes a two-stream model called the Cross-Attentional Fast/Slow Thinking Network (CA-SoftNet) to address these issues. The model is inspired by dual-process theory and integrates two key components: a shallow convolutional neural network (sCNN) as System-I for rapid, implicit pattern recognition and a cross-attentional concept memory network as System-II for transparent, controllable, and logical reasoning. Our evaluation across diverse datasets demonstrates the model’s competitive accuracy, achieving 85.6%, 83.7%, 93.6%, and 90.3% on CUB 200-2011, Stanford Cars, ISIC 2016, and ISIC 2017, respectively. This performance outperforms existing interpretable models and is comparable to non-interpretable counterparts. Furthermore, our novel concept extraction method facilitates identifying and selecting salient concepts. These concepts are then used to generate concept-based local explanations that align with human thinking. Additionally, the model’s ability to share similar concepts across distinct classes, such as in fine-grained classification, enhances its scalability for large datasets. This feature also induces human-like cognition and reasoning within the proposed framework.