Abstract

Medical report generation has made significant progress in recent years, but generated reports still suffer from poor readability, incomplete and inaccurate lesion descriptions, and difficulty in capturing fine-grained abnormalities. The primary obstacles are low image resolution, poor contrast, and substantial cross-modal discrepancies between visual and textual features. To address these challenges, we propose an Anomaly-Driven Cross-Modal Contrastive Network (ADCNet), which improves the quality and accuracy of medical report generation through effective cross-modal feature fusion and alignment. First, we design an anomaly-aware cross-modal feature fusion (ACFF) module that introduces an anomaly embedding vector to guide the extraction of anomaly-related features from visual representations, strengthening the ability of visual features to capture lesion-related abnormalities and improving feature fusion. Second, we propose a fine-grained regional feature alignment (FRFA) module that dynamically filters visual and textual features to suppress irrelevant information and background noise, then computes cross-modal relevance to align fine-grained regional features, improving semantic consistency between images and generated reports. Experimental results on the IU X-Ray and MIMIC-CXR datasets demonstrate that ADCNet significantly outperforms existing approaches, achieving notable improvements in natural language generation metrics as well as in the accuracy, completeness, and fluency of the generated reports.
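
The record contains no code, so the following is a minimal PyTorch sketch of how the two modules named in the abstract might be realized. The module names ACFF and FRFA come from the abstract; everything else (the use of a learnable anomaly embedding as cross-attention queries, sigmoid gating for the dynamic filtering, the similarity-based alignment loss, and all dimensions) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch of the ACFF and FRFA modules from the abstract.
# All internal design choices and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ACFF(nn.Module):
    """Anomaly-aware cross-modal feature fusion (sketch).

    A learnable anomaly embedding queries the visual features via
    cross-attention so the fused output emphasizes lesion-related
    regions (assumed mechanism)."""

    def __init__(self, dim: int = 512, num_anomaly_tokens: int = 8, heads: int = 8):
        super().__init__()
        # Learnable anomaly embedding vectors that act as queries.
        self.anomaly_embed = nn.Parameter(torch.randn(num_anomaly_tokens, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_patches, dim) patch features from the image encoder.
        b = visual.size(0)
        queries = self.anomaly_embed.unsqueeze(0).expand(b, -1, -1)
        # Anomaly tokens attend over visual patches to pool lesion cues.
        anomaly_feat, _ = self.cross_attn(queries, visual, visual)
        # Broadcast the pooled anomaly context onto each patch and fuse.
        ctx = anomaly_feat.mean(dim=1, keepdim=True).expand_as(visual)
        return self.fuse(torch.cat([visual, ctx], dim=-1))


class FRFA(nn.Module):
    """Fine-grained regional feature alignment (sketch).

    Gates out low-relevance visual/textual features, then scores the
    survivors with a cross-modal relevance (similarity) matrix."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.visual_gate = nn.Linear(dim, 1)
        self.text_gate = nn.Linear(dim, 1)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Dynamic filtering: down-weight background patches / filler tokens.
        v = visual * torch.sigmoid(self.visual_gate(visual))
        t = text * torch.sigmoid(self.text_gate(text))
        # Cross-modal relevance between every image region and every word.
        sim = F.normalize(v, dim=-1) @ F.normalize(t, dim=-1).transpose(1, 2)
        # Simplified contrastive-style alignment loss: each region should
        # match its most relevant word strongly relative to the rest.
        return -F.log_softmax(sim, dim=-1).max(dim=-1).values.mean()
```

In training, the FRFA term would presumably be added to the standard report-generation (cross-entropy) objective with some weighting; that weighting, like the rest of the sketch, is an assumption for illustration.
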

Details

Title
ADCNet: Anomaly-Driven Cross-Modal Contrastive Network for Medical Report Generation
Author
Liu, Yuxue 1; Zhang, Junsan 1; Liu, Kai 2; Tan, Lizhuang 3

1 Shandong Province Key Laboratory of Intelligent Oil & Gas Industrial Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China; [email protected]
2 State Key Laboratory of Space Network and Communications, Tsinghua University, Beijing 100084, China; [email protected]; Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
3 Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250014, China; [email protected]; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan 250014, China
First page
532
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3165774501
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.