Full Text

Turn on search term navigation

Introduction

Medical image segmentation is a crucial but difficult issue in the domain of image analysis for medicine, aiming to delineate anatomic structures and organs with distinct forms, looks, or varying degrees of disease at the pixel scale.^[¹^] Computer-aided techniques for segmentation are capable of automating batch processing more quickly and correctly than the laborious and prone-to-mistake manual segmentation performed by medical professionals.

Traditional medical image segmentation algorithms often rely on manually designed features by experts and make heuristic assumptions.^[²^] Although these methods enhance the performance of automatic segmentation to a certain extent by extracting different types of pixel and region features, their generalization capabilities for complex scenes and different organs or lesions are poor. Convolutional neural network (CNN) approaches have shown remarkable efficacy for medical image segmentation in the past few years, mostly due to the advancements in deep learning of computer vision. Fully convolutional neural network (FCN)^[³^] are among the earliest methods for segmenting images. Because of its outstanding efficiency and scaling, the U-Net framework for segmenting biomedical images developed by Ronnebreger et al.^[⁴^] has progressively come to be used for medical image segmentation. With a top-down encoder route, it creates hierarchical representations of the features of the images; a bottom-up decoder path maps the learned feature representations back to the initial resolution for pixel-by-pixel classification. Several improvements based on U-Net have been included in later work to improve the network's functionality, including incorporating attention mechanisms,^[⁵^] redesigning skip connection paths,^[⁶^] replacing backbone modules,^[⁷^] etc. While CNN-based techniques are exceptionally good at representing local features, they are not as good at modeling distant relationships due to convolutional restrictions, and there is still room for improvement in terms of global feature collection.

Dosovitskiy et al.^[⁸^] developed Vision Transformer (ViT), a multi-head self-attention mechanism that enables the network to efficiently collect distant relationships, as a solution to CNN's drawbacks. The area of medical image segmentation now uses a lot of Transformer-based techniques, which are primarily separated into pure Transformer architectures (such as Swin-Unet,^[⁹^] MissFormer,^[¹⁰^] TransCeption,^[¹¹^] etc.) and CNN-Transformer hybrid architecture (such as TransUNet,^[¹²^] TransFuse,^[¹³^] HiFormer,^[¹⁴^] etc.). Compared with the common problem of weak local representation in pure Transformer models,^[¹⁵^] utilizing CNN's and Transformer's complementary capabilities, the CNN-Transformer combination models medical images’ local and global knowledge.^[¹⁶^] However, there are still certain issues with these hybrid designs, such as poor ability to interact with local-global features and multi-scale and multi-level features are not fully exploited and utilized. To address the above problems, we propose FI-Net, a feature-interaction-based network for medical image segmentation. It effectively interactively fuses local-global features from the CNN-Transformer dual-stream encoder, and fully mines the latent semantic information in medical images through the interaction between multi-scale and multi-level features. Our key contributions are specifically as follows: 1) A novel encoder-decoder network structure called FI-Net, which depends on CNN-Transformer, is suggested for the segmentation of medical images. Within the Transformer branch, a dynamically sparse RBIFormer Block is utilized to collect distant dependencies in medical images, while a dual-stream encoder is employed to collect local details and global context. At the same time, boundary prior knowledge is introduced in the boundary-guided decoder (BGD) to guide the decoding process and enhance boundary learning capabilities to obtain more accurate segmentation results; 2) To improve the preservation of local information and global semantic knowledge in medical images, minimize the interference of unnecessary background noise, and employ channel attention to direct local-global features for effective interactive fusion, an attentional feature fusion module (AFFM) is proposed; 3) A multi-scale feature aggregation module (MFAM) is proposed to mine more semantic details to refine segmentation results by aggregating local information, capturing multi-scale context, and modeling inter-channel relationships; and 4) A multi-level feature bridging module (MFBM) is proposed to assist feature interaction by bridging multi-level features and mask information, and extracting multi-scale feature information in medical images, thereby fully mining potential semantic information.

Related Work

The methods used in medical image segmentation in the past few years, including Transformer-based and CNN-based methods, are presented briefly in this section.

CNN-Based Methods

CNN have been extensively employed in the area of medical image segmentation in recent times. Different from traditional medical image segmentation algorithms, CNN can effectively learn different feature representations and extract prior knowledge from medical image datasets. The seminal U-shaped encoder-decoder network design U-Net was proposed by Ronnebreger et al.^[⁴^] in 2015, drawing inspiration from the fully convolutional network (FCN).^[³^] For medical image segmentation, this is the initial end-to-end network. In the domain of medical image segmentation, it has steadily evolved into a design paradigm because of its outstanding results in comparison to other kinds of models. To provide a gating signal, Oktay et al.^[⁵^] inserted an attention gate into the skip connection of U-Net. This led to the proposal of Attention Unet, which emphasizes the attention of features at various locations in space to capture significant characteristics. Gu et al.^[¹⁷^] introduced CE-Net, whose context extractor module is made up of residual multi-core pooling blocks and dense atrous convolution blocks, to retain spatial details and collect higher-level knowledge. By utilizing deep supervision and full-scale skip links, Huang et al.^[¹⁸^] proposed UNet 3+ provides accurate organ segmentation at different scales. Jha et al.^[¹⁹^] proposed DoubleU-Net, which is a superimposed combination of two U-Net network structures. The first U-Net is used to transfer the pre-trained model for feature transfer, and spatial pyramid pooling is used by the other U-Net network to gather contextual data. The outstanding local feature representation capabilities given by convolution have allowed CNN-based medical image segmentation approaches to advance significantly, but they also restrict the capacity to describe long-range relationships in the latent space in medical images.

Transformer-Based Methods

To address the shortcomings of the CNN approach in global representation, the ViT^[⁸^] was proposed. It models long-range interdependence using the multi-head self-attention (MHSA) method. The earliest model to integrate Transformer with U-Net for medical image segmentation requirements is TransUNet, presented by Chen et al.^[¹²^] By merging the shifting window multi-head self-attention in Swin Transformer, Cao et al.^[⁹^] originally presented the purely Transformer structure Swin-Unet for medical image segmentation. Azad et al.^[¹¹^] suggested a Transformer-only U-shaped network TransCeption that improved feature fusion by using context bridges and inception-like modules in the encoder. In their proposal of a hierarchical encoder-decoder network, MISSFormer, Huang et al.^[¹⁰^] showed the potent capacity to capture additional discriminative relationships and context in medical image segmentation by utilizing an improved Transformer block and a context bridging block. TransFuse proposed by Zhang et al.^[¹³^] implements a shallow network architecture with parallel Transformer-CNN branches to model long-term relationships and underlying details and performs well on multiple medical segmentation datasets. Utilizing a CNN-based encoder and the Swin Transformer module, HiFormer^[¹⁴^] created two multiple-scale representations of features and combined local and global information using a double-layer fusion module. Lin et al.^[²⁰^] proposed convolution, transformer, and operator by rethinking the role of boundary detection in medical image segmentation. Inspired by the above studies, we propose the FI-Net model. It extracts and interacts with features using a dual-stream encoder based on CNN-Transformer, preserves local and global semantic information in medical images to the greatest degree possible, and employs prior knowledge to direct the process of learning in the boundary-guided decoder step-by-step.

Experimental Section

Overall Architecture

Figure 1 displays FI-Net's general design. According to the encoder-decoder architecture, it is U-shaped. The dual-stream encoder, AFFM, MFAM, MFBM, and BGD are the five core components of FI-Net. Parallel CNN and Transformer branches make up the dual-stream encoder. In the Transformer branch, we use a computational and memory-friendly RBIFormer Block to extract global features in medical images. AFFM fully exploits the local-global feature dependencies in medical images through the interactive fusion of dual-branch coding features. MFAM and MFBM respectively use multi-branch strip convolution and multi-level feature interaction to fully mine multi-scale features in medical images. Using boundary previous information as a guide, BGD decodes each layer one by one to get the final forecast. We go into further depth about these techniques in the sections that follow.

[IMAGE OMITTED. SEE PDF]

Dual-Stream Encoder

The Transformer Stream

In Transformer Stream we use the mainstream four-stage pyramid structure. To decrease the input spatial resolution while expanding the number of channels, overlapping patch embeddings are employed in the first stage, while patch merging modules are utilized in the second through fourth stages. Then 2, 2, 8, and 2 RBIFormer blocks are stacked in each stage to perform feature conversion and extraction. The sizes of the generated feature maps ( $g^{i} \left(\right. i = 1 , 2 , 3 , 4 \left.\right)$ ) are $\frac{H}{4} \times \frac{W}{4} \times C$ , $\frac{H}{8} \times \frac{W}{8} \times 2 C$ , $\frac{H}{16} \times \frac{W}{16} \times 4 C$ , and $\frac{H}{32} \times \frac{W}{32} \times 8 C$ respectively. The RBIFormer block consists of a relative position coding unit (RPCU), bi-level routing attention (BRA),^[²¹^] and inverse residual feedforward network (IRFFN),^[²²^] as seen in Figure 2.

[IMAGE OMITTED. SEE PDF]

Given a 2D input feature map $X \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;reals;^{\text{H} \times \text{W} \times \text{C}}$ , we use RPCU containing 3 × 3 depth convolutions to implicitly encode relative position information to provide local relationships and structural information inside the patch, and it is described as:1 $\text{RPCU} \left(\right. X \left.\right) = DWConv \left(\right. X \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; X$

Among them, DWConv(·) represents depth-wise convolution.

Next, we used a dynamic sparse attention BRA implemented through dual-layer routing to achieve more flexible computing allocation and content awareness, making it dynamic and query-aware sparse, thereby reducing the global impact of multi-head self-attention. The huge computational burden and memory usage are caused by token interaction. To partition the 2D input feature map $X \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;reals;^{\text{H} \times \text{W} \times \text{C}}$ into $S \times S$ non-overlapping sections with $\frac{\text{HW}}{S^{2}}$ feature vectors in each, first reshape $X$ to $X^{\text{r}} \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;reals;^{\left(\text{S}\right)^{2} \times \frac{\text{HW}}{\left(\text{S}\right)^{2}} \text{&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;times;C}}$ . Then through linear mapping, you can get query, key, and value tensor $Q , K , V \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;reals;^{\left(\text{S}\right)^{2} \times \frac{\text{HW}}{\left(\text{S}\right)^{2}} \text{&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;times;C}}$ :2 $Q = X^{\text{r}} W^{\text{q}} , K = X^{\text{r}} W^{\text{k}} , V = X^{\text{r}} W^{\text{v}}$

Among them, the relevant projection weights for query, key, and value are $W^{\text{q}} , W^{\text{k}} , W^{\text{v}} \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;reals;^{\text{C&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;times;C}}$ .

Next, by building a directed graph, we determine the attention links between regions. In particular, we apply each region mean on $Q$ and $K$ , respectively, to produce regional-level queries and keys $Q^{\text{r}} , K^{\text{r}} \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;reals;^{\left(\text{S}\right)^{2} \text{&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;times;C}}$ . The adjacency matrix $A^{r} \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;reals;^{\left(\text{S}\right)^{2} \left(\text{&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;times;S}\right)^{2}}$ of the region-to-region affinity network is then obtained by multiplying the matrix between the transposes of $Q^{\text{r}}$ and $K^{\text{r}}$ :3 $A^{\text{r}} = Q^{\text{r}} \left(\left(\right. K^{\text{r}} \left.\right)\right)^{\text{T}}$

The adjacency matrix $A^{\text{r}}$ has entries that quantify the semantic relatedness between two locations. The primary action we take after that is to keep each region's $\text{top&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;hyphen;} k$ closest-connected regions exclusively. In particular, using row-wise $\text{top&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;hyphen;} k$ operators, we create a routing index matrix $I^{\text{r}} \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Nopf;^{\left(\text{S}\right)^{2} \text{&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;times;k}}$ :4 $I^{\text{r}} = \text{ topkIndex } \left(\right. A^{\text{r}} \left.\right)$

Consequently, the indices of the k most pertinent areas in the i-th region are contained in the i-th row, $I^{\text{r}}$ . Applying fine-grained token-to-token attention is possible by using the region-to-region routing index matrix $I^{\text{r}}$ . It will combine all key-value pairs and concentrate on k routing zones indexed by $I_{\left(\right. i , 1 \left.\right)}^{\text{r}} , I_{\left(\right. i , 2 \left.\right)}^{\text{r}} , \hdots , I_{\left(\right. i , \text{k} \left.\right)}^{\text{r}}$ for each query token in zone i. Thus, gather the value tensor and key first, that is:5 $K^{\text{g}} = gather \left(\right. K , I^{\text{r}} \left.\right) , V^{\text{g}} = gather \left(\right. V , I^{\text{r}} \left.\right)$

Among them, the value and key tensor are concentrated $K^{\text{g}} , V^{\text{g}} \in &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;reals;^{\left(\text{S}\right)^{2} \times \frac{\text{kHW}}{\left(\text{S}\right)^{2}} \text{&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;times;C}}$ . Next, we apply attention to $K^{\text{g}} , V^{\text{g}}$ what has been gathered.6 $O = Attention \left(\right. Q , K^{\text{g}} , V^{\text{g}} \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; LCE \left(\right. V \left.\right)$ where LCE(·) represents the local context enhancement term, parameterized using a depthwise convolution with kernel size 5.

Our inverse residual feedforward network (IRFFN), which is composed of an expansion layer, a depth-wise convolution, and a projection layer, outperforms the reverse residual block by shifting the location of the shortcut link, as follows:7 $IRFFN \left(\right. X \left.\right) = Conv \left(\right. &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr; \left(\right. Conv \left(\right. X \left.\right) \left.\right) \left.\right)$ 8 $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr; \left(\right. X \left.\right) = DWConv \left(\right. X \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; X$

Activation layers and batch normalization are omitted. Local details may be extracted with a negligible additional processing cost by using depth-wise convolution. As a result, the RBIFormer block's processing flow may be written as follows:9 $Y_{i} = RPCU \left(\right. X_{i - 1} \left.\right)$ 10 $Z_{i} = BRA \left(\right. LN \left(\right. Y_{i} \left.\right) \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; Y_{i}$ 11 $X_{i} = IRFFN \left(\right. LN \left(\right. Z_{i} \left.\right) \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; Z_{i}$

Among them, layer normalization is represented by LN, while the output characteristics of the RPCU and BRA modules of the i-th block are represented by $Y_{i}$ and $Z_{i}$ , respectively.

The CNN Stream

To capture local feature dependencies, we have opted to use the robust and effective Res2Net^[²³^] as the backbone of the CNN stream. Features map $f^{i} \left(\right. i = 1 , 2 , 3 , 4 \left.\right)$ with spatial resolutions of $H / 4 \times W / 4$ , $H / 8 \times W / 8$ , $H / 16 \times W / 16$ and $H / 32 \times W / 32$ are produced by its convolution stem and four residual blocks. The Transformer encoder's feature representation is enhanced by these feature maps’ rich spatial information and contextual semantics.

Attentional Feature Fusion Module

Figure 3a illustrates the unique structure of the AFFM that we propose to obtain an improved interactive fusion of local-global attributes obtained from the Transformer encoder and CNN encoder.

[IMAGE OMITTED. SEE PDF]

First, we employ a convolutional component made up of 1 × 1 convolution, batch normalization (BN), and ReLU function to unify the number of channels for the feature maps $g^{i} , f^{i} \left(\right. i = 1 , 2 , 3 , 4 \left.\right)$ from the Transformer and CNN encoder and combine global and local features through element-wise summation operations. Second, we apply channel attention to compute the weight vector to re-weight local-global information and decrease the influence of unnecessary background noise, therefore making greater use of the most important feature channels.^[²⁴^] For the combined feature $m^{i} \left(\right. i = 1 , 2 , 3 , 4 \left.\right)$ , we employ a shared multi-layer perceptron following these two pooling procedures, in addition to using max pooling to retain more information and excite the feature channels. The sigmoid algorithm receives the summation of their findings, which yields the channel attention coefficient α. Finally, to reduce the distraction of unimportant noises in the background, we used fusion weights $1 - \alpha$ and α to perform element-wise multiplication on $g^{i} , f^{i}$ and then summed them to obtain the final fusion feature $F^{i}$ .^[²⁵^] It successfully combines local and global contextual information at the available spatial resolution. The overall operation process is as follows:12 $m^{i} = Conv \left(\right. g^{i} \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; Conv \left(\right. f^{i} \left.\right)$ 13 $\alpha = &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\sigma} \left{\right. &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\text{MLP}} \left(\right. &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\text{GAP}} \left(\right. m^{i} \left.\right) \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\text{MLP}} \left(\right. &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\text{GMP}} \left(\right. m^{i} \left.\right) \left.\right) \left.\right}$ 14 $F^{i} = \left(\right. 1 - \alpha \left.\right) \bigotimes \text{Conv} \left(\right. g^{i} \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; \alpha \bigotimes \text{Conv} \left(\right. f^{i} \left.\right)$

Among them, Conv(·) represents 1 × 1 convolution, BN, and ReLU activation, and the symbols for the MLP operator are $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\text{MLP}}$ , global average pooling is $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\text{GAP}}$ , global maximum pooling is $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\text{GMP}}$ , sigmoid activation is $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\sigma}$ , and elementwise multiplication is $\bigotimes$ .

Multi-scale Feature Aggregation Module

To improve the segmentation outcomes, we suggest a new Multi-scale Feature Aggregation Module (MFAM), the particulars of which are displayed in Figure 3b. Three components make up MFAM: Spatial and Channel Reconstruction Convolution (SCConv)^[²⁶^] for aggregating local data, deep strip convolution with several branches^[²⁷^] to capture context on multiple scales, and utilizing a 1 × 1 convolution to model the interchannel interaction. The 1 × 1 convolution's output is then used as attention weights to adjust the MFAM input. The processing flow of MFAM can be expressed as:15 $Attention = Conv \left(\right. \sum_{i = 0}^{3} \left(Scale\right)_{i} \left(\right. SCConv \left(\right. F \left.\right) \left.\right) \left.\right)$ 16 $\text{MFAM(} F \left.\right) = Attention \bigotimes F$

Among them, F is the input feature representation, Attention represents the attention map, Conv(·) represents 1 × 1 convolution, SCConv(·) represents 5 × 5 Spatial and Channel Reconstruction Convolution, and $\left(Scale\right)_{i} \left(\right. i = 0 , 1 , 2 , 3 \left.\right)$ represents the four branches in Figure 3b.

Using SCConv instead of standard convolution to aggregate local information can reduce redundant features and enhance feature representation while reducing complexity and computational costs. We imitate typical depth convolutions with huge kernels using two depth strip convolutions in every branch of the multi-branch depth strip convolution. It can simulate large kernels while ensuring low cost to enhance the network's pixel classification capabilities. The convolution kernel sizes of each branch are set to 7, 13, and 15 respectively.

Multi-level Feature Bridging Module

Because the collection and application of multi-scale features is essential for medical image segmentation,^[²⁸^] we have developed the Multi-level Feature Bridging Module (MFBM). As seen in Figure 4, its input is composed of three components: mask information, high-level features, and low-level features. Bilinear interpolation and depthwise separable convolution are first used to modify the dimension of the high-level features such that they match the low-level features. Following a division of the two features into four distinct groups across the channel dimension, four sets of mixed features are obtained by channel-concatenating low-level and high-level attributes, and mask information. Subsequently, information is extracted from various receptive fields using soft attention and four parallel branches of dilated convolutions with a kernel size of 3 and a dilation rate of $\left{\right. 1 , 2 , 5 , 7 \left.\right}$ . The final output of $F_{\text{M}}^{i} \left(\right. i = 1 , 2 , 3 , 4 \left.\right)$ is obtained by connecting the four feature sets along the dimension of the channel and applying standard convolution with a kernel size of 1 to produce interactions among characteristics of different scales.

[IMAGE OMITTED. SEE PDF]

Boundary Guided Decoder

To improve the model's capacity for boundary learning, we suggest a boundary-guided decoder (BGD) and employ the boundary characteristics that the boundary detection operator collected as prior knowledge to direct the decoding procedure of learning.^[²⁰^] Initially, the object's boundary information is extracted by filtering out extraneous information using shallow features $F^{i} \left(\right. i = 1 , 2 \left.\right)$ , which include a lot of border texture knowledge. The gradient map $M_{X}^{i} , M_{Y}^{i}$ is obtained by applying the Sobel operator in both the horizontal and vertical directions. It is then normalized using the sigmoid function and fused with the feature map that was provided to produce the boundary feature map $F_{\text{e}}^{i}$ :17 $F_{\text{e}}^{i} = F^{i} \bigotimes &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Fscr;_{\sigma} \left(\right. &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;Copf; \left(\right. M_{X}^{i} , M_{Y}^{i} \left.\right) \left.\right)$

Among them, ℂ represents channel series operation. Next, we perform bilinear upsampling on $F_{\text{e}}^{2}$ to match the feature map size of $F_{\text{e}}^{1}$ , and apply a 1 × 1 convolution operation to correspond with the number of channels. Ultimately, we utilize two layers of convolution and concatenate these two maps of features across the channel dimension to produce the comprehensive border feature $F_{\text{b}}$ . Following that, the decoder's learning process is guided by the comprehensive boundary feature $F_{\text{b}}$ , which serves as prior expertise. The upsampling features of the previous stage decoder and the output $F_{\text{M}}^{i}$ of the MFBM are summed element by element to obtain $F_{\text{m}}^{i}$ . After concatenating it with $F_{\text{b}}$ in the channel dimension, it is input into a convolutional layer composed of Conv, BN, and ReLU to acquire the final result.

Loss Function

To derive forecasts from decoder characteristics at multiple levels, we employ deep supervision skills, training the model with weighted functions of Binary Cross Entropy (BCE) Loss and Dice similarity coefficient (Dice) Loss. Thus, the following is an expression for the total loss function $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lagran;_{\text{Dice}}$ :18 $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lagran; = \sum_{i = 0}^{4} \left(\right. \alpha &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lagran;_{\text{BCE}} &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;plus; \left(\right. 1 - \alpha \left.\right) &#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lagran;_{\text{Dice}} \left.\right)$

Among them, $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lagran;_{\text{BCE}}$ is BCE Loss, $&#x00026;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lagran;_{\text{Dice}}$ is Dice Loss, and the weight balance parameter α, which has a value of 0.6.

Experiment and Results

Datasets

We conducted experiments on seven public datasets across different tasks and modalities to evaluate FI-Net, including the Skin lesion segmentation dataset (ISIC 2018),^[²⁹^] Colon polyp endoscopy dataset (Kvasir-SEG),^[³⁰^] the 2019 Kidney Tumor Segmentation Challenge dataset (KiTS19),^[³¹^] the 2017 Liver Tumor Segmentation Challenge dataset (LiTS17),^[³²^] the 2012 Prostate Segmentation Challenge dataset (PROMISE12),^[³³^] the breast ultrasound images dataset (BUSI)^[³⁴^] and the Shenzhen chest X-ray dataset (SZ-CXR).^[³⁵^] Among them, ISIC 2018 consists of 2594 dermoscopic images and 1000 images of gastrointestinal polyps and the segmentation masks that go with them are included in Kvasir-SEG, KiTS19 contains 210 CT images with kidney and tumor annotations, LiTS17 contains 131 abdominal liver CT scans and PROMISE12 contains 50 transverse T2-weighted magnetic resonance imaging (MRI) images. In addition, the BUSI dataset consists of 780 breast ultrasound images (437 benign cases, 210 malignant cases, and 133 normal cases), in which benign and malignant cases have corresponding lesion segmentation masks, and SZ-CXR consists of 566 chest X-ray images with lung segmentation masks. Considering that the aspect ratio of most dermoscopic images is approximately 3:4, the images in ISIC 2018 were all resampled to 224 × 320 pixels. While all remaining dataset images and masks are resized to 256 × 256. Using a 7:1:2 ratio, we divided the dataset at random for training, validation, and testing. At the same time, five-fold cross-validation is employed to guarantee that the influence of randomness is minimized.

Implementation Details and Evaluation Indicators

The PyTorch framework is used to build the suggested model, and an NVIDIA GeForce RTX 4090 24 GB is used for training. The network is optimized using the Adam optimizer with an initial learning rate of 1e-3. The network is then tweaked using the cosine annealing learning rate scheduler, and data improvement techniques like random rotation and horizontal flipping are applied.

Four assessment measures were employed in our trials using the Kvasir-SEG, ISIC 2018, BUSI, and SZ-CXR datasets to assess the model as a whole: Dice Similarity Coefficient (DI), Jaccard Index (JA), Accuracy (AC), and 95% Hausdorff Distance (95HD). In the experiments on LiTS17, KiTS19, and PROMISE12 datasets, we used four indicators: DI, Relative Volume Difference (RVD), Average Surface Distance (ASD), and Volumetric Overlap Error (VOE) to conduct an overall evaluation of the proposed network.

Evaluation Results

Results of Skin Lesion Segmentation

Ten CNN or Transformer-based models created especially for medical image segmentation are compared to the FI-Net network, including U-Net,^[⁴^] R2U-Net,^[³⁶^] Attention Unet,^[⁵^] CE-Net,^[¹⁷^] UNet 3+,^[¹⁸^] Swin-Unet,^[⁹^] TransUNet,^[¹²^] TransFuse,^[¹³^] HiFormer^[¹⁴^] and TransCeption.^[¹¹^] The quantitative analysis outcomes for the ISIC 2018 dataset are displayed on the left side of Table 1. As can be shown, our FI-Net outperforms UNet 3+ and TransCeption in terms of DI by 1.0% and 0.9%, respectively. Regarding assessment metrics like JA, AC, and 95HD, FI-Net is also better than other comparison models, reaching scores of 90.8%, 97.5% and 0.962 mm. A comparative analysis of the FI-Net visual segmentation results is presented in Figure 5.

Table 1 Quantitative comparison results on ISIC 2018 and Kvasir-SEG datasets.

Methodsa)	ISIC 2018	Kvasir-SEG
DI	JA	AC	95HD	DI	JA	AC	95HD
CNN	U-Net	0.916 ± 0.008	0.803 ± 0.014	0.952 ± 0.009	8.962 ± 1.620	0.886 ± 0.005	0.818 ± 0.003	0.934 ± 0.006	7.796 ± 0.942
R2U-Net	0.921 ± 0.005	0.836 ± 0.013	0.956 ± 0.006	5.237 ± 0.930	0.911 ± 0.003	0.841 ± 0.008	0.946 ± 0.011	5.529 ± 0.527
Attention Unet	0.937 ± 0.004	0.856 ± 0.006	0.965 ± 0.003	3.016 ± 0.380	0.927 ± 0.005	0.866 ± 0.007	0.960 ± 0.009	3.654 ± 0.467
CE-Net	0.933 ± 0.002	0.860 ± 0.009	0.960 ± 0.005	3.304 ± 0.379	0.909 ± 0.002	0.857 ± 0.006	0.952 ± 0.004	4.756 ± 0.102
UNet 3+	0.941 ± 0.001	0.877 ± 0.004	0.967 ± 0.003	2.092 ± 0.108	0.927 ± 0.002	0.876 ± 0.005	0.965 ± 0.001	2.837 ± 0.134
Transformer	Swin-Unet	0.914 ± 0.009	0.840 ± 0.007	0.959 ± 0.004	3.849 ± 0.411	0.899 ± 0.009	0.824 ± 0.002	0.943 ± 0.004	7.078 ± 1.837
TransUNet	0.932 ± 0.004	0.846 ± 0.011	0.961 ± 0.006	3.680 ± 0.448	0.916 ± 0.010	0.860 ± 0.007	0.954 ± 0.006	4.368 ± 1.332
TransFuse	0.940 ± 0.005	0.872 ± 0.010	0.968 ± 0.004	2.691 ± 0.438	0.922 ± 0.002	0.869 ± 0.002	0.959 ± 0.007	3.097 ± 0.107
HiFormer	0.941 ± 0.001	0.874 ± 0.008	0.967 ± 0.002	2.371 ± 0.203	0.935 ± 0.003	0.878 ± 0.004	0.964 ± 0.006	2.463 ± 0.204
TransCeption	0.942 ± 0.002	0.880 ± 0.006	0.969 ± 0.003	1.517 ± 0.343	0.935 ± 0.002	0.872 ± 0.003	0.961 ± 0.005	3.341 ± 0.015
Ours	0.951 ± 0.001	0.908 ± 0.004	0.975 ± 0.003	0.962 ± 0.110	0.942 ± 0.002	0.890 ± 0.005	0.967 ± 0.004	1.621 ± 0.108

[IMAGE OMITTED. SEE PDF]

Results of Polyp Segmentation

The Kvasir-SEG dataset's quantitative analysis outcomes are displayed in the right part of Table 1. The experimental findings demonstrate the effectiveness of our FI-Net in medical image segmentation. In terms of DI, JA, and 95HD, they reached 94.2%, 89.0%, and 1.621 mm respectively, which were 0.7%, 1.2%, and 0.842 mm higher than the second-best performance respectively. Furthermore, the results of FI-Net's qualitative analysis on polyp segmentation are displayed in Figure 6. Compared with competing models, our method can better capture multi-scale polyps, suppress interference caused by background noise and boundary blur, and obtain more accurate segmentation masks.

[IMAGE OMITTED. SEE PDF]

Results of Kidney Segmentation

In the comparative experiments of kidney, liver and prostate segmentation, we also selected ten methods based on CNN or Transformer for comparison, including U-Net,^[⁴^] UNet++,^[⁶^] Attention Unet,^[⁵^] CE-Net,^[¹⁷^] DeepLabv3+,^[³⁷^] Swin-Unet,^[⁹^] TransFuse,^[¹³^] MISSFormer,^[¹⁰^] HiFormer^[¹⁴^] and TransCeption^[¹¹^] The KiTS19 dataset's quantitative analysis results are displayed on the left part of Table 2. Stronger segmentation results are obtained with the method we are using in terms of DI (94.5%), ASD (1.133 mm), RVD (1.6%), and VOE (5.9%). It is noteworthy that FI-Net performs better than the second-most competitive model, TransCeption, which is a pure Transformer network. There is an improvement of 0.768 mm in ASD and significant benefits in DI, RVD, and VOE. The comparison model on KiTS19 and the visual segmentation results obtained from our method are displayed in Figure 7.

Table 2 Quantitative comparison results on KiTS19, LiTS17, and PROMISE12 datasets.

Methodsa)	KiTS19	LiTS17	PROMISE12
DI	ASD	RVD	VOE	DI	ASD	RVD	VOE	DI	ASD	RVD	VOE
CNN	U-Net	0.911 ± 0.002	5.350 ± 1.993	0.069 ± 0.004	0.089 ± 0.013	0.933 ± 0.006	4.452 ± 1.262	0.034 ± 0.010	0.119 ± 0.010	0.871 ± 0.010	2.638 ± 1.382	0.081 ± 0.002	0.244 ± 0.008
UNet++	0.919 ± 0.002	4.369 ± 1.014	0.058 ± 0.011	0.081 ± 0.016	0.936 ± 0.004	3.677 ± 0.043	0.028 ± 0.001	0.116 ± 0.005	0.886 ± 0.007	1.952 ± 0.042	0.070 ± 0.007	0.226 ± 0.004
Attention Unet	0.925 ± 0.005	3.424 ± 0.452	0.045 ± 0.013	0.079 ± 0.009	0.952 ± 0.012	2.394 ± 0.321	0.027 ± 0.012	0.092 ± 0.007	0.905 ± 0.006	1.590 ± 0.373	0.053 ± 0.002	0.175 ± 0.012
CE-Net	0.929 ± 0.009	3.290 ± 0.414	0.044 ± 0.006	0.078 ± 0.021	0.947 ± 0.001	2.795 ± 0.708	0.030 ± 0.014	0.106 ± 0.016	0.901 ± 0.004	1.747 ± 0.205	0.061 ± 0.004	0.201 ± 0.008
DeepLabv3+	0.934 ± 0.010	2.787 ± 0.624	0.033 ± 0.005	0.072 ± 0.004	0.958 ± 0.003	2.159 ± 0.530	0.019 ± 0.006	0.096 ± 0.017	0.916 ± 0.006	1.405 ± 0.531	0.041 ± 0.009	0.149 ± 0.016
Transformer	Swin-Unet	0.913 ± 0.003	4.010 ± 1.913	0.047 ± 0.005	0.083 ± 0.022	0.940 ± 0.005	3.015 ± 0.396	0.032 ± 0.005	0.113 ± 0.014	0.891 ± 0.013	1.740 ± 0.291	0.065 ± 0.009	0.196 ± 0.012
TransFuse	0.931 ± 0.004	2.986 ± 0.195	0.035 ± 0.006	0.079 ± 0.005	0.945 ± 0.007	2.866 ± 1.024	0.026 ± 0.006	0.109 ± 0.020	0.902 ± 0.008	1.722 ± 0.171	0.063 ± 0.008	0.189 ± 0.013
MISSFormer	0.933 ± 0.006	2.832 ± 1.016	0.038 ± 0.011	0.076 ± 0.015	0.949 ± 0.003	2.565 ± 0.447	0.024 ± 0.004	0.101 ± 0.003	0.919 ± 0.003	1.381 ± 0.246	0.043 ± 0.007	0.138 ± 0.006
HiFormer	0.936 ± 0.003	2.584 ± 0.048	0.032 ± 0.015	0.071 ± 0.002	0.960 ± 0.005	2.114 ± 0.472	0.021 ± 0.006	0.095 ± 0.008	0.910 ± 0.015	1.605 ± 0.153	0.058 ± 0.003	0.186 ± 0.015
TransCeption	0.938 ± 0.007	1.901 ± 0.095	0.028 ± 0.002	0.067 ± 0.017	0.956 ± 0.009	2.251 ± 0.218	0.023 ± 0.009	0.094 ± 0.010	0.911 ± 0.004	1.459 ± 0.484	0.047 ± 0.005	0.154 ± 0.011
Ours	0.945 ± 0.001	1.133 ± 0.125	0.016 ± 0.003	0.059 ± 0.015	0.967 ± 0.002	1.933 ± 0.142	0.013 ± 0.003	0.089 ± 0.007	0.926 ± 0.003	1.223 ± 0.105	0.033 ± 0.004	0.132 ± 0.005

[IMAGE OMITTED. SEE PDF]

Results of Liver Segmentation

The proposed FI-Net and rival networks’ quantitative analysis results are displayed in the center region of Table 2. It can be seen that it still performs well on the LiTS17 dataset and achieves more advanced segmentation results in liver segmentation. Compared with DeepLabv3+, HiFormer, and TransCeption, which performed the second-best in different indicators, FI-Net improved by 0.7%, 0.181 mm, 0.6%, and 0.5% in DI, ASD, RVD, and VOE respectively. Figure 8 is the qualitative comparison result of FI-Net and other models. Our suggested approach may provide more precise outlines and capture finer features.

[IMAGE OMITTED. SEE PDF]

Results of Prostate Segmentation

The rating indicator performance of FI-Net on the PROMISE12 dataset is displayed in the right part of Table 2. In comparison, our method performs better in segmentation for DI (92.6%), ASD (1.223 mm), RVD (3.3%), and VOE (13.2%). In terms of DI, compared with DeepLabv3+, MISSFormer, HiFormer, and TransCeption, they have improved by 1.0%, 0.7%, 1.6% and 1.5% respectively. The comparison model and the FI-Net prostate segmentation results are shown in Figure 9. It shows that the prostate boundaries generated by FI-Net are more consistent with the real situation than those generated by competing models, proving that prostate segmentation accuracy can be increased using our model's architecture.

[IMAGE OMITTED. SEE PDF]

Results of Breast Lesion Segmentation

Similarly, in the comparative experiments of breast lesion segmentation and lung segmentation, we also selected ten methods based on CNN or Transformer for comparison, including U-Net, UNet++, Attention Unet, CE-Net, DeepLabv3+, Swin-Unet, TransFuse, MISSFormer, HiFormer and TransCeption. The left side of Table 3 shows the quantitative segmentation performance results of our method and comparison model on the breast lesion dataset BUSI. It can be seen that our method achieved the highest performance score on all indicators. Compared with the second-best performing TransCeption, FI-Net improved by 1.1%, 1.6%, 0.8%, and 1.494 mm in DI, JA, AC, and 95HD respectively. Figure 10 shows the visual segmentation results of our method and the comparison model. It can be seen that for breast ultrasound images, the lesions segmented by FI-Net are closest to the real situation, and there are fewer mis-segments.

Table 3 Quantitative comparison results on BUSI and SZ-CXR datasets.

Methodsa)	BUSI	SZ-CXR
DI	JA	AC	95HD	DI	JA	AC	95HD
CNN	U-Net	0.845 ± 0.051	0.734 ± 0.079	0.940 ± 0.036	20.079 ± 9.714	0.935 ± 0.023	0.878 ± 0.041	0.958 ± 0.015	8.321 ± 3.265
UNet++	0.855 ± 0.028	0.747 ± 0.042	0.950 ± 0.033	15.148 ± 7.397	0.941 ± 0.020	0.890 ± 0.036	0.965 ± 0.009	6.475 ± 1.812
Attention Unet	0.861 ± 0.057	0.759 ± 0.088	0.965 ± 0.012	13.638 ± 4.365	0.947 ± 0.021	0.899 ± 0.037	0.967 ± 0.010	5.749 ± 2.159
CE-Net	0.869 ± 0.056	0.771 ± 0.089	0.962 ± 0.025	14.305 ± 6.818	0.956 ± 0.016	0.916 ± 0.029	0.976 ± 0.009	4.906 ± 1.864
DeepLabv3+	0.873 ± 0.048	0.777 ± 0.076	0.966 ± 0.029	11.461 ± 6.205	0.960 ± 0.012	0.922 ± 0.022	0.975 ± 0.008	4.297 ± 1.732
Transformer	Swin-Unet	0.856 ± 0.039	0.751 ± 0.059	0.964 ± 0.024	14.705 ± 7.759	0.944 ± 0.028	0.896 ± 0.050	0.966 ± 0.016	6.036 ± 3.263
TransFuse	0.867 ± 0.038	0.767 ± 0.058	0.970 ± 0.019	13.134 ± 6.329	0.949 ± 0.023	0.904 ± 0.041	0.970 ± 0.013	5.256 ± 2.388
MISSFormer	0.871 ± 0.045	0.774 ± 0.070	0.971 ± 0.033	10.442 ± 6.830	0.953 ± 0.011	0.911 ± 0.021	0.970 ± 0.007	5.301 ± 1.783
HiFormer	0.876 ± 0.037	0.780 ± 0.059	0.974 ± 0.017	10.600 ± 4.553	0.963 ± 0.020	0.930 ± 0.036	0.977 ± 0.011	4.005 ± 1.870
TransCeption	0.882 ± 0.047	0.791 ± 0.072	0.975 ± 0.024	9.720 ± 5.403	0.962 ± 0.008	0.926 ± 0.016	0.978 ± 0.005	3.852 ± 0.592
Ours	0.893 ± 0.021	0.807 ± 0.035	0.983 ± 0.010	8.226 ± 3.453	0.973 ± 0.003	0.947 ± 0.005	0.987 ± 0.003	2.443 ± 0.823

[IMAGE OMITTED. SEE PDF]

Results of Lung Segmentation

The right side of Table 3 is the quantitative analysis results on the SZ-CXR dataset. FI-Net achieved performance scores of 97.3%, 94.7%, 98.7%, and 2.443 mm in four performance indicators (DI, JA, AC, and 95HD) respectively, which are better than other comparison models. Compared with the competitive DeepLabv3+, HiFormer, and TransCeption, our model has improved by 2.5%, 1.7%, and 2.1% respectively in terms of JA and 1.854, 1.562, and 1.409 mm in terms of 95HD. At the same time, the qualitative segmentation results of FI-Net and the comparison model are shown in Figure 11. It can be seen that in chest X-ray images, FI-Net has stronger anti-interference ability, and its discrimination ability when facing confusing boundaries is significantly better than other comparison models.

[IMAGE OMITTED. SEE PDF]

Ablation Study

Ablation Study of Key Components

We performed several ablation experiments using the ISIC 2018 and BUSI datasets to thoroughly illustrate the efficacy of each critical component in our FI-Net. In Table 4, we compare the performance of various variants of FI-Net: 1) CNN: using CNN encoder only; 2) +Transformer: using CNN and Transformer dual-stream encoder; 3) +AFFM: Add Attentional Feature Fusion Module; 4) +MFAM: Add Multi-scale Feature Aggregation Module; 5) +MFBM: Add Multi-level Feature Bridging Module; 6) +BGD: Add boundary guidance decoder.

Table 4 Ablation study for key components on ISIC 2018 and BUSI datasets.

CNN	Transformer	AFFM	MFAM	MFBM	BGD	ISIC 2018	BUSI
DI	JA	AC	95HD	DI	JA	AC	95HD
√						0.923 ± 0.008	0.816 ± 0.012	0.955 ± 0.007	7.675 ± 2.157	0.856 ± 0.014	0.749 ± 0.022	0.951 ± 0.018	16.825 ± 5.045
√	√					0.932 ± 0.006	0.842 ± 0.013	0.961 ± 0.002	4.697 ± 1.626	0.864 ± 0.039	0.762 ± 0.059	0.962 ± 0.025	13.878 ± 5.853
√	√	√				0.937 ± 0.005	0.854 ± 0.007	0.963 ± 0.005	3.378 ± 0.742	0.869 ± 0.024	0.769 ± 0.037	0.973 ± 0.014	11.641 ± 3.950
√	√	√	√			0.943 ± 0.003	0.870 ± 0.005	0.968 ± 0.004	2.211 ± 0.764	0.876 ± 0.041	0.782 ± 0.063	0.977 ± 0.022	11.281 ± 3.274
√	√	√	√	√		0.947 ± 0.004	0.882 ± 0.011	0.970 ± 0.002	1.587 ± 0.570	0.889 ± 0.035	0.803 ± 0.056	0.980 ± 0.008	9.905 ± 3.472
√	√	√	√	√	√	0.951 ± 0.001	0.908 ± 0.004	0.975 ± 0.003	0.962 ± 0.110	0.893 ± 0.021	0.807 ± 0.035	0.983 ± 0.010	8.226 ± 3.453

By adding each key component to the ISIC 2018 dataset, JA's performance rose by 2.6%, 1.2%, 1.6%, 1.2%, and 2.5% respectively. On the BUSI dataset, with the addition of various components, JA's performance improved by 1.3%, 0.7%, 1.3%, 2.1%, and 0.4% respectively. The experimental results fully show that each key component we designed has an impact on the final performance, and the segmentation performance of medical images is improved from various angles. Figure 12 shows the role of each component in ablation research more intuitively on the ISIC 2018 dataset.

[IMAGE OMITTED. SEE PDF]

Ablation Study of Components Micro-Design

To investigate the optimal design approach, we performed ablation studies on the Transformer backbone, MFAM, MFBM, and BGD micro-designs using the ISIC 2018 dataset. Table 5 displays the outcomes of the experiment. In Table 5a, five mainstream Transformer backbones are used to compare with our method, including Swin,^[³⁸^] DeiT,^[³⁹^] PVT,^[⁴⁰^] CSWin,^[⁴¹^] and CrossFormer.^[⁴²^] Our backbone design has been seen to have improved by 2.4% and 0.471 mm in JA and 95HD, respectively, as compared to the next most competitive CSWin. Table 5b,c show the results of the exploration of the effectiveness of each part of the MFAM and MFBM components respectively. It is evident that every component of MFAM adds to the overall performance, and the mask information in MFBM plays a key role in guiding feature fusion. Table 5d shows how features are combined in BGD to generate more accurate boundary guidance. Comprehensive features have a better-guiding impact than single characteristics, as may be observed.

Table 5 Ablation study for components micro-design on ISIC 2018 dataset.

(a) Ablation study for Transformer Backbone
Methods	DI	JA	AC	95HD
Swin	0.945 ± 0.008	0.875 ± 0.008	0.968 ± 0.004	2.102 ± 0.647
DeiT	0.941 ± 0.005	0.866 ± 0.006	0.967 ± 0.002	2.842 ± 1.262
PVT	0.946 ± 0.004	0.878 ± 0.002	0.970 ± 0.004	1.558 ± 0.182
CSWin	0.948 ± 0.002	0.884 ± 0.005	0.971 ± 0.007	1.433 ± 0.134
CrossFormer	0.947 ± 0.003	0.876 ± 0.003	0.970 ± 0.001	1.597 ± 0.479
Ours	0.951 ± 0.001	0.908 ± 0.004	0.975 ± 0.003	0.962 ± 0.110
(b) Ablation study for MFAM
Multi-branch Conv	1 × 1 Conv	Attention	SCConv	DI	JA	AC	95HD
	√	√		0.941 ± 0.003	0.869 ± 0.009	0.966 ± 0.001	2.170 ± 1.071
√		√		0.943 ± 0.005	0.874 ± 0.005	0.969 ± 0.003	1.864 ± 0.704
√	√			0.944 ± 0.002	0.879 ± 0.004	0.969 ± 0.002	1.574 ± 0.369
√	√	√		0.947 ± 0.003	0.896 ± 0.006	0.971 ± 0.001	1.257 ± 0.344
√	√	√	√	0.951 ± 0.001	0.908 ± 0.004	0.975 ± 0.003	0.962 ± 0.110
(c) Ablation study for MFBM
Mask	Dilation Conv	Soft Attention	DI	JA	AC	95HD
	√	√	0.948 ± 0.003	0.889 ± 0.007	0.970 ± 0.004	1.511 ± 0.213
√		√	0.946 ± 0.002	0.890 ± 0.005	0.971 ± 0.009	1.491 ± 0.202
√	√		0.947 ± 0.004	0.896 ± 0.008	0.974 ± 0.002	1.235 ± 0.146
√	√	√	0.951 ± 0.001	0.908 ± 0.004	0.975 ± 0.003	0.962 ± 0.110
(d) Ablation study for BGD
$F^{1}$	$F^{2}$	DI	JA	AC	95HD
√		0.948 ± 0.003	0.894 ± 0.002	0.972 ± 0.004	1.217 ± 0.438
	√	0.949 ± 0.004	0.895 ± 0.007	0.972 ± 0.008	1.204 ± 0.307
√	√	0.951 ± 0.001	0.908 ± 0.004	0.975 ± 0.003	0.962 ± 0.110

Feature Map Visualization

We used feature map visualization on the ISIC 2018 dataset, as illustrated in Figure 13, to more naturally identify the efficacy of Dual-Stream Encoder and AFFM. As may be observed, the Transformer branch records global dependencies, but the CNN branch records more precise local detail. Consequently, AFFM offers improved local and global feature representation for medical images by efficiently extracting local details and global semantic information in medical images to the fullest degree possible.

[IMAGE OMITTED. SEE PDF]

Discussion

The requirement for accurate medical image analysis (preoperative evaluation, auxiliary diagnosis) has increased along with improvements in medical standards and public health awareness.^[¹⁰^] Segmenting medical images is an important step. Robust and accurate segmentation findings will lay a solid basis for further processing and analysis.^[⁴³^] However, pixel-level segmentation of organs and lesions in medical images has long been a major difficulty in the field of smart medical care, as medical images typically suffer from issues including high-density noise, low contrast, and blurring edges.

Medical image segmentation approaches based on convolutional neural networks (CNN), represented by U-Net,^[⁴^] have gained exceptional performance and impressive outcomes in recent years due to the rapid growth of deep learning in the field of computer vision. However, CNN-based methods have limitations when it comes to representing long-range relationships because of the intrinsic locality of convolution.^[⁸^] Transformer-based methods also have the drawback of having weak local representations, despite their ability to encode shape representations and capture long-range dependencies.^[¹²^] As a result, several medical image segmentation methods that combine the benefits of Transformer and CNN have started to appear.^[¹³^] These methods have demonstrated exceptional performance in a range of medical image segmentation tasks. These methods, however, typically overlook the interactive information between various features in medical images and do not deeply mine multi-scale and local-global features in these images.

We propose FI-Net, a U-Net-like network for medical image segmentation, as a solution to the above issues. To extract rich local-global context information, it rethinks feature interaction information in medical image segmentation and employs a dual-stream encoder based on CNN and Transformer. The three main contributions of FI-Net are the introduction of three novel modules, namely AFFM, MFAM, and MFBM. To maximize the preservation of both local details and global semantic information in medical images, AFFM employs channel attention to steer local-global features for efficient interactive fusion. MFAM mines more semantic details by aggregating local information, capturing multi-scale context, and modeling inter-channel relationships to further refine segmentation results. To effectively mine latent semantic information during the feature interaction process, MFBM bridges and extracts multi-scale feature information from medical images using multi-level features and Mask information. FI-Net conducted sufficient experiments on seven medical image segmentation tasks in six different imaging modes to explore its effectiveness, including skin lesion segmentation in dermoscopic images, colon polyp segmentation in endoscopic images, kidney and liver organ segmentation in CT images, prostate organ segmentation in MR images, and breast lesion segmentation in ultrasound images and lung organ segmentation in X-ray images. The corresponding quantitative evaluation results (Table 1–3) and qualitative evaluation results (Figure 5–11) fully demonstrate the superiority and robustness of FI-Net. Compared with other SOTA methods, FI-Net can achieve more accurate segmentation edge details and better performance scores. This demonstrates FI-Net's potential applicability in various clinical settings and its adaptability and effectiveness in handling various medical imaging segmentation challenges. At the same time, FI-Net can provide accurate and reliable shape information for the accurate segmentation of a variety of medical images, providing valuable guidance to clinicians, and thus contributing to many further clinical applications such as disease diagnosis, quantitative analysis, and surgical planning.

However, our approach still has some limitations. First, our model shows excellent segmentation performance on seven different modalities and different organ datasets, but it is based on 2D segmentation results and lacks support for 3D medical image segmentation tasks. Second, since manual annotation of medical images is an expensive and time-consuming task, this results in a limited amount of available image training data, limiting the peak performance of the model and potentially leading to overfitting problems.^[¹⁴^] Finally, to enhance the clinical friendliness of the model, it is necessary to study more lightweight networks with stronger feature extraction capabilities while maintaining the segmentation performance of the model. Therefore, in future work, we will draw inspiration from the latest 3D medical image segmentation research^[⁴⁴^] and transform the structure of existing models by utilizing techniques such as anisotropic 3D convolution, 3D convolution decomposition, and prompt. This enables it to better utilize the three-dimensional spatial information between slices, thereby enhancing its performance in medical image volume segmentation. Second, due to the high imaging cost and data privacy issues of medical images, publicly available large-scale labeled datasets are limited. We will therefore leverage large-scale unlabeled medical images through self-supervised learning to enhance model robustness and mitigate the risk of overfitting.^[⁴⁵^] Furthermore, we plan to build two models with different parameter scales and utilize knowledge distillation techniques^[⁴⁶^] to transfer effective compression features from the large model (teacher model) to the small model (student model). This ensures that the model performance is significantly improved while significantly reducing the number of parameters and the amount of calculations to improve the deployability of clinical equipment.

Conclusion

We present FI-Net, a CNN-Transformer hybrid network designed for medical image segmentation in this paper. Using distant dependencies modeling and local relationship modeling, it maintains both the global semantic aspects and local information of medical images. In our work, five useful components are proposed, including a dual-stream encoder that captures local-global information, AFFM that interactively fuses dual-branch coding features, MFAM and MFBM that use multi-branch strip convolution and multi-level features to interactively extract multi-scale features, and boundary-guided decoder that utilize boundary prior knowledge to guide the decoding process. The efficiency and development of FI-Net are demonstrated by the results of our experiments on seven datasets with various tasks and modalities. In future work, we plan to extend FI-Net to support 3D medical image segmentation tasks and combine self-supervised learning, knowledge distillation, and other technologies to alleviate the overfitting problem of limited data training and enhance the robustness of the model.

Acknowledgements

We sincerely thank all participants in the study. This work was supported by the grant from Hunan Provincial Natural Science Foundation of China (grant no. 2021JJ41026) and the Fundamental Research Funds for the Central Universities of Central South University.

Conflict of Interest

The authors declare no conflict of interest.

Author Contributions

Y.Z.L. and W.Y.J. conceived and supervised the study. D.Y.H., H.Y.B., and H.J.L. contributed to data collection and assembly. D.Y.H., L.J.H., and L.H.S. performed data analysis and interpretation. D.Y.H., Y.Z.L., and L.J.H. performed software, visualization and validation. All authors contributed to writing the manuscript. All authors reviewed and approved the final manuscript.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

J. Cheng, C. Gao, F. Wang, M. Zhu, in Int. Conf. on Medical Image Computing and Computer‐Assisted Intervention, Springer 2023, pp. 64–74.

P. Suetens, Fundamentals of Medical Imaging, Cambridge University Press 2017.

J. Long, E. Shelhamer, T. Darrell, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition 2015, pp. 3431–3440.

O. Ronneberger, P. Fischer, T. Brox, in Medical Image Computing and Computer‐Assisted Intervention–MICCAI 2015: 18th Int. Conf., Munich, Germany, October 5–9, 2015, Proc., Part III 18, Springer 2015, pp. 234–241.

O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al., (Preprint), arXiv:1804.03999, v1, submitted: Apr. 2018.

Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, J. Liang, IEEE Trans. Med. Imaging 2019, 39, 1856.

A. Karaali, R. Dahyot, D. J. Sexton, in Int. Conf. on Pattern Recognition and Artificial Intelligence, Springer, 2022, pp. 198–210.

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (Preprint), arXiv:2010.11929, v1, submitted: Oct. 2020.

H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, in European Conf. on Computer Vision, Springer 2022, pp. 205–218.

X. Huang, Z. Deng, D. Li, X. Yuan, Y. Fu, in IEEE Transactions on Medical Imaging 2022.

R. Azad, Y. Jia, E. K. Aghdam, J. Cohen‐Adad, D. Merhof (Preprint), arXiv:2301.10847, v1, submitted: Jan. 2023.

J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, Y. Zhou (Preprint), arXiv:2102.04306, v1, submitted: Feb. 2021.

Y. Zhang, H. Liu, Q. Hu, in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th Int. Conf., Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, Springer, 2021, pp. 14–24.

M. Heidari, A. Kazerouni, M. Soltany, R. Azad, E. K. Aghdam, J. Cohen‐Adad, D. Merhof, in Proc. of the IEEE/CVF Winter Conf. on Applications of Computer Vision 2023, pp. 6202–6212.

Y. Ding, Z. Yi, M. Li, J. Long, S. Lei, Y. Guo, P. Fan, C. Zuo, Y. Wang, Digital Health 2023, 9, 20552076231207197.

Y. Ding, Z. Yi, J. Xiao, M. Hu, Y. Guo, Z. Liao, Y. Wang, iScience 2024, 27, 109442.

Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, J. Liu, IEEE Trans. Med. Imaging 2019, 38, 2281.

H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.‐W. Chen, J. Wu, in ICASSP 2020 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE 2020, pp. 1055–1059.

D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, H. D. Johansen, in 2020 IEEE 33rd Int. Symp. on Computer‐Based Medical Systems (CBMS), IEEE 2020, pp. 558–564.

Y. Lin, D. Zhang, X. Fang, Y. Chen, K.‐T. Cheng, H. Chen, in Int. Conf. on Information Processing in Medical Imaging, Springer 2023, pp. 730–742.

L. Zhu, X. Wang, Z. Ke, W. Zhang, R. W. Lau, in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition 2023, pp. 10323–10333.

J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, C. Xu, in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition 2022, pp. 12175–12185.

S.‐H. Gao, M.‐M. Cheng, K. Zhao, X.‐Y. Zhang, M.‐H. Yang, P. Torr, IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652.

C. Dong, S. Xu, D. Dai, Y. Zhang, C. Zhang, Z. Li, Med. Image Anal. 2023, 85, 102745.

Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, K. Barnard, in Proc. of the IEEE/CVF Winter Conf. on Applications of Computer Vision, 2021, pp. 3560–3569.

J. Li, Y. Wen, L. He, in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition 2023, pp. 6153–6162.

M.‐H. Guo, C.‐Z. Lu, Q. Hou, Z. Liu, M.‐M. Cheng, S.‐M. Hu, Adv. Neural Inf. Process. Syst. 2022, 35, 1140.

J. Ruan, M. Xie, J. Gao, T. Liu, Y. Fu, in Int. Conf. on Medical Image Computing and Computer‐Assisted Intervention, Springer 2023, pp. 481–490.

N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, et al. (Preprint), arXiv:1902.03368, v1, submitted: Feb. 2019.

D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, H. D. Johansen, in MultiMedia Modeling: 26th Int. Conf., MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proc., Part II 26, Springer 2020, pp. 451–462.

N. Heller, F. Isensee, K. H. Maier‐Hein, X. Hou, C. Xie, F. Li, Y. Nan, G. Mu, Z. Lin, M. Han, G. Yao, Y. Gao, Y. Zhang, Y. Wang, F. Hou, J. Yang, G. Xiong, J. Tian, C. Zhong, J. Ma, J. Rickman, J. Dean, B. Stai, R. Tejpaul, M. Oestreich, P. Blake, H. Kaluzniak, S. Raza, J. Rosenberg, K. Moore, et al., Med. Image Anal. 2021, 67, 101821.

P. Bilic, P. Christ, H. B. Li, E. Vorontsov, A. Ben‐Cohen, G. Kaissis, A. Szeskin, C. Jacobs, G. E. H. Mamani, G. F. Lohöfer, J. W. Holch, W. Sommer, F. Hofmann, A. Hostettler, N. Lev‐Cohain, M. Drozdzal, M. M. Amitai, R. Vivanti, J. Sosna, I. Ezhov, A. Sekuboyina, F. Navarro, F. Kofler, J. C. Paetzold, S. Shit, X. Hu, J. Lipková, M. Rempfler, M. Piraud, et al., Med. Image Anal. 2023, 84, 102680.

G. Litjens, R. Toth, W. Van De Ven, C. Hoeks, S. Kerkstra, B. Van Ginneken, G. Vincent, G. Guillard, N. Birbeck, J. Zhang, R. Strand, F. Malmberg, Y. Ou, C. Davatzikos, M. Kirschner, F. Jung, J. Yuan, W. Qiu, Q. Gao, P. Edwards, B. Maan, F. Van Der Heijden, S. Ghose, J. Mitra, J. Dowling, D. Barratt, H. Huisman, A. Madabhushi, Med. Image Anal. 2014, 18, 359.

W. Al‐Dhabyani, M. Gomaa, H. Khaled, A. Fahmy, Data Brief 2020, 28, 104863.

S. Jaeger, S. Candemir, S. Antani, Y.‐X. J. Wáng, P.‐X. Lu, G. Thoma, Quant. Imaging Med. Surg. 2014, 4, 475.

M. Z. Alom, C. Yakopcic, M. Hasan, T. M. Taha, V. K. Asari, J. Med. Imaging 2019, 6, 014006.

L.‐C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, in Proc. of the European Conf. on Computer Vision (ECCV) 2018, pp. 801–818.

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, in Proc. of the IEEE/CVF Int. Conf. on Computer Vision 2021, pp. 10012–10022.

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, in Int. Conf. on Machine Learning, PMLR 2021, pp. 10347–10357.

W. Wang, E. Xie, X. Li, D.‐P. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, in Proc. of the IEEE/CVF Int. Conf. on Computer Vision 2021, pp. 568–578.

X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, B. Guo, in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition 2022, pp. 12124–12134.

W. Wang, L. Yao, L. Chen, B. Lin, D. Cai, X. He, W. Liu, in Int. Conf. on Learning Representations 2022.

T. Lei, R. Sun, Y. Wan, Y. Xia, X. Du, A. K. Nandi (Preprint), arXiv:2306.04086, v1, submitted: Jun. 2023.

H. Wang, S. Guo, J. Ye, Z. Deng, J. Cheng, T. Li, J. Chen, Y. Su, Z. Huang, Y. Shen, et al., (Preprint), arXiv:2310.15161, v1, submitted: Oct. 2023.

Y. Yan, R. Liu, H. Chen, L. Zhang, Q. Zhang, IEEE J. Biomed. Health Inf. 2023, 27, 4341.

D. Qin, J.‐J. Bu, Z. Liu, X. Shen, S. Zhou, J.‐J. Gu, Z.‐H. Wang, L. Wu, H.‐F. Dai, IEEE Trans. Med. Imaging 2021, 40, 3820.

Word count: 8805

Show less

© 2024. This work is published under https://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

To solve the problems of existing hybrid networks based on convolutional neural networks (CNN) and Transformers, we propose a new encoder–decoder network FI‐Net based on CNN‐Transformer for medical image segmentation. In the encoder part, a dual‐stream encoder is used to capture local details and long‐range dependencies. Moreover, the attentional feature fusion module is used to perform interactive feature fusion of dual‐branch features, maximizing the retention of local details and global semantic information in medical images. At the same time, the multi‐scale feature aggregation module is used to aggregate local information and capture multi‐scale context to mine more semantic details. The multi‐level feature bridging module is used in skip connections to bridge multi‐level features and mask information to assist multi‐scale feature interaction. Experimental results on seven public medical image datasets fully demonstrate the effectiveness and advancement of our method. In future work, we plan to extend FI‐Net to support 3D medical image segmentation tasks and combine self‐supervised learning and knowledge distillation to alleviate the overfitting problem of limited data training.

Details

Title

FI‐Net: Rethinking Feature Interactions for Medical Image Segmentation

Author

Ding, Yuhan¹; Liu, Jinhui²; He, Yunbo²; Huang, Jinliang²; Liang, Haisu²; Yi, Zhenglin²

; Wang, Yongjie³

¹ School of Computer Science and Engineering, Central South University, Changsha, China
² Departments of Urology, Xiangya Hospital, Central South University, Changsha, China
³ Department of Burns and Plastic Surgery, Xiangya Hospital, Central South University, Changsha, China, National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China

Section

Research Article

Publication year

2024

Publication date

Dec 1, 2024

Publisher

John Wiley & Sons, Inc.

e-ISSN

26404567

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/aisy.202400201

ProQuest document ID

3148352873

FI‐Net: Rethinking Feature Interactions for Medical Image Segmentation

Jump to:

Full Text

Abstract

Details

Suggested sources