1. Introduction

As the need for human–robot interaction grows, computer vision has increasingly embraced cross-modal fusion—a technique that combines different data types, such as visual and linguistic information. This interdisciplinary approach has given rise to various tasks, including referring image segmentation (RIS), visual question answering, and video text retrieval [1,2,3]. This paper specifically addresses RIS, a fundamental yet complex challenge within the multimodal domain. RIS requires a deep understanding of both visual and linguistic elements to segment specific instances within an image. Unlike other tasks, RIS demands a nuanced comprehension of the image’s content to identify and segment the region mentioned in a given textual description [4,5]. Typically, the region of interest in RIS is an object or substance within the image, and the accompanying text provides clues about the object’s action, category, color, location, and other attributes. The overall process is shown in Figure 1. This task requires a model that not only interprets the text but also correlates it with the visual elements of the image to accurately delineate the specified area [4,5,6].

Researchers have traditionally approached RIS by adapting standard image segmentation processes. This involves extracting features from both visual and textual inputs, followed by integrating these features to create a composite set of multimodal features that inform the prediction of segmentation masks [4,7,8]. Among them, Chen and Li et al. both adopted an iterative approach to refine the segmentation mask by gradually integrating visual and textual features [5,9], while Ding, Feng, and Hu et al. introduced an attentional mechanism to enhance the model’s focus on key visual regions and textual descriptions [10,11,12]. In addition, Liu and Margffoy-Tuay et al. emphasized the dynamic interaction of visual and textual information during the segmentation process [6,13], while Shi and Ye et al. focused on how to extract key information from textual descriptions to better integrate them with visual information [14,15]. Conventional models for image segmentation typically use dense binary classification networks to determine the membership of each pixel to the target object [16,17,18,19]. However, this pixel-centric approach often overlooks the relational structure between global pixels, which is crucial for accurate segmentation. To address this limitation, Jiang Liu and colleagues introduced PolyFormer [20], a groundbreaking Transformer-based sequence-to-sequence (seq2seq) framework. PolyFormer represents segmentation masks—not as dense binary classifications but as a collection of sparsely distributed polygonal vertices. These vertices trace the contours of the objects referenced in the textual descriptions, offering a more structurally aware approach to segmentation [21,22]. In this regard, Acuna and Castrejon et al. have used polygon vertex prediction methods for instance segmentation. These methods provide a more structured and sparse representation by predicting the vertices of an object’s contour instead of performing pixel-level classification [23,24]. 
This idea is further extended by Liang et al., who combine a mask generated by a segmentation network with a deformation network to optimize polygons, allowing them to fit object boundaries more accurately [25]. In addition, Xie et al. use a polar coordinate system to represent and predict the contours of objects, which performs well on objects with complex shapes and orientation changes [26]. PolyFormer’s prediction framework has led to significant performance improvements, achieving state-of-the-art (SOTA) results on the widely recognized RefCOCO [27], RefCOCO+ [27], and RefCOCOg [28] datasets. Despite these advancements, PolyFormer’s extensive parameter count and substantial training data requirements present new challenges, particularly for optimization on consumer-grade graphics cards. In addition, in some complex scenarios, PolyFormer detects the target correctly but produces an incorrect segmentation result, as shown in Figure 2.

To address these challenges, we developed a strategy that leverages the equivalent substitutability of a Kolmogorov–Arnold network (KAN) [29] and a multi-layer perceptron (MLP). Our solution builds upon the PolyFormer framework by integrating a KAN decoder branch, functioning as a classification head. This branch is structurally similar to the existing MLP decoder branch and utilizes the multimodal features generated by the same encoder. To enhance the KAN decoder branch’s performance, we introduced a multiscale feature fusion module, combining features extracted from the image and text encoders with those processed by the PolyFormer encoder. These multiscale features enrich the model’s ability to interpret and segment the image based on the textual description. At the same time, to minimize the training costs associated with the PolyFormer framework, we adopted a freeze–fine-tuning approach. This method involves freezing the encoders, which constitute a significant portion of PolyFormer’s parameters, and fine-tuning only the newly added KAN decoder branch. This strategy allows for efficient training while preserving the model’s foundational capabilities. Furthermore, to fully harness the potential of both MLP and KAN classification heads, we designed a dual-branch decoder architecture based on the PolyFormer framework, employing an ensemble learning strategy. By combining the predictions from both branches, we aim to maximize the model’s predictive accuracy and robustness in segmenting the image regions described in the text. While the dual-branch decoder module, enhanced by the KAN classification head, offers some improvement in segmentation accuracy, the impact is not as pronounced as desired. The limitation arises from the similarity of the features that the two decoder branches receive from the encoding stage, leading to minimal variance in their respective predictions.
To fully capitalize on the ensemble learning strategy, it is essential to incorporate a segmentation model that can effectively complement the outputs of PolyFormer. Recognizing the compatibility between our decoder’s output and the input requirements of SAM (segment anything model) [30], we devised an approach to enhance image segmentation. SAM, developed by Meta, is a prompt-based model known for its zero-shot segmentation capabilities, segmenting arbitrary images based on simple prompts such as points, boxes, or masks. Leveraging this compatibility, we introduced a SAM-based algorithmic framework tailored for seq2seq RIS. This framework uses our decoder’s output as input prompts for SAM, integrating SAM’s predictions with the outputs of our dual-decoder to enhance segmentation performance in RIS tasks.

To substantiate the efficacy of our SAM-complemented dual-decoder ensemble RIS framework, we conducted comprehensive experiments on the widely recognized public datasets: RefCOCO, RefCOCO+, and RefCOCOg. These experiments benchmarked our method against other leading models in the field, demonstrating that it achieves SOTA performance. Additionally, we conducted ablation experiments to gain a deeper understanding of the contributions of individual components within our framework. These experiments highlighted how each element contributes to the overall improvement in segmentation performance.

In this paper, our contributions are fourfold:

A novel hybrid framework for referring image segmentation: the dual-decoder model with SAM complementation improves on the referring image segmentation results of other SOTA models.
A novel dual-decoder framework with KAN is proposed to increase the prediction accuracy of the edge coordinate points of the segmentation target.
We propose a SAM-based completion module for referring image segmentation results, which further complements the segmentation predicted by our decoder.
We apply our framework to referring image segmentation on three open datasets, where it surpasses other state-of-the-art methods.

2. Related Work

2.1. Traditional Pixel-to-Pixel Segmentation Methods

Early research in referring image segmentation (RIS) predominantly utilized pixel-level classification techniques [16,17,19]. Convolutional neural networks (CNNs) were commonly employed to extract features from images, while recurrent neural networks (RNNs) or long short-term memory networks (LSTMs) were used to process and comprehend textual descriptions. These features were then fed into dense binary classification networks to determine the association of each pixel with the target object, a strategy aligned with the properties of convolutional operations [18].

For example, the recurrent multimodal interaction (RMI) model leverages the VGG [31] architecture for visual feature extraction and an LSTM for linguistic feature processing, followed by a fully convolutional network (FCN) for pixel-level classification [6]. Similarly, the recurrent refinement network (RRN) uses a ConvLSTM decoder to refine the segmentation of the target region progressively [5]. In recent years, with the rise of the Transformer, researchers have begun to process textual information with Transformer models. Works such as LAVT and SADLR adopt the Transformer architecture and introduce a cascading decoding paradigm. This approach progressively refines segmentation results, building upon initial predictions to achieve higher accuracy and better capture the nuances of the segmentation task [32,33]. In addition, Meta introduced a large vision model called SAM, known for its zero-shot segmentation performance, which can segment any image based on simple prompts such as points, boxes, or masks [30]. While these methods are effective in some cases, their reliance on pixel-to-pixel logic still overlooks the structural relationships among global pixels. This oversight limits their accuracy in complex visual scenes, particularly those with intricate details or overlapping objects.

2.2. Sequence-to-Sequence Segmentation Methods

Sequence-to-sequence (seq2seq) prediction methods [34], initially successful in natural language processing (NLP), have extended their impact to referring image segmentation (RIS). With the advent of the Transformer architecture, seq2seq modeling has become a powerful tool for tasks requiring sequence understanding and generation, such as language translation, text summarization, and RIS. Unlike traditional pixel-level prediction methods that focus on local patterns and individual pixels, seq2seq models capture the broader context and structure of the data [35]. They transform the image and text into a serialized form, effectively processed by the Transformer’s attention mechanisms, which consider the entire image during predictions.

Along these lines, Chen et al. proposed a single model architecture and loss function capable of handling diverse computer vision tasks, including object detection, instance segmentation, keypoint detection, and image captioning [36]. By formulating the output of each task as a sequence of discrete tokens, this method demonstrates that a unified interface can be effectively trained on multiple tasks without task-specific customization. Similarly, Wang et al. introduced a framework that supports a wide range of vision and language tasks within a simple seq2seq learning framework [37]. It achieves this by pretraining on a diverse set of tasks, including image generation, visual grounding, image captioning, image classification, and language modeling. However, while the above frameworks demonstrate the versatility of the seq2seq approach, they do not outperform traditional segmentation frameworks on more complex segmentation tasks.

In addition, PolyFormer exemplifies this approach in RIS, using a seq2seq framework to represent a segmentation mask as a sequence of polygonal vertices rather than a pixel grid [20]. PolyFormer outputs these vertices in an autoregressive manner, predicting each vertex based on the previous ones, resulting in a coherent and structured segmentation mask. While this approach captures the overall structure of the object more accurately, due to the complexity of the transformer architecture, it would consume a significant amount of computational resources to further improve the performance of the model through full parameter training.

3. Method

3.1. Overall Framework

In our research, we introduce a novel algorithmic framework that integrates dual-decoder with SAM complementation, as illustrated in Figure 3. This framework addresses the shortcomings of conventional referring image segmentation techniques, particularly when faced with intricate visual scenes. Additionally, it aims to overcome the challenges posed by current models, especially their parameter count and the volume of training data needed. The core innovation of this framework lies in enhancing the precision and reliability of image segmentation through the synergistic application of SAM and sequence-to-sequence methods, coupled with a dual-decoder architecture. The framework employs an ensemble learning strategy, allowing for a more holistic and nuanced understanding of the image and its associated text.

The process begins with a multimodal feature extraction encoder, which includes a visual encoder based on the Swin Transformer. This encoder efficiently extracts both local and global features from the image using its hierarchical structure and windowing mechanism. In parallel, a text encoder based on the BERT model captures the semantic information of the input text. These visual and textual features are fused through a fully connected layer and projection operation to form a unified multimodal feature representation, laying the groundwork for segmentation tasks. We implement a two-branch decoder structure with MLP and KAN as classification heads. The MLP decoder predicts continuous floating-point coordinates directly, avoiding quantization errors, while the KAN decoder enhances the model’s ability to capture multi-granularity information with a multiscale feature fusion module. The outputs from both decoders are merged through an ensemble learning strategy to improve segmentation accuracy. To further enhance segmentation performance under the constraint of limited model parameters, we introduce a SAM-based segmentation target complementation module. This module uses SAM’s zero-shot capability to refine the integration results of the dual-decoder. When the results of the dual-decoder are suboptimal, prompt data are fed into SAM via a judgment module and a positive sample point generation module. SAM then produces complementary results, which are refined by a noise processing module, enhancing the accuracy of the final segmentation.

This comprehensive approach significantly improves the accuracy and robustness of image segmentation by combining advanced visual and textual coding techniques with a dual-decoder architecture and an ensemble learning strategy, all complemented by SAM’s powerful segmentation capabilities.

3.2. Encoder

The proposed hybrid framework encoder consists of a visual encoder, a text encoder, and a multimodal Transformer encoder.

The visual encoder is built upon the Swin Transformer, a Transformer-based architecture adept at handling image processing tasks. It extracts image features via a hierarchical structure and a windowing mechanism, capturing both local and global information within an image. Specifically, we utilize the fourth layer of features extracted by the Swin Transformer as our visual representation, denoted as $F_{V}$ , with dimensions $H \times W \times C$ , where H and W represent the height and width of the feature map, and C is the channel depth.

The text encoder employs BERT, a pre-trained deep bidirectional Transformer widely used in natural language processing. BERT’s proficiency in extracting rich semantic features from text grants our model a nuanced comprehension of referential expressions. For the input linguistic description, BERT generates linguistic embedding features, denoted as $F_{L}$ , with dimensions $L \times C_{l}$ , where L is the length of the text and $C_{l}$ is the embedding dimension.

To effectively merge the visual and textual features, the image feature $F_{V}$ is first flattened into a sequence. It is then projected into the same embedding space as the text feature $F_{L}$ through a fully connected layer. The visual and textual features are mapped to a uniform dimension by learned matrices $W_{V}$ and $W_{L}$ , along with bias vectors $b_{V}$ and $b_{L}$ , yielding transformed features $F_{V}^{'}$ and $F_{L}^{'}$ . These are concatenated to form a comprehensive multimodal feature $F_{M}$ . $F_{M}$ is then processed by a multimodal Transformer encoder, which consists of N Transformer layers. Each layer is equipped with a multi-head self-attention mechanism, layer normalization, and a feed-forward network. Additionally, we incorporate absolute position encoding and relative position bias for both image and text features, ensuring that positional information is preserved throughout the encoding process.
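The projection-and-concatenation step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `fuse_features`, the shared width `C_m`, and the choice to concatenate along the sequence axis are assumptions made for the example.

```python
import numpy as np

def fuse_features(F_V, F_L, W_V, b_V, W_L, b_L):
    """Project visual and text features to a shared width and concatenate them.

    F_V: (H, W, C) visual feature map; F_L: (L, C_l) text embeddings.
    W_V: (C, C_m), W_L: (C_l, C_m); b_V, b_L: (C_m,). C_m is a hypothetical shared width.
    """
    H, W, C = F_V.shape
    F_V_flat = F_V.reshape(H * W, C)          # flatten the map into a sequence of H*W tokens
    F_V_proj = F_V_flat @ W_V + b_V           # (H*W, C_m)
    F_L_proj = F_L @ W_L + b_L                # (L, C_m)
    # Assumption: concatenation is along the sequence axis.
    return np.concatenate([F_V_proj, F_L_proj], axis=0)

rng = np.random.default_rng(0)
H, W, C, L, C_l, C_m = 4, 4, 8, 5, 6, 16
F_M = fuse_features(
    rng.normal(size=(H, W, C)), rng.normal(size=(L, C_l)),
    rng.normal(size=(C, C_m)), np.zeros(C_m),
    rng.normal(size=(C_l, C_m)), np.zeros(C_m),
)
print(F_M.shape)  # (21, 16): 16 visual tokens plus 5 text tokens
```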

3.3. Dual-Decoder

The decoder of our model is a two-branch structure featuring an MLP and a KAN as the classification heads. The MLP classification head is attached to a regression-based Transformer decoder that predicts continuous floating-point coordinates directly. This approach circumvents the quantization errors typical of methods that discretize coordinates, ensuring a more precise representation of the segmented object’s edges. The objective of this decoder is to predict a sequence of coordinates that delineate the vertices of the target polygon, effectively outlining the target object. It is composed of N Transformer decoder layers, each incorporating a multi-head self-attention mechanism, a multi-head cross-attention layer, and a feed-forward network. These components work in concert to model the relationships between the multimodal feature $F_{M}^{N}$ and the 2D coordinate embedding $e (x, y)$ . In contrast to traditional techniques, our framework, based on PolyFormer, uses a 2D coordinate codebook to accurately capture the embedding of any floating-point coordinate. The codebook is defined as $D \in R^{B_{H} \times B_{W} \times C_{e}}$ , where $B_{H}$ and $B_{W}$ denote the height and width of the coordinate grid, and $C_{e}$ is the embedding dimension. The decoder retrieves the precise coordinate embedding from surrounding grid points through bilinear interpolation, using the following formula:

$e (x, y) = \sum_{i = 1}^{B_{H}} \sum_{j = 1}^{B_{W}} D [i, j, :] \cdot w_{i} (x) \cdot w_{j} (y)$

Here $w_{i} (x)$ and $w_{j} (y)$ are the bilinear interpolation weights, which are nonzero only for the grid points adjacent to $(x, y)$ .
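As a rough illustration of the codebook lookup, the sketch below evaluates the sum above using only the four grid points around $(x, y)$, since all other interpolation weights vanish. The function name `coord_embedding`, the 0-based indexing, and the convention that $x$ indexes the first codebook axis are assumptions for the example.

```python
import numpy as np

def coord_embedding(D, x, y):
    """Bilinearly interpolate codebook D of shape (B_H, B_W, C_e) at continuous (x, y).

    Assumption: x and y are given in grid units, with x along the first axis
    and y along the second; indices are 0-based here.
    """
    B_H, B_W, _ = D.shape
    x = float(np.clip(x, 0, B_H - 1))
    y = float(np.clip(y, 0, B_W - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, B_H - 1), min(y0 + 1, B_W - 1)
    fx, fy = x - x0, y - y0                   # fractional offsets inside the grid cell
    return ((1 - fx) * (1 - fy) * D[x0, y0]
            + fx * (1 - fy) * D[x1, y0]
            + (1 - fx) * fy * D[x0, y1]
            + fx * fy * D[x1, y1])

# Toy codebook whose entries store their own grid indices, so the
# interpolated embedding should reproduce the query coordinate.
ii, jj = np.meshgrid(np.arange(4), np.arange(5), indexing="ij")
D = np.stack([ii, jj], axis=-1).astype(float)   # shape (4, 5, 2)
print(coord_embedding(D, 1.25, 2.5))
```

Because the four weights sum to one, the embedding varies continuously with the coordinate, which is what lets the decoder consume unquantized floating-point vertices.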

The MLP decoder branch ends with two prediction heads that yield the coordinates of the segmentation target’s edge points. Specifically, the final output from the last Transformer decoder layer is processed by two lightweight heads, both constructed with multi-layer perceptrons (MLPs). The class head is a two-layer MLP designed to output a token type. This type indicates whether the current output is a coordinate token, a separator token, or an end-of-sequence token, which is crucial for understanding the structure of the segmentation output. Simultaneously, the coordinate head is a three-layer feed-forward network that predicts the top-left and bottom-right coordinates of the target’s bounding box, as well as the 2D coordinates of the target’s edge vertices. This allows for a detailed and accurate delineation of the target object within the image.

The KAN decoder branch, while fundamentally similar to the MLP branch, introduces two main structural differences. Firstly, the prediction head uses a Kolmogorov–Arnold network (KAN) in place of the MLP, potentially offering a more nuanced understanding of the multimodal features. Secondly, the KAN branch incorporates a multiscale feature fusion module at the decoder’s input stage. This module merges the initial multimodal features extracted from the visual and text encoders with those from the multimodal Transformer encoder, providing a comprehensive feature set for the decoder to process. The KAN branch decoder is intended to rapidly complement the output from the MLP branch decoder, even when trained on a consumer graphics card. The multiscale feature fusion module allows the decoder to ingest unfused visual and textual features, capturing coarse-grained information and offering a prediction that differs from that of the MLP decoder branch. As shown in Figure 4, to further improve the performance of the framework with limited hardware resources, during training we use only the KAN decoder branch output as the final output of the framework, freezing the rest of the parameters and training only the KAN decoder branch. The primary structure of the KAN decoder is depicted in Figure 5. This design keeps the framework accurate yet efficient, capable of handling complex segmentation tasks with limited computational resources. The dual-branch strategy, with ensemble learning over the MLP and KAN decoders, enhances the segmentation performance of our framework.

3.4. Dual-Decoder Output Ensemble Module

Following the dual-branch decoder’s output, our framework employs an ensemble learning strategy to combine the results, treating the MLP branch’s output as the primary result and the KAN branch’s output as a supplementary one. This integration is designed to leverage the strengths of both branches and enhance the overall segmentation accuracy, as shown in Algorithm 1. The process begins by calculating the average Euclidean distance between the predicted boundary points of the segmentation target in each output. As the erroneous segmentations in Figure 2 illustrate, failed predictions tend to have small average point distances, so this value serves as an indicator of prediction quality. In the next step, the framework compares the average point distances of the two branches’ predictions. When the MLP output’s average point distance falls below the threshold N and is smaller than that of the KAN output, the KAN output, with its larger average point distance, is deemed to delineate the target’s edges better and is chosen as the final prediction; otherwise, the MLP output is kept. By preferring the output with the larger average point distance in these cases, the framework guards against collapsed predictions and remains robust to variations in image complexity and segmentation difficulty. At the same time, this output also affects the subsequent SAM complementation module.

Algorithm 1 Dual-branch decoder output integration module.

Input: Output_MLP, Output_KAN, N.
Output: Output_Final.
procedure Integrate(Output_MLP, Output_KAN)
    Calculate:
    Dis_MLP ← the average Euclidean distance of predicted boundary points from Output_MLP.
    Dis_KAN ← the average Euclidean distance of predicted boundary points from Output_KAN.
    if Dis_MLP < N and Dis_MLP < Dis_KAN then
        return Output_KAN ▹ Prioritize KAN output for better boundary delineation.
    else
        return Output_MLP ▹ Prioritize MLP output for better boundary delineation.
    end if
end procedure
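A small Python sketch of Algorithm 1 is given below. It assumes, for illustration only, that each decoder output is a closed polygon given as a list of (x, y) vertices and that the average point distance is taken between consecutive vertices; the helper name `mean_point_distance` is hypothetical.

```python
import numpy as np

def mean_point_distance(points):
    """Average Euclidean distance between consecutive vertices of a closed polygon.

    Assumption: "average point distance" is measured between neighbouring
    vertices, with the last vertex connected back to the first.
    """
    pts = np.asarray(points, dtype=float)
    diffs = np.roll(pts, -1, axis=0) - pts
    return float(np.linalg.norm(diffs, axis=1).mean())

def integrate(output_mlp, output_kan, n):
    """Return the KAN polygon only when the MLP polygon looks collapsed."""
    dis_mlp = mean_point_distance(output_mlp)
    dis_kan = mean_point_distance(output_kan)
    if dis_mlp < n and dis_mlp < dis_kan:
        return output_kan   # MLP points are tightly clustered; fall back to KAN
    return output_mlp       # otherwise keep the primary MLP prediction

large = [(0, 0), (10, 0), (10, 10), (0, 10)]   # average distance 10
small = [(0, 0), (1, 0), (1, 1), (0, 1)]       # average distance 1
print(integrate(small, large, n=4) is large)   # KAN output chosen
print(integrate(large, small, n=4) is large)   # MLP output kept
```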

3.5. SAM-Based Segmentation Completion Module

Given the constraints that prevent fine-tuning the full set of model parameters, our two-branch decoder integration occasionally produces subpar segmentation on certain datasets. To counter this, we integrate the segment anything model (SAM), leveraging its robust zero-shot capability to enhance the model’s segmentation outcomes. As depicted in Figure 6, the implementation involves a sequence of modules: a judgment module, a positive sample point generation module, the SAM module itself, and a final noise processing module. The SAM-based segmentation target complementation module begins by taking the decoder’s output to build the prompt data that SAM requires. The type of prompt data is decided by the average point distance of the input results: (1) If the average point distance is small, only the predicted bounding box coordinates are used as the SAM prompt data. (2) If the average point distance is large, randomly selected positive sample points within the segmentation target are used together with the predicted bounding box coordinates as the SAM prompt data.
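The prompt-selection rule can be sketched as follows. This is an illustrative reading rather than the authors' code: the dictionary layout of the prompt, the rejection sampling inside the bounding box, and the helper names `point_in_polygon` and `build_prompt` are all assumptions, and the call into SAM itself is left out.

```python
import numpy as np

def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def build_prompt(bbox, polygon, mean_dist, n, num_points=2, seed=0, max_tries=1000):
    """Choose box-only or box-plus-points prompt data from the average point distance."""
    prompt = {"box": tuple(bbox), "points": []}
    if mean_dist < n:
        return prompt                                  # small distance: box only
    rng = np.random.default_rng(seed)
    x_min, y_min, x_max, y_max = bbox
    for _ in range(max_tries):
        if len(prompt["points"]) == num_points:
            break
        x, y = rng.uniform(x_min, x_max), rng.uniform(y_min, y_max)
        if point_in_polygon(x, y, polygon):            # keep only points inside the target
            prompt["points"].append((float(x), float(y)))
    return prompt

polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(build_prompt((0, 0, 10, 10), polygon, mean_dist=10.0, n=4))
print(build_prompt((0, 0, 10, 10), polygon, mean_dist=1.0, n=4))
```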

However, these supplementary results from SAM may contain errors or noise, as observed in Figure 7. These errors often manifest as multiple prediction blocks. To address this, we have developed a block counting module that utilizes a breadth-first search algorithm to tally the number of segmentation blocks present in SAM’s output results. The decision-making process is as follows: (1) If the count of segmentation blocks exceeds a predefined threshold, the SAM result is discarded, and the two-branch decoder’s prediction is taken as the primary segmentation result. (2) If the count does not surpass the threshold, SAM’s output is merged with the two-branch decoder’s prediction to formulate a new, refined segmentation outcome.
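The block-counting step and the discard-or-merge decision can be illustrated with the sketch below. It assumes binary masks, 4-connectivity, and a pixel-wise union as the merge operation; none of these details are specified above, and the function names are hypothetical.

```python
from collections import deque
import numpy as np

def count_blocks(mask):
    """Count 4-connected foreground regions in a binary mask with breadth-first search."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    blocks = 0
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or visited[r, c]:
                continue
            blocks += 1                                # new, unvisited block found
            visited[r, c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] and not visited[nr, nc]:
                        visited[nr, nc] = True
                        queue.append((nr, nc))
    return blocks

def complete_mask(decoder_mask, sam_mask, block_threshold):
    """Discard a noisy SAM mask, or merge it with the decoder mask (union assumed)."""
    decoder_mask = np.asarray(decoder_mask, dtype=bool)
    if count_blocks(sam_mask) > block_threshold:
        return decoder_mask
    return np.logical_or(decoder_mask, np.asarray(sam_mask, dtype=bool))

sam = np.array([[1, 1, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 1]])
dec = np.array([[1, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0]])
print(count_blocks(sam))                               # 2
print(complete_mask(dec, sam, block_threshold=15).sum())
```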

This approach ensures that while SAM provides valuable complementary information, its integration is conditional on the quality of its output, thereby maintaining the integrity and accuracy of the final segmentation result. This method exemplifies our commitment to creating a flexible and adaptive framework that maximizes the strengths of multiple models while mitigating their weaknesses.

4. Experiment

4.1. Experimental Settings and Evaluation Indicators

4.1.1. Experimental Settings

The model in this paper is implemented in PyTorch. The input image size is $512 \times 512$ . The learning rate is set to 0.0001 for fine-tuning the KAN decoder branch, and the number of training epochs is 100. During training, only the parameters of the KAN decoder branch are updated, while the parameters of the remaining modules are frozen. In the testing phase, the mean Euclidean distance threshold is set to 4, the positive point generation value is set to 2, and the discard threshold of the SAM prediction noise module is set to 15. In particular, the mean Euclidean distance threshold is set according to the results in Figure 8 and corresponds to N in Algorithms 1 and 2. For the overall model architecture, we use Swin-B as the visual encoder and BERT as the text encoder. The Transformer encoder consists of six encoder layers, and the two-branch decoder of six decoder layers. Additionally, the ViT-B SAM model is used to correspond to the visual encoder.

Algorithm 2 Two-branch decoder integration with SAM.

Input: Decoder_output, Dis_output, N, Block_N.
Output: Final_mask.
procedure SAM(Decoder_output)
    Bbox_point ← Decoder_output
    if Dis_output < N then
        Use only Bbox_point for prompt_data
    else
        Pos_point ← Decoder_output
        Use Bbox_point and Pos_point for prompt_data
    end if
    SAM_output ← SAM(prompt_data) ▹ Run SAM with prompt data.
    blocks_num ← CountBlocks(SAM_output) ▹ Count prediction blocks.
    if blocks_num > Block_N then
        Final_mask ← Decoder_output
        return Final_mask ▹ Discard SAM results.
    else
        Final_mask ← Decoder_output + SAM_output
        return Final_mask ▹ Merge SAM results with decoder outputs.
    end if
end procedure

4.1.2. Evaluation Indicators

We use the mean intersection over union (mIoU) as the primary evaluation metric for referring image segmentation (RIS). For comparison with PolyFormer and other articles that use overall intersection over union (oIoU) results as an evaluation criterion, we also use oIoU as an evaluation metric for the model.
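The difference between the two metrics can be made concrete with a short sketch: mIoU averages the per-sample IoU, whereas oIoU divides the total intersection by the total union over all samples, so large objects weigh more heavily in oIoU. The function name `miou_oiou` and the toy masks are illustrative only.

```python
import numpy as np

def miou_oiou(preds, gts):
    """Return (mIoU, oIoU) for lists of binary prediction and ground-truth masks."""
    inters = np.array([np.logical_and(p, g).sum() for p, g in zip(preds, gts)], dtype=float)
    unions = np.array([np.logical_or(p, g).sum() for p, g in zip(preds, gts)], dtype=float)
    # Per-sample IoU; an empty union is scored as 0 to avoid division by zero.
    per_sample = np.where(unions > 0, inters / np.maximum(unions, 1), 0.0)
    miou = float(per_sample.mean())
    oiou = float(inters.sum() / unions.sum())
    return miou, oiou

# Sample 1: a one-pixel object segmented perfectly (IoU = 1.0).
gt_small = np.zeros((10, 10), dtype=bool); gt_small[0, 0] = True
pred_small = gt_small.copy()
# Sample 2: a 100-pixel object of which only half is recovered (IoU = 0.5).
gt_large = np.ones((10, 10), dtype=bool)
pred_large = np.zeros((10, 10), dtype=bool); pred_large[:5] = True

miou, oiou = miou_oiou([pred_small, pred_large], [gt_small, gt_large])
print(round(miou, 4), round(oiou, 4))   # 0.75 0.505
```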

4.2. Dataset

In this study, we aimed to evaluate our referring image segmentation (RIS) model against existing architectures by testing it on three prominent RIS datasets: RefCOCO, RefCOCO+, and RefCOCOg. The RefCOCO dataset is composed of 142,209 annotated textual descriptions associated with 50,000 objects across 19,994 images. Similarly, RefCOCO+ also contains a substantial number of annotations, with 141,564 expressions for 49,856 objects in 19,992 images. A distinguishing feature of RefCOCO+ is the absence of positional words in the descriptions, which increases the complexity of the segmentation task by requiring the model to infer spatial relationships without explicit cues. Additionally, “Test A” and “Test B” sets in RefCOCO and RefCOCO+ contain only people and only non-people respectively. Expanding on this, RefCOCOg introduces an additional layer of complexity with 85,474 reference expressions for 54,822 objects within 26,711 images. The expressions in RefCOCOg were sourced through Amazon Mechanical Turk, resulting in longer and more intricate descriptions, averaging 8.4 words per expression compared to the 3.5 words in RefCOCO and RefCOCO+.

4.3. Main Results

Table 1 and Table 2 compare the segmentation performance of various classical models on these datasets. Our framework surpasses current state-of-the-art (SOTA) models on almost all splits, which reflects the robustness and adaptability of our method in translating textual descriptions into precise image segmentations, in terms of both mean intersection over union (mIoU) and overall intersection over union (oIoU). On the RefCOCO dataset, our framework is slightly inferior to the recently introduced S²RM model on the Test A split but exceeds its scores on the val and Test B splits. Moreover, the S²RM model underperforms our model on all five splits of the more challenging RefCOCO+ and RefCOCOg datasets. More notably, our framework outperforms the S²RM model on all eight splits in terms of oIoU. This indicates that S²RM is slightly better on simple human segmentation data, whereas our framework is more robust in more comprehensive and complex referring image segmentation scenarios.

The RefCOCO dataset serves as a foundational benchmark, and our model’s success here lays a solid foundation for its performance on more complex datasets. On the RefCOCO+ and RefCOCOg datasets, which are known for their increased complexity due to the lack of positional words and the intricacy of the reference expressions, our model continues to perform well. It misses PolyFormer’s oIoU score by only 0.01% on the val split of RefCOCOg and exceeds the SOTA scores that PolyFormer has long held on the other splits. On the mIoU score, which PolyFormer focuses on more, our framework achieves better performance than PolyFormer on all eight splits.

In addition, as shown in Figure 9, Figure 10 and Figure 11, for the images in the RefCOCO, RefCOCO+, and RefCOCOg datasets that PolyFormer fails to segment accurately, our framework is not only able to complete the partial segmentation results but also able to accurately segment images whose initial segmentation was completely incorrect. This is a strong indicator of the generalization capability of our framework, which can adapt to the different levels of difficulty presented by different datasets. In summary, our research introduces a novel RIS model that sets new benchmarks for segmentation accuracy. Its performance across a range of datasets underscores its potential for real-world applications, where the accurate segmentation of images based on textual descriptions is crucial.

4.4. Ablation Studies

4.4.1. Modeled Structural Ablation Experiment

Table 3 details the incremental performance improvements obtained by extending the PolyFormer model with our additions, showing how each modification contributes to the model's overall performance. First, we used the Kolmogorov-Arnold network (KAN) decoder branch [29] on its own for segmentation prediction. This yielded a modest improvement on the RefCOCO and RefCOCO+ datasets but a slight decrease on the RefCOCOg dataset, suggesting the need for further refinement. Next, we combined the predictions of the KAN and the existing multi-layer perceptron (MLP) decoder branches using our ensemble learning strategy. Relative to the KAN branch alone, this raised mIoU on all three datasets, while oIoU changed by at most 0.04 percentage points. The largest gain came from the SAM-based complementation module. Added on top of the two-branch decoder, it brought the model to 77.14% mIoU and 75.37% oIoU on RefCOCO, 71.75% mIoU and 68.07% oIoU on RefCOCO+, and 70.72% mIoU and 67.75% oIoU on RefCOCOg. These results are particularly noteworthy because they represent the best segmentation performance achieved with only a single decoder fine-tuned on a consumer-grade graphics card, which demonstrates not only the accuracy of our model but also its practicality and accessibility.
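The per-stage gains discussed above can be read off Table 3 directly. The short sketch below only restates that arithmetic: the scores are copied from Table 3 (val splits), and the helper name `gains_over_baseline` is illustrative. Differences are in percentage points.

```python
# (mIoU, oIoU) on the val splits, copied from Table 3.
table3 = {
    "PolyFormer": {
        "RefCOCO": (75.96, 74.82), "RefCOCO+": (70.65, 67.64), "RefCOCOg": (69.36, 67.76)},
    "+ KAN decoder": {
        "RefCOCO": (76.04, 74.98), "RefCOCO+": (70.69, 67.83), "RefCOCOg": (69.29, 67.36)},
    "+ Two-decoder": {
        "RefCOCO": (76.08, 74.97), "RefCOCO+": (70.78, 67.79), "RefCOCOg": (69.44, 67.40)},
    "+ SAM complement": {
        "RefCOCO": (77.14, 75.37), "RefCOCO+": (71.75, 68.07), "RefCOCOg": (70.72, 67.75)},
}

def gains_over_baseline(table, baseline="PolyFormer"):
    """Difference of every configuration to the baseline, in percentage points."""
    base = table[baseline]
    return {
        name: {
            dataset: (round(miou - base[dataset][0], 2), round(oiou - base[dataset][1], 2))
            for dataset, (miou, oiou) in rows.items()
        }
        for name, rows in table.items()
        if name != baseline
    }

gains = gains_over_baseline(table3)
for name, rows in gains.items():
    print(name, rows)
```

For the full framework, this gives gains of 1.18 mIoU and 0.55 oIoU on RefCOCO, and 1.36 mIoU with a 0.01 oIoU deficit on RefCOCOg.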

4.4.2. SAM Input Ablation Experiment

Figure 6 illustrates the design of the SAM-based module. The segment anything model (SAM) [30] accepts a variety of input prompts, including bounding box coordinates, positive and negative sample points, and image masks. To determine how best to use our model's outputs to improve segmentation through SAM, we conducted a series of ablation experiments, whose results are presented in Table 4. To identify the most effective input format quickly, we ran these experiments on the RefCOCO+ dataset, which poses a greater challenge than RefCOCO. When we fed SAM only the bounding box coordinates derived from the dual-branch model's predictions, it achieved 67.46% mIoU and 62.67% oIoU on the RefCOCO+ val split, 71.50% mIoU and 68.91% oIoU on Test A, and 61.06% mIoU and 54.85% oIoU on Test B. We then enriched the prompt with randomly generated positive sample points alongside the bounding box coordinates. This raised the val scores to 69.07% mIoU and 64.42% oIoU, and performance also improved on Test A and Test B. Introducing a noise detection and discard module further refined SAM's predictions, yielding 70.81% mIoU and 67.34% oIoU on val, 74.48% mIoU and 72.51% oIoU on Test A, and 64.52% mIoU and 58.84% oIoU on Test B. Finally, integrating SAM's refined predictions as complementary information back into the dual-branch model gave the overall framework 71.75% mIoU and 68.07% oIoU on val, 75.70% mIoU and 73.46% oIoU on Test A, and 65.69% mIoU and 59.47% oIoU on Test B, thereby attaining the SOTA results. This outcome underscores the contribution of each component of the framework and the benefit of combining them.
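The first two prompt configurations in Table 4 (bounding box only, and bounding box plus positive points) can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: it assumes the decoder output is a list of polygon vertices in pixel coordinates, derives the box as the axis-aligned hull of those vertices, and draws the positive points by rejection sampling inside the polygon. All function names are hypothetical.

```python
import numpy as np

def polygon_to_box(vertices):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the polygon vertices."""
    v = np.asarray(vertices, dtype=float)
    return np.array([v[:, 0].min(), v[:, 1].min(), v[:, 0].max(), v[:, 1].max()])

def points_in_polygon(points, vertices):
    """Even-odd ray-casting test; returns one boolean per point."""
    pts = np.asarray(points, dtype=float)
    v = np.asarray(vertices, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    inside = np.zeros(len(pts), dtype=bool)
    for (x1, y1), (x2, y2) in zip(v, np.roll(v, -1, axis=0)):
        crosses = (y1 > y) != (y2 > y)  # edge straddles the horizontal line through the point
        with np.errstate(divide="ignore", invalid="ignore"):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (x < x_cross)
    return inside

def sample_positive_points(vertices, n_points=3, seed=0, max_tries=1000):
    """Rejection-sample points uniformly in the box and keep those inside the polygon."""
    rng = np.random.default_rng(seed)
    x0, y0, x1, y1 = polygon_to_box(vertices)
    kept = []
    for _ in range(max_tries):
        candidates = rng.uniform([x0, y0], [x1, y1], size=(n_points, 2))
        kept.extend(candidates[points_in_polygon(candidates, vertices)])
        if len(kept) >= n_points:
            break
    return np.array(kept[:n_points]).reshape(-1, 2)

def build_sam_prompts(vertices, n_points=3, seed=0):
    """Collect a box prompt and positive point prompts for one predicted polygon."""
    points = sample_positive_points(vertices, n_points, seed)
    labels = np.ones(len(points), dtype=int)  # 1 marks a positive (foreground) point
    return {"box": polygon_to_box(vertices), "point_coords": points, "point_labels": labels}

triangle = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
prompts = build_sam_prompts(triangle, n_points=3)
```

The dictionary keys are chosen to mirror the `box`, `point_coords`, and `point_labels` arguments of `SamPredictor.predict` in the public `segment_anything` package; loading a SAM checkpoint and calling the predictor are deliberately left out of this sketch.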

5. Conclusions

In this paper, we introduce a novel hybrid framework for referring image segmentation: a dual-decoder model with SAM complementation. The framework is designed to advance cross-modal fusion and to improve the fine-grained understanding of image content. Building on the PolyFormer architecture, it incorporates a KAN decoder branch and a multi-scale feature fusion module, which strengthen the model's capacity to interpret fine details within images. Robustness is further improved by an ensemble learning strategy that combines the predictions of the MLP and KAN decoder branches. A key innovation of the framework is the use of the output of the PolyFormer-based dual decoder as input prompts for SAM. This approach leverages SAM's zero-shot capabilities to refine and optimize segmentation results, notably improving the model's precision in complex visual scenes. Experiments on widely recognized public datasets substantiate the efficacy of our method: the framework achieves SOTA results across the RefCOCO, RefCOCO+, and RefCOCOg datasets, and the ablation studies show the distinct contribution of each component to the model's overall performance.

Author Contributions

Conceptualization, H.C., S.Z. and J.H.; Methodology, H.C.; Software, S.Z.; Validation, J.H.; Formal analysis, H.C.; Investigation, H.C.; Resources, S.Z., K.L., J.Y. and J.H.; Writing—original draft, H.C.; Writing—review & editing, S.Z.; Visualization, H.C.; Supervision, K.L., J.Y. and J.H.; Project administration, S.Z. and J.H.; Funding acquisition, J.Y. and J.H. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Illustration of the referring image segmentation task. Our framework outputs the bounding box and the segmentation mask of the referred target.

Figure 2. Examples of segmentation errors made by PolyFormer. These images share the following characteristics: (1) the predicted position of the segmentation target is basically correct, but the edge point coordinates of the target are not predicted successfully; (2) the Euclidean distance between the predicted edge point coordinates of the target is very small.

Figure 3. The overall structure of our hybrid framework for referring image segmentation: dual-decoder model with SAM complementation.

Figure 4. Schematic of our KAN decoder branch training. During training, we freeze all parameters except for the KAN decoder branch and only use the KAN decoder predictions as output.

Figure 5. The framework of the KAN decoder. It uses a different prediction network and combines more feature information in the decoder input section.

Figure 6. The framework of the SAM-based segmentation completion module. It selects the high-quality information from dual-decoder prediction results as the prompt input for SAM and obtains a new prediction mask based on this prompt.

Figure 7. Segmentation results with noise predicted by SAM. The inputs to SAM are either the top-left and bottom-right coordinates of the target bounding box alone, or those coordinates together with the coordinates of randomly generated positive points. The green box is the bounding box predicted by our framework's decoder, the green points are the randomly generated positive points, and the red arrow points to the segmentation noise predicted by SAM.

Figure 8. Dis_N is the average Euclidean distance between the predicted edge point coordinates of the segmentation target in an image, and mIoU is the mIoU score of the corresponding prediction measured against the ground truth. When Dis_N is less than 4, most of the predicted segmentation results are wrong.
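As a rough illustration of how the Dis_N statistic in Figure 8 could be computed and thresholded, consider the sketch below. The caption does not specify which point pairs are averaged, so the code assumes consecutive vertices of the closed polygon and pixel units; the threshold of 4 is taken from the caption, and the function names are hypothetical.

```python
import numpy as np

def mean_edge_point_distance(vertices):
    """Average Euclidean distance between consecutive vertices of a closed polygon."""
    v = np.asarray(vertices, dtype=float)
    if len(v) < 2:
        return 0.0
    # Pair each vertex with its successor; the last vertex wraps around to the first.
    diffs = np.roll(v, -1, axis=0) - v
    return float(np.linalg.norm(diffs, axis=1).mean())

def is_probably_wrong(vertices, threshold=4.0):
    """Flag predictions whose edge points collapse into a tiny cluster (Dis_N < threshold)."""
    return mean_edge_point_distance(vertices) < threshold

well_spread = [(0, 0), (10, 0), (10, 10), (0, 10)]  # square with side 10
collapsed = [(0, 0), (1, 0), (1, 1), (0, 1)]        # square with side 1
```

Under these assumptions, `well_spread` has a mean distance of 10 and is kept, while `collapsed` has a mean distance of 1 and is flagged as a likely failure.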

Figure 9. Comparison of the results of PolyFormer and our framework on the RefCOCO val set. The green box represents the segmentation target bounding box. The green points represent the positive points input to SAM. The red points represent the segmentation target edge coordinate points predicted by the decoder. The red mask represents the final predicted segmentation mask of our framework. Our framework can perform accurate segmentation complementation on a decoder-predicted basis.

Figure 10. Comparison of the results of PolyFormer and our framework on the RefCOCO+ val set. The green box represents the segmentation target bounding box. The green points represent the positive points input to SAM. The red points represent the segmentation target edge coordinate points predicted by the decoder. The red mask represents the final predicted segmentation mask of our framework. Our framework can perform accurate segmentation complementation on a decoder-predicted basis.

Figure 11. Comparison of the results of PolyFormer and our framework on the RefCOCOg val set. The green box represents the segmentation target bounding box. The green points represent the positive points input to SAM. The red points represent the segmentation target edge coordinate points predicted by the decoder. The red mask represents the final predicted segmentation mask of our framework. Our framework can perform accurate segmentation complementation on a decoder-predicted basis.

Table 1

Comparison of the mIoU score with the state-of-the-art methods on three referring image segmentation benchmarks.

Method	RefCOCO			RefCOCO+			RefCOCOg
Method	val	Test A	Test B	val	Test A	Test B	val	Test
DMN₁₈ [13]	49.78	54.83	45.13	38.88	44.22	32.29	-	-
MCN₂₀ [38]	62.44	64.20	59.71	50.62	54.99	44.69	49.22	49.40
CGAN₂₀ [39]	64.86	68.04	62.07	51.03	55.51	44.06	51.01	51.69
LTS₂₁ [40]	65.43	67.76	63.08	54.21	58.32	48.02	54.40	54.25
VLT₂₁ [10]	65.65	68.29	62.73	55.50	59.20	49.36	52.99	56.65
RefTrans₂₁ [41]	74.34	76.77	70.87	66.75	70.58	59.40	66.63	67.39
LAVT₂₂ [32]	74.46	76.89	70.94	65.81	70.97	59.23	63.34	63.62
M3Att₂₃ [42]	73.60	76.23	70.36	65.34	70.50	56.98	64.92	67.37
SADLR₂₃ [33]	76.52	77.98	73.49	68.94	72.71	61.10	67.47	67.73
PolyFormer₂₃ [20]	75.96	77.09	73.22	70.65	74.51	64.64	69.36	69.88
S²RM₂₄ [43]	76.88	78.43	74.01	70.01	73.81	63.21	69.02	68.49
Ours	77.14	78.33	74.33	71.75	75.70	65.69	70.72	71.43

Table 2

Comparison of the oIoU score with the state-of-the-art methods on three referring image segmentation benchmarks.

Method	RefCOCO			RefCOCO+			RefCOCOg
Method	val	Test A	Test B	val	Test A	Test B	val	Test
RMI+DCRF₁₇ [6]	45.18	45.69	45.57	29.86	30.48	29.50	-	-
RRN₁₈ [5]	55.33	57.26	53.93	39.75	42.15	36.11	-	-
MAttNet₁₈ [44]	56.51	62.37	51.70	46.67	52.39	40.08	47.64	48.61
STEP₁₉ [9]	60.04	63.46	57.97	48.19	52.33	40.41	-	-
CMSA+DCRF₁₉ [15]	58.32	60.61	55.09	43.76	47.60	37.89	-	-
CMPC+DCRF₂₀ [7]	61.36	64.53	59.64	49.56	53.44	43.23	-	-
LSCM+DCRF₂₀ [8]	61.47	64.99	59.55	49.34	53.12	43.50	-	-
SANet₂₁ [45]	61.84	64.95	57.43	50.38	55.36	42.74	-	-
BRINet+DCRF₂₀ [12]	61.35	63.37	59.57	48.57	52.87	42.13	48.04	-
CEFNet₂₁ [11]	62.76	65.69	59.67	51.50	55.24	43.01	-	-
ReSTR₂₂ [46]	67.22	69.30	64.45	55.78	60.44	48.27	-	-
ISPNet₂₂ [47]	65.19	68.45	62.73	52.70	56.77	46.39	53.00	50.08
CRIS₂₂ [48]	70.47	73.18	66.10	62.27	68.08	53.68	59.87	60.36
LAVT₂₂ [32]	72.73	75.82	68.79	62.14	68.38	55.10	61.24	62.09
FSFINet₂₃ [49]	71.23	74.34	68.31	60.84	66.49	53.24	61.51	61.78
SADLR₂₃ [33]	74.24	76.25	70.06	64.28	69.09	55.19	63.60	63.56
PolyFormer₂₃ [20]	74.82	76.64	71.06	67.64	72.89	59.33	67.76	69.05
S²RM₂₄ [43]	74.35	76.57	70.44	65.39	70.63	57.33	65.37	65.30
Ours	75.37	77.20	71.38	68.07	73.46	59.47	67.75	69.50

Table 3

Comparison of mIoU scores and oIoU scores with different model structures on RefCOCO, RefCOCO+, and RefCOCOg val datasets.

Method	RefCOCO		RefCOCO+		RefCOCOg
Method	mIoU	oIoU	mIoU	oIoU	mIoU	oIoU
PolyFormer₂₃ [20]	75.96	74.82	70.65	67.64	69.36	67.76
+ KAN decoder	76.04	74.98	70.69	67.83	69.29	67.36
+ Two-decoder	76.08	74.97	70.78	67.79	69.44	67.40
+ SAM complement	77.14	75.37	71.75	68.07	70.72	67.75

Table 4

Comparison of mIoU scores and oIoU scores for different SAM module operations on the RefCOCO+ referring image segmentation dataset. BB denotes the bounding box input. PP denotes Pos point input. DN denotes the denoising operation. CP denotes the complement operation.

SAM Operation				val		Test A		Test B
BB	PP	DN	CP	mIoU	oIoU	mIoU	oIoU	mIoU	oIoU
✓	✗	✗	✗	67.46	62.67	71.50	68.91	61.06	54.85
✓	✓	✗	✗	69.07	64.42	72.75	69.67	62.96	56.44
✓	✓	✓	✗	70.81	67.34	74.48	72.51	64.52	58.84
✓	✓	✓	✓	71.75	68.07	75.70	73.46	65.69	59.47

References

1. Chen, S.; Zhao, Y.; Jin, Q.; Wu, Q. Fine-Grained Video-Text Retrieval with Hierarchical Graph Reasoning. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA, 14–19 June 2020.

2. Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Honolulu, HI, USA, 21–26 July 2017.

3. Nie, W.; Yu, Y.; Zhang, C.; Song, D.; Zhao, L.; Bai, Y. Temporal-Spatial Correlation Attention Network for Clinical Data Analysis in Intensive Care Unit. IEEE Trans. Bio-Med. Eng.; 2024; 71, pp. 583-595. [DOI: https://dx.doi.org/10.1109/TBME.2023.3309956] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37647192]

4. Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from Natural Language Expressions. Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 108-124.

5. Li, R.; Li, K.; Kuo, Y.C.; Shu, M.; Qi, X.; Shen, X.; Jia, J. Referring Image Segmentation via Recurrent Refinement Networks. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018.

6. Liu, C.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Yuille, A. Recurrent Multimodal Interaction for Referring Image Segmentation. Proceedings of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Honolulu, HI, USA, 22–25 July 2017.

7. Huang, S.; Hui, T.; Liu, S.; Li, G.; Wei, Y.; Han, J.; Liu, L.; Li, B. Referring Image Segmentation via Cross-Modal Progressive Comprehension. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA, 13–19 June 2020.

8. Hui, T.; Liu, S.; Huang, S.; Li, G.; Yu, S.; Zhang, F.; Han, J. Linguistic Structure Guided Context Modeling for Referring Image Segmentation. Proceedings of the Computer Vision—ECCV 2020; Glasgow, UK, 23–28 August 2020.

9. Chen, D.J.; Jia, S.; Lo, Y.C.; Chen, H.T.; Liu, T.L. See-Through-Text Grouping for Referring Image Segmentation. Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV); Seoul, Republic of Korea, 27 October–2 November 2019.

10. Ding, H.; Liu, C.; Wang, S.; Jiang, X. VLT: Vision-Language Transformer and Query Generation for Referring Segmentation. IEEE Trans. Pattern Anal. Mach. Intell.; 2023; 45, pp. 7900-7916. [DOI: https://dx.doi.org/10.1109/TPAMI.2022.3217852] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36306296]

11. Feng, G.; Hu, Z.; Zhang, L.; Lu, H. Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation. Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Nashville, TN, USA, 20–25 June 2021.

12. Hu, Z.; Feng, G.; Sun, J.; Zhang, L.; Lu, H. Bi-Directional Relationship Inferring Network for Referring Image Segmentation. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA, 13–19 June 2020.

13. Margffoy-Tuay, E.; Pérez, J.C.; Botero, E.; Arbeláez, P. Dynamic Multimodal Instance Segmentation Guided by Natural Language Queries. Proceedings of the Computer Vision—ECCV 2018; Munich, Germany, 8–14 September 2018.

14. Shi, H.; Li, H.; Meng, F.; Wu, Q. Key-Word-Aware Network for Referring Expression Image Segmentation. Proceedings of the Computer Vision—ECCV 2018; Munich, Germany, 8–14 September 2018.

15. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-Modal Self-Attention Network for Referring Image Segmentation. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Long Beach, CA, USA, 15–20 June 2019.

16. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W. et al. Hybrid Task Cascade for Instance Segmentation. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Long Beach, CA, USA, 15–20 June 2019.

17. Dai, J.; He, K.; Sun, J. Instance-Aware Semantic Segmentation via Multi-task Network Cascades. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016.

18. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; pp. 386-397. [DOI: https://dx.doi.org/10.1109/TPAMI.2018.2844175] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29994331]

19. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.; 2018; pp. 8759-8768.

20. Liu, J.; Ding, H.; Cai, Z.; Zhang, Y.; Satzoda, R.K.; Mahadevan, V.; Manmatha, R. PolyFormer: Referring Image Segmentation as Sequential Polygon Generation. Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Vancouver, BC, Canada, 17–24 June 2023.

21. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision; Zurich, Switzerland, 6–12 September 2014.

22. Lazarow, J.; Xu, W.; Tu, Z. Instance Segmentation with Mask-supervised Polygonal Boundary Transformers. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA, 18–24 June 2022.

23. Acuna, D.; Ling, H.; Kar, A.; Fidler, S. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.; 2018; pp. 859-868.

24. Castrejon, L.; Kundu, K.; Urtasun, R.; Fidler, S. Annotating Object Instances with a Polygon-RNN. Comput. Vis. Pattern Recognit.; 2017; pp. 4485-4493.

25. Liang, J.; Homayounfar, N.; Ma, W.C.; Xiong, Y.; Hu, R.; Urtasun, R. PolyTransform: Deep Polygon Transformer for Instance Segmentation. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA, 13–19 June 2020.

26. Xie, E.; Sun, P.; Song, X.; Wang, W.; Liu, X.; Liang, D.; Shen, C.; Luo, P. PolarMask: Single Shot Instance Segmentation With Polar Representation. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020; Seattle, WA, USA, 13–19 June 2020.

27. Yu, L.; Poirson, P.; Yang, S.; Berg, A.C.; Berg, T.L. Modeling Context in Referring Expressions. Proceedings of the Computer Vision—ECCV 2016. 14th European Conference; Amsterdam, The Netherlands, 11–14 October 2016.

28. Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.; Murphy, K. Generation and Comprehension of Unambiguous Object Descriptions. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016.

29. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv; 2024; arXiv: 2404.19756

30. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. et al. Segment Anything. arXiv; 2023; arXiv: 2304.02643

31. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv; 2015; arXiv: 1409.1556

32. Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. LAVT: Language-Aware Vision Transformer for Referring Image Segmentation. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA, 18–24 June 2022.

33. Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H.S. Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation. arXiv; 2023; arXiv: 2303.06345[DOI: https://dx.doi.org/10.1609/aaai.v37i3.25428]

34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv; 2023; arXiv: 1706.03762

35. Chen, T.; Saxena, S.; Li, L.; Fleet, D.J.; Hinton, G. Pix2seq: A Language Modeling Framework for Object Detection. arXiv; 2022; arXiv: 2109.10852

36. Chen, T.; Saxena, S.; Li, L.; Lin, T.Y.; Fleet, D.J.; Hinton, G.E. A unified sequence interface for vision tasks. Adv. Neural Inf. Process. Syst.; 2022; 35, pp. 31333-31346.

37. Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. Proceedings of the International Conference on Machine Learning. PMLR; Baltimore, MD, USA, 17–23 July 2022; pp. 23318-23340.

38. Luo, G.; Zhou, Y.; Sun, X.; Cao, L.; Wu, C.; Deng, C.; Ji, R. Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA, 13–19 June 2020.

39. Luo, G.; Zhou, Y.; Ji, R.; Sun, X.; Su, J.; Lin, C.W.; Tian, Q. Cascade Grouped Attention Network for Referring Expression Segmentation. Proceedings of the MM ’20: Proceedings of the 28th ACM International Conference on Multimedia; Seattle, WA, USA, 12–16 October 2020.

40. Jing, Y.; Kong, T.; Wang, W.; Wang, L.; Li, L.; Tan, T. Locate then Segment: A Strong Pipeline for Referring Image Segmentation. Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Nashville, TN, USA, 20–25 June 2021.

41. Li, M.; Sigal, L. Referring Transformer: A One-step Approach to Multi-task Visual Grounding. Adv. Neural Inf. Process. Syst.; 2021; 34, pp. 19652-19664.

42. Liu, C.; Ding, H.; Zhang, Y.; Jiang, X. Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation. IEEE Trans. Image Process.; 2023; 32, pp. 3054-3065. [DOI: https://dx.doi.org/10.1109/TIP.2023.3277791] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37220044]

43. Yang, J.; Zhang, L.; Sun, J.; Lu, H. Spatial Semantic Recurrent Mining for Referring Image Segmentation. arXiv; 2024; arXiv: 2405.09006

44. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. MAttNet: Modular Attention Network for Referring Expression Comprehension. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018.

45. Lin, L.; Yan, P.; Xu, X.; Yang, S.; Zeng, K.; Li, G. Structured Attention Network for Referring Image Segmentation. IEEE Trans. Multimed.; 2022; 24, pp. 1922-1932. [DOI: https://dx.doi.org/10.1109/TMM.2021.3074008]

46. Kim, N.; Kim, D.; Kwak, S.; Lan, C.; Zeng, W. ReSTR: Convolution-free Referring Image Segmentation Using Transformers. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA, 18–24 June 2022.

47. Liu, C.; Jiang, X.; Ding, H. Instance-Specific Feature Propagation for Referring Segmentation. IEEE Trans. Multimed.; 2023; 25, pp. 3657-3667. [DOI: https://dx.doi.org/10.1109/TMM.2022.3163578]

48. Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. CRIS: CLIP-Driven Referring Image Segmentation. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA, 18–24 June 2022.

49. Yang, J.; Zhang, L.; Lu, H. Referring Image Segmentation with Fine-Grained Semantic Funneling Infusion. IEEE Trans. Neural Netw. Learn. Syst.; 2023; pp. 1-12. [DOI: https://dx.doi.org/10.1109/TNNLS.2023.3342462] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38117627]

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

In the realm of human–robot interaction, the integration of visual and verbal cues has become increasingly significant. This paper focuses on the challenges and advancements in referring image segmentation (RIS), a task that involves segmenting images based on textual descriptions. Traditional approaches to RIS have primarily focused on pixel-level classification. These methods, although effective, often overlook the interconnectedness of pixels, which can be crucial for interpreting complex visual scenes. Furthermore, while the PolyFormer model has shown impressive performance in RIS, its large number of parameters and high training data requirements pose significant challenges. These factors restrict its adaptability and optimization on standard consumer hardware, hindering further enhancements in subsequent research. To address these issues, our study introduces a novel two-branch decoder framework with SAM (segment anything model) for RIS. This framework incorporates an MLP decoder and a KAN decoder with a multi-scale feature fusion module, enhancing the model's capacity to discern fine details within images. The framework's robustness is further bolstered by an ensemble learning strategy that consolidates the predictions of the MLP and KAN decoder branches. More importantly, we collect the segmentation target edge coordinates and bounding box coordinates as input prompts for SAM. This strategy leverages SAM's zero-shot capabilities to refine and optimize the segmentation outcomes. Our experimental findings, based on the widely recognized RefCOCO, RefCOCO+, and RefCOCOg datasets, confirm the effectiveness of this method. The results not only achieve state-of-the-art (SOTA) performance in segmentation but are also supported by ablation studies that highlight the contributions of each component to the overall improvement in performance.

Details

Title

A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation

Author

Chen, Haoyuan¹; Zhou, Sihang²; Li, Kuan³; Yin, Jianping³; Huang, Jian²

¹ College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China; [email protected] (H.C.); [email protected] (S.Z.); School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China; [email protected] (K.L.); [email protected] (J.Y.)
² College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China; [email protected] (H.C.); [email protected] (S.Z.)
³ School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China; [email protected] (K.L.); [email protected] (J.Y.)

First page

3061

Publication year

2024

Publication date

2024

Publisher

MDPI AG

e-ISSN

22277390

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/math12193061

ProQuest document ID

3116656592