Introduction
The continuous development of information technology has made social media a vital part of people's daily lives [1,2]. Platforms like Sina Weibo, Twitter, and Facebook enable users to share their thoughts and experiences and play an essential role in forming and spreading social opinions [3]. Sentiment analysis helps analyze individual behavior and predict changes in public emotions, which can more objectively reflect the actual situation of online public opinion [4]. Implementing sensible regulatory policies can help governments curb the spread of false information and harmful content, mitigate the adverse effects of public opinion, create an orderly online environment, and maintain social stability [5,6]. Hence, it is significant to conduct sentiment analysis on social media.
In Natural Language Processing (NLP), sentiment analysis is critical. It focuses on extracting and analyzing subjective information from data to understand users' emotions, attitudes, and psychological tendencies [7]. With social media's rapid development, sentiment analysis has been extensively applied to different domains, such as tourism, medicine, and education. In the tourism industry, sentiment analysis can evaluate tourists' satisfaction and needs, improving service quality. In the medical field, it aids doctors in understanding patients' psychological states, enabling more effective treatment of mental issues like depression. In education, sentiment analysis helps teachers comprehend students' emotional changes, allowing them to adjust teaching strategies and enhance educational outcomes [8,9].
The development of sentiment analysis has progressed from traditional unimodal to multimodal methods [10]. Early sentiment analysis relied primarily on text-based sentiment lexicons to identify emotional vocabulary within texts [11–18]. With the rise of machine learning and deep learning, sentiment analysis has achieved higher accuracy and efficiency and has expanded to multimodal data such as images and sounds. In image sentiment analysis, researchers employ computer vision techniques to recognize facial expressions and body postures to infer emotions [19–22]. In audio sentiment analysis, scholars utilize voice features such as tone, speed, and volume to determine speakers' emotions. For instance, a rising tone indicates happiness, while faster speech suggests anger [23,24]. However, noise in unimodal features can be amplified during fusion, making the enhancement of unimodal features a critical issue. Furthermore, textual data has a greater impact on sentiment analysis than non-textual data. Thus, enhancing text features remains a primary research focus.
Due to the differences and inconsistencies in multimodal data, modality fusion has become an essential aspect of multimodal sentiment analysis [25]. Advances in deep learning have driven the development of multimodal fusion techniques, such as the Attention Mechanism, Long Short-Term Memory Network (LSTM), and Graph Neural Network (GNN), providing strong support for sentiment analysis [26,27,53]. By integrating various modalities, these methods have been shown to improve the accuracy of sentiment analysis. However, multimodal fusion still faces many challenges. On the one hand, the heterogeneity and complexity of the various modalities increase the difficulty of data processing. On the other hand, effectively removing noise and extracting useful features during fusion remains an open problem.
This paper proposes the SECIF model to address these issues. The model comprises two core modules: a semantic enhancement module and a cross-modal interaction fusion module. The semantic enhancement module aims to improve feature extraction capabilities using the proposed GMHA mechanism and ICN module. The GMHA mechanism aggregates important semantic information and reduces interference from noise. The ICN module captures complex contextual dependencies and enhances the ability of text feature representations. The cross-modal interaction fusion module minimizes the differences and inconsistencies between different modalities. The efficiency and advantages of the SECIF model in social media sentiment analysis are validated using a self-scraped Sina Weibo public opinion dataset and two publicly available datasets.
The main contributions of this paper are as follows:
1. Firstly, a novel sentiment analysis model called SECIF is established to bridge the gap between the semantic representations of text and images for more accurate results.
2. Secondly, the proposed GMHA mechanism aggregates important semantic information, reducing interference from noise.
3. Thirdly, the created ICN module captures complex contextual dependencies, enhancing the capability of text feature representations.
4. Finally, experiments are conducted using a self-scraped Sina Weibo public opinion dataset and two publicly available datasets to verify the accuracy and dependability of the model in social media sentiment analysis, further demonstrating the superiority of the SECIF model.
Related work
This section provides a comprehensive review and discussion of existing literature on unimodal and multimodal sentiment analysis methods.
Unimodal sentiment analysis
Text-based sentiment analysis.
Text sentiment analysis aims to extract emotional information from textual content through parsing and processing. It achieves a deep understanding and precise expression of sentiment. Text sentiment analysis can currently be classified into three primary approaches: lexicon-based, machine learning-based, and deep learning-based methods.
Lexicon-based methods depend on pre-constructed sentiment dictionaries to match the polarity of words in the text. The overall sentiment inclination is then calculated by counting or weighted summation of the matched words. Nguyen, Bermudez, and Yan et al. developed sentiment lexicons for Vietnamese, Spanish, and Tibetan to support sentiment analysis in different languages [12–14]. Similarly, Mu, Jiang, and Liu et al. created lexicons for finance, music, and photography to address the specific sentiment analysis needs within those fields [15–17]. Zhao et al. collected approximately 146 million Weibo texts to construct a large-scale sentiment lexicon containing about 100,000 words, which supports sentiment analysis using simple text analysis methods [18]. However, sentiment lexicons often fail to cover all emotional expressions due to the continuous emergence of new words and the ambiguity of polysemous words.
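As a toy illustration of the counting/weighted-summation scoring used by lexicon-based methods, the sketch below applies a small hypothetical lexicon; the words and weights are invented for illustration and are not drawn from the cited lexicons.

```python
# Hypothetical toy lexicon; real lexicons such as those in [12-18] contain
# thousands of entries with carefully assigned polarity weights.
SENTIMENT_LEXICON = {"great": 1.0, "happy": 0.8, "bad": -0.9, "terrible": -1.0}

def lexicon_score(tokens):
    """Sum the polarity weights of matched words and map the total to a label."""
    score = sum(SENTIMENT_LEXICON.get(t, 0.0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Mixed polarity cancels out (1.0 - 1.0 = 0.0), so the result is "neutral",
# which also illustrates a typical weakness of pure lexicon matching.
print(lexicon_score("the hotel was great but the service was terrible".split()))
```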
Machine learning-based methods utilize labeled training data to build models for predicting text sentiment. Zhang et al. introduced a text sentiment analysis method using three-layer granularity, employing a three-branch decision region and support vector machine (SVM) for three-class classification [28]. Tang et al. integrated conditional random fields and dependency syntax rules to extract features for sentiment analysis [29]. Yang et al. utilized the Latent Dirichlet Allocation (LDA) topic model to extract eight environmental themes from a Weibo dataset and applied the eXtreme Gradient Boosting (XGBoost) ensemble model for sentiment analysis [30]. Zeng et al. applied LDA to extract topic distribution features from Weibo texts and employed the Adaptive Boosting (AdaBoost) ensemble classifier to develop the sentiment analysis model [31]. However, machine learning-based methods often rely on vast labeled data, are sensitive to noise, and struggle with handling long-distance dependencies.
Deep learning-based methods transform text into vector representations and build neural network models that capture contextual semantic information for predicting sentiment inclination. Deep neural network models are widely used in sentiment analysis because they effectively capture long-distance dependencies in text [32–34]. Jia et al. reconstructed weighted adjacency matrices for semantic and syntactic graphs, improving the accuracy of aspect-level sentiment analysis [54]. Lu et al. employed GNN to model intra- and inter-modal information to enhance dialogue emotion recognition [55]. Qian et al. used dual BERT models and a dynamic routing algorithm to capture and fuse contextual relationships [35]. The application of deep learning methods in sentiment analysis has become widespread, driving progress through emerging technologies.
Image-based sentiment analysis.
Image feature extraction plays a crucial role in computer vision. It aims to identify essential features within images or video frames and lays the foundation for subsequent image processing and analysis. Image sentiment analysis is divided into two main approaches: machine learning-based and deep learning-based methods.
Machine learning-based methods extract visual features from images, including shape, color, and texture, for training and recognizing emotions. Yanulevskaya et al. applied SVM to Wiccest and Gabor features extracted from images for emotion prediction [58]. Wang et al. employed Logistic Regression (LR) and SVM to predict the sentiments of social media images [19]. Gherkar et al. used SVM to analyze emotions in users' facial photos in restaurant reviews [20]. However, these methods cannot bridge the semantic gap between basic visual features and complex emotional semantics, which reduces the accuracy of sentiment analysis.
Deep learning-based methods simulate the human neural network to extract emotional features in images, effectively addressing the semantic gap. Convolutional Neural Network (CNN) has gained widespread attention as the core method for image feature recognition. Recently, deep CNN architectures like VGG16, VGG19, Xception, ResNet, and ViT have achieved remarkable success in computer vision. Fan et al. integrated VGG19 and Xception models at the feature, intermediate, decision, and hybrid levels to analyze the visual sentiment of public opinion networks [22]. Wang et al. developed a model based on ResNet that combines CBAM with non-local modules for visual sentiment analysis [21]. Yang et al. integrated ResNet and Transformer to extract local and global image features [59]. Additionally, tools like OpenFace2.0 [36] and Facet [37] have been employed in the recognition of facial expressions in images, enhancing the accuracy and depth of image sentiment analysis [38].
Multimodal sentiment analysis
In real-world learning and expression, humans rely on multiple senses. In sentiment analysis, it is essential to combine text, images, and other multimodal data [10,23,24,26,27,56,57]. Since the modal information is complementary and interrelated, integrating information from various modalities improves sentiment analysis precision, enabling a more thorough understanding and articulation of emotions [39].
Multimodal fusion is the key to sentiment analysis. It aims to extract and combine essential details from multiple modalities, reduce redundant noise information, and achieve effective interaction among modalities. Early studies adopted feature fusion and decision fusion methods. Feature fusion enhances model expressiveness by jointly encoding features from various modalities, while decision fusion improves model accuracy by integrating the independent decision results of each modality [40–43]. Several scholars have advanced the development of multimodal sentiment analysis methods by adopting different technical approaches. Zeng et al. used heterogeneous graphs to integrate knowledge and achieve feature fusion from multiple sources [53]. Yang et al. applied contrastive representation learning and contrastive feature decomposition to improve the representation of multimodal information [56]. Zeng et al. applied a dynamic routing network to capture consistency and difference features by fusing visual and audio features with the text modality as the center [57].
Recently, more researchers have used attention mechanisms for modality fusion. The attention mechanism effectively captures the relevance and importance of features across various modalities by assigning them different weights, thereby enhancing multimodal fusion [44,45]. Hu et al. used a joint attention mechanism to identify significant regions consistent between text and images. They also employed an interactive attention mechanism to focus on feature interaction between different modalities and achieved effective multimodal feature fusion [46]. Li et al. implemented six-modal information interaction through a cross-modal interaction mechanism to enhance single-modal features. They used the multi-head self-attention mechanism to calculate semantic relevance between original and enhanced features, improving the ability to recognize emotional features [47]. Luo et al. concentrated on learning common feature representations of modalities. They used the cross-attention mechanism to allow each modality to gain auxiliary information from the features of other modalities [48]. In the fusion module, scholars utilize the attention mechanism to weigh and connect the emotionally consistent semantic features of various modalities, enhancing the expression ability of the modalities and suppressing the negative impact of weak modalities.
Methods
This section provides a detailed and comprehensive overview of the proposed SECIF model. Fig 1 displays the model's architecture. The model comprises four key components: feature extraction module, semantic enhancement module, cross-modal interaction fusion module, and sentiment prediction module. Each sample is divided into text and image modalities. The initial feature vectors are extracted for the text and images within the feature extraction layer. In the semantic enhancement layer, the textual feature vectors are processed by the innovative GMHA mechanism and the ICN module, which capture essential feature information and complex contextual dependencies, extending the capabilities of textual feature representations. The cross-modal interaction fusion module combines enhanced text and image features to achieve alignment and interaction across different modalities. Finally, we complete the sentiment analysis task.
[Figure omitted. See PDF.]
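To make the data flow described above concrete, the following PyTorch-style sketch shows one way the four modules could be wired together. The submodules passed to the constructor (the text and image encoders, GMHA, ICN, and fusion blocks), the hidden dimension, and the pooling step are placeholders and assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SECIFSketch(nn.Module):
    """Illustrative wiring of the four SECIF modules (assumed interfaces)."""
    def __init__(self, text_encoder, image_encoder, gmha, icn, fusion,
                 hidden_dim=768, num_classes=3):
        super().__init__()
        self.text_encoder = text_encoder    # feature extraction: text (e.g., BERT)
        self.image_encoder = image_encoder  # feature extraction: image (e.g., ResNet101)
        self.gmha = gmha                    # semantic enhancement: gathered multi-head attention
        self.icn = icn                      # semantic enhancement: improved capsule network
        self.fusion = fusion                # cross-modal interaction fusion
        self.classifier = nn.Linear(hidden_dim, num_classes)  # sentiment prediction

    def forward(self, text_inputs, images):
        h_t = self.text_encoder(text_inputs)   # HT: initial text features
        h_i = self.image_encoder(images)       # HI: initial image features
        h_t = self.icn(self.gmha(h_t))         # enhanced text features
        h = self.fusion(h_t, h_i)              # text as query, image as key/value
        return torch.softmax(self.classifier(h.mean(dim=1)), dim=-1)  # sentiment polarity P
```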
Feature extraction
Text feature representation.
Pre-trained language models are essential for extracting semantic features from text in NLP. Unlike Word2Vec, BERT addresses polysemy by utilizing contextual information and hierarchical learning. It acquires multi-level semantic features and provides rich feature options for downstream tasks. In multimodal sentiment analysis, BERT is widely used to extract textual features. Specifically, given a text sequence S, the tokens [CLS] and [SEP] are inserted at the start and end of the sequence, respectively, denoted as S = [CLS, t1, t2,..., tn, SEP]. The tokenized text sequence is passed through BERT's embedding layer, which outputs the sum of the token, segment, and positional embeddings. Equation (1) is used to calculate the text feature representation.
HT = BERT(S; θT) (1)
Where S is the text sequence, θT denotes the parameters for extracting text features, and HT indicates the feature representation of the text modality.
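A minimal sketch of Equation (1) using the Hugging Face Transformers library is shown below; the bert-base-chinese checkpoint and the example sentence are assumptions chosen to match the Weibo setting, not details stated by the authors.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

text = "今天的新闻让人很感动"                      # hypothetical Weibo-style sentence
inputs = tokenizer(text, return_tensors="pt")      # adds [CLS] and [SEP] automatically
with torch.no_grad():
    H_T = bert(**inputs).last_hidden_state         # token-level text features, shape (1, n, 768)
```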
Image feature representation.
As a deep convolutional neural network, ResNet successfully alleviates the gradient vanishing and explosion problems of traditional methods. The residual mechanism enables the network to learn the variation between the input and the desired output rather than the entire mapping. Hence, we employ the pre-trained ResNet101 model as the image encoder to extract high-level feature representations from images. Input images are resized to 768 × 768 and uniformly divided into 16 × 16 regions. The pre-trained ResNet101 model processes the images to obtain feature vectors. Equation (2) is used to calculate the image feature representation.
HI = ResNet101(I; θI) (2)
Where I indicates the original image, θI denotes the parameters for extracting image features, and HI is the feature representation of the image modality.
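A minimal sketch of Equation (2) with torchvision is shown below; the ImageNet normalization constants, the adaptive pooling used to obtain the 16 × 16 region grid, and the file name are assumptions rather than details given in the paper.

```python
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((768, 768)),                 # resize as described above
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
encoder = torch.nn.Sequential(*list(resnet.children())[:-2])   # keep convolutional feature maps
pool = torch.nn.AdaptiveAvgPool2d((16, 16))                    # one way to form the 16 x 16 regions

image = Image.open("example.jpg").convert("RGB")               # hypothetical input image
with torch.no_grad():
    fmap = pool(encoder(preprocess(image).unsqueeze(0)))       # (1, 2048, 16, 16)
    H_I = fmap.flatten(2).transpose(1, 2)                      # (1, 256, 2048) region features
```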
Semantic enhancement
The proposed semantic enhancement module comprises the GMHA mechanism and the ICN module. The attention mechanism captures feature relationships and addresses information fragmentation and redundancy [34]. This paper proposes the GMHA mechanism by incorporating the Gather mechanism into the Multi-Head Attention (MHA) mechanism, effectively integrating information across different attention heads. The enhancement improves the model's ability to handle complex data and reduces noise interference. Capsule Network (CapsNet) is an emerging neural network architecture that captures hierarchical and spatial relationships of input data through its capsule layer structure. This structure allows it to represent complex syntactic and semantic relationships while reducing the information loss caused by pooling operations. We create an ICN module that uses capsule vectors to better capture complex contextual dependencies, enhancing the capability of text feature representations. Ultimately, this paper combines the GMHA mechanism and the ICN module to improve feature extraction efficiency.
Propose a GMHA mechanism.
The MHA mechanism is widely applied in NLP, including machine translation, text classification, and language modeling. The input sequence is processed by multiple attention heads in the MHA mechanism. Each head generates an attention matrix that focuses on different aspects of the sequence. The attention matrices allow the model to capture various information, enabling simultaneous parallel processing. The MHA mechanism enhances the models' capacity to capture and aggregate semantic information within the input sequence, improving classification accuracy.
However, the existing MHA mechanism still has limitations in feature quality, noise reduction, and overfitting. This paper proposes a Gathered Multi-Head Attention (GMHA) mechanism to address these problems. The GMHA mechanism helps the MHA mechanism aggregate important feature information, reduce noise interference, and enhance effectiveness. In NLP, the Gather mechanism extracts critical information from text sequences by assigning weights based on word importance and filtering out the most representative elements. Specifically, the Gather mechanism assigns a weight to each word according to its importance in the text, sorts the words, and selects the top K words with the highest weights as the critical information extraction results. Equation (3) is used to calculate the GMHA mechanism.
GMHA(Q, K, V) = G(MHA(Q, K, V)) (3)
Where G represents the Gather operation, and Q, K, and V denote the projections of the input text sequence into the query, key, and value spaces, respectively.
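The sketch below illustrates one plausible reading of the GMHA mechanism: standard multi-head self-attention followed by a learned importance score and a top-K gather over tokens. The scoring layer, the value of K, and the dimensions are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class GMHASketch(nn.Module):
    """Multi-head self-attention followed by top-K token gathering (illustrative)."""
    def __init__(self, dim=768, num_heads=8, top_k=32):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)    # assumed per-token importance scorer
        self.top_k = top_k

    def forward(self, x):                            # x: (batch, seq_len, dim)
        attn_out, _ = self.mha(x, x, x)              # aggregate semantics across heads
        weights = self.score(attn_out).squeeze(-1)   # (batch, seq_len) importance weights
        k = min(self.top_k, x.size(1))
        idx = weights.topk(k, dim=1).indices         # indices of the K most important tokens
        gathered = torch.gather(
            attn_out, 1, idx.unsqueeze(-1).expand(-1, -1, attn_out.size(-1))
        )                                            # keep only the gathered tokens
        return gathered
```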
Create an ICN module.
In processing data, CNN fails to capture spatial relationships between low-level objects because the scalar values passed from one layer to the next cannot represent the relationships between low-level and high-level features. To address the limitations of traditional CNN, Hinton et al. proposed a new neural network model called CapsNet [49]. Unlike CNN, the vectors in CapsNet capture local features, which helps retain more feature information and spatial relationships. CapsNet applies dynamic routing algorithms to overcome the shortcomings of pooling layers and avoid losing feature information. It also captures complex syntactic and semantic relationships, which is advantageous for modeling relationships between words and phrases in a sentence.
However, CapsNet has limitations in updating the connection weights between capsules of dynamic routing. Inadequate adjustments of the weights can lead to training instability, ultimately impacting the model's performance. To further enhance the performance of the CapsNet, this paper creates an Improved Capsule Network (ICN) module. This module explicitly incorporates a modified squash function to improve numerical stability and ensure smooth, convergent training processes. Thus, the ICN module effectively captures and utilizes critical features to overcome traditional methods' disadvantages.
Firstly, a non-linear activation function transforms the fusion matrices into capsules. Then, the weight matrix Wij learns the input features and computes the prediction vector mij for emotion capsule j, as shown in Equation (4).
mij = Wij mi (4)
Where mi represents the i-th feature capsule, indicating low-level emotion features, mij is the prediction vector for emotion capsule j, and Wij is the transformation matrix. The dynamic routing algorithm iteratively updates the log-prior values bij between capsules, from which the coupling coefficients are calculated, as shown in Equation (5).
bij = bij + mij · vj (5)
Where vj denotes the vector output of emotion capsule j, and bij denotes the log-prior probability of coupling between feature capsule i and sentiment capsule j. By updating bij, cij is subsequently changed. The coupling coefficients range from zero to one, and each capsule's vector length represents the probability of a specific emotion, as calculated in Equation (6).
cij = exp(bij) / Σk exp(bik) (6)
Where cij denotes the weight assigned to capsule j by capsule i, and bij represents the raw coupling coefficient between capsule i and capsule j in the dynamic routing algorithm. Next, the weighted sum of all mij generates Sj, the j-th capsule in the sentiment capsule layer, as shown in Equation (7).
Sj = Σi cij mij (7)
We modify the squash function to enhance numerical stability and ensure smooth and convergent training processes, as shown in Equation (8).
(8)
Finally, the dynamic routing algorithm calculates the dot product of mij and vj to update bij. After completing the iterations, the routing process determines the final output, with the endpoint representing the correctly predicted capsule.
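A compact sketch of the dynamic routing procedure described in Equations (4)–(8) is given below. The squash shown is the standard form with a small epsilon for numerical stability; the paper's specific squash modification and the number of routing iterations are not reproduced here and are assumptions.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Standard squash nonlinearity; eps is one common numerical-stability tweak."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def dynamic_routing(m_hat, num_iters=3):
    """m_hat: prediction vectors mij, shape (batch, in_caps, out_caps, dim),
    obtained by applying Wij to each feature capsule mi (Equation (4))."""
    b = torch.zeros(m_hat.shape[:3], device=m_hat.device)    # log-priors bij
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                               # coupling coefficients cij, Eq. (6)
        s = (c.unsqueeze(-1) * m_hat).sum(dim=1)              # weighted sum Sj, Eq. (7)
        v = squash(s)                                         # capsule outputs vj, Eq. (8)
        b = b + (m_hat * v.unsqueeze(1)).sum(dim=-1)          # agreement update of bij, Eq. (5)
    return v                                                  # (batch, out_caps, dim)
```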
Cross-modal interaction fusion.
Increasing dynamic interactions between modalities is crucial in multimodal sentiment analysis. Recent studies have shown that textual features contribute more significantly than non-textual features to sentiment representation in multimodal sentiment analysis [50–52,56]. Visual features may lead to discrepancies between sentiment categories and the authors' intentions. Scholars have adopted various fusion strategies and attention mechanisms to optimize the fusion of textual and visual features. However, these approaches still struggle to reduce the interference of visual features in sentiment representation. We utilize a cross-modal interaction fusion module built on the cross-attention mechanism to solve this issue. The method uses textual features as the primary focus, combining image features as an auxiliary component to achieve adaptive fusion and reduce potential visual feature interference. Additionally, the module allows the feature representation of each modality to obtain auxiliary information from the other modality, promoting information exchange between modalities.
After obtaining the single-modal sentiment features, we fuse the text and image features using the cross-attention mechanism. In the cross-attention mechanism, the inputs for Query, Key, and Value come from two sequences, X and Y. X is fed into the Query, while Y is fed into the Key and Value. The attention mechanism calculates the correlation between X and Y and multiplies the attention weights with the Value to obtain cross-modal interaction features. The method enables adaptive fusion between modalities and reduces the interference of redundant information. Equation (9) is used to compute the image-guided text embedding representation.
H = softmax((X WQ)(Y WK)ᵀ / √d) (Y WV) (9)
Where X represents the textual features, Y represents the image features, WQ, WK, and WV are the linear transformation matrices of the features, d is the feature dimension, and H is the fused feature containing cross-modal interaction information.
By applying the cross-attention mechanism, we can leverage the image information to guide the text processing and achieve cross-modal information interaction and fusion. The method improves model performance and reduces overfitting risk.
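A minimal sketch of the cross-modal interaction fusion in Equation (9) is shown below, with text features as the query and projected image features as the key and value. The dimensions and the residual connection back to the text features are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusionSketch(nn.Module):
    """Text-as-query cross-attention over image regions (illustrative)."""
    def __init__(self, text_dim=768, image_dim=2048, num_heads=8):
        super().__init__()
        self.proj_img = nn.Linear(image_dim, text_dim)  # align image features to the text space
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, n_tokens, text_dim); image_feats: (batch, n_regions, image_dim)
        img = self.proj_img(image_feats)
        fused, _ = self.cross_attn(query=text_feats, key=img, value=img)
        return fused + text_feats                       # residual keeps text as the primary modality
```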
Sentiment analysis
After obtaining the final fused features H, we input them into a fully connected layer and normalize the output with softmax. Equation (10) is used to obtain the sentiment polarity P.
P = softmax(WP H + bP) (10)
Where WP and bP denote the weight matrix and bias of the fully connected layer.
Loss function
During training, the Adam optimizer updates the model weights for the classification task. We adopt the KL divergence loss, and Equation (11) is used to calculate it.
LKL = Σi p(i) log(p(i) / q(i)) (11)
Where p(i) is the probability of the actual label being i, and q(i) is the probability of the predicted label being i. The cross-entropy loss function evaluates the discrepancy between the predicted results and the actual values. Equation (12) is used to calculate the cross-entropy loss.
LCE = −Σi yi log(ŷi) (12)
Where yi denotes the actual label of sample i, and ŷi is the predicted probability of the sample belonging to class i. Equation (13) is used to calculate the total objective loss.
L = LKL + LCE (13)
Where LKL is the KL divergence loss, LCE is the cross-entropy loss, and L is the total objective loss.
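The sketch below illustrates the prediction head and the combined objective in PyTorch. Treating the ground-truth labels as one-hot distributions for the KL term, pooling the fused tokens by averaging, and summing the two losses with equal weights are all assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Linear(768, 3)                            # three sentiment classes, Eq. (10)

def training_loss(fused_features, labels):
    """fused_features: (batch, n_tokens, 768); labels: (batch,) with values in {0, 1, 2}."""
    logits = classifier(fused_features.mean(dim=1))       # pool fused tokens, then project
    log_q = F.log_softmax(logits, dim=-1)                 # predicted distribution (log space)
    p = F.one_hot(labels, num_classes=3).float()          # actual label distribution (assumed one-hot)

    loss_kl = F.kl_div(log_q, p, reduction="batchmean")   # KL divergence loss, Eq. (11)
    loss_ce = F.cross_entropy(logits, labels)             # cross-entropy loss, Eq. (12)
    return loss_kl + loss_ce                              # total objective, Eq. (13), assumed equal weights
```

In training, this loss would be minimized with torch.optim.Adam, matching the optimizer named in the implementation details.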
Experiments and discussions
Experimental setup
Dataset.
In this study, we scrape posts about a series of public opinion events from Sina Weibo, the largest online platform in China, to build our dataset. The events include “Xu Shen Kindergarten,” “Datong Underage Bullying,” “Cultural Tourism Promotion,” “Demise of Four Major Fraud Families in Northern Myanmar,” and “Chongqing Siblings’ Death Case,” among others, spanning from February 2023 to February 2024. A total of 11,165 instances containing both text and images are collected. We manually annotate the sentiment labels, categorizing them into positive, neutral, and negative sentiments. The final dataset is randomly divided into training and testing sets with a split ratio of 8:2. The distribution of the Weibo dataset is shown in Table 1.
[Figure omitted. See PDF.]
The data in this study is collected from publicly available sources, strictly following the terms and conditions of the source platform. The data analysis and processing are solely for academic research purposes. We ensure the legality of the collection and analysis method.
Implementation details.
Our experiments utilize a single RTX 4090 24GB GPU for training and testing. Model parameters are optimized with the Adam optimizer. The experiments are implemented in Python 3.12 and PyTorch 2.3.0, with CUDA 12.1 to support GPU-accelerated computation. The specific model parameters are listed in Table 2.
[Figure omitted. See PDF.]
Evaluation metrics.
To evaluate the effectiveness of the proposed model, we select several metrics, including accuracy, precision, recall, and F1-score. Equations (14)–(17) are used to calculate the evaluation metrics.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (14)
Precision = TP / (TP + FP) (15)
Recall = TP / (TP + FN) (16)
F1 = 2 × Precision × Recall / (Precision + Recall) (17)
TP denotes true positives, TN denotes true negatives, FP denotes false positives, and FN denotes false negatives. All evaluation metrics range from zero to one, with higher values reflecting better model performance.
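As a quick sketch, the metrics in Equations (14)–(17) can be computed with scikit-learn; the toy labels below and the use of macro averaging over the three classes are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]   # hypothetical ground-truth labels (0=negative, 1=neutral, 2=positive)
y_pred = [0, 1, 2, 1, 1, 0]   # hypothetical model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"Acc={accuracy:.3f}  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```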
Experiment analysis
To evaluate the effectiveness of the proposed SECIF model, this section conducts comparative experiments with different modalities, including text-only, image-only, and text-image bimodal settings. A brief introduction to the compared models follows.
Text-only models.
1. Single models
1. Sentiment Lexicon: A predefined sentiment lexicon is used to evaluate sentiment tendencies by matching words in the text.
2. SVM: Construct one or more hyperplanes in high-dimensional space to classify data effectively.
3. BiLSTM: Enhance the capture of contextual dependencies by considering both forward and backward context information in text data.
4. BERT: Utilize extensive pre-training on text data to capture complex contextual relationships through the Transformer structure, achieving deep semantic understanding.
2. Hybrid models
1. BM: Based on the BERT model, combined with multi-head attention to capture complex contextual relationships in text.
2. BGM: Extend BM by introducing the GMHA, enhancing the aggregation of textual information.
3. BGMIC1: Integrate BGM with an original CapsNet.
4. BGMIC2: Similar to BGMIC1, but with the coefficient of the capsule squash function set to 0.5.
5. BGMIC3: Integrate BGM and the ICN module.
Image-only models.
1. CNN: Extract low- to high-level features from images through successive convolution operations.
2. VGG16: A 16-layer CNN with small 3x3 filters is used for image classification.
3. ResNet50: A deep residual network with 50 layers is used for image classification.
4. ResNet101: An extended version of ResNet50 with 101 layers, providing a more profound residual network architecture.
Text and image models.
1. ElemAdd: Combine text features from BGMIC3 and image features from ResNet101 through element-wise addition.
2. Concat: Concatenate text features from BGMIC3 with image features from ResNet101.
3. ElemMul: Combine text features from BGMIC3 and image features from ResNet101 through element-wise multiplication.
4. BGMR: Integrate text features from BGM with image features from ResNet101 using the cross-modal interaction attention mechanism. Text features serve as queries, and image features as keys and values, allowing interactive fusion of both modalities.
5. BGMICR1: Integrate text features from BGMIC1 with image features from ResNet101 using a cross-modal interaction attention mechanism.
6. BGMICR2: Text features from BGMIC2 with image features from ResNet101 are integrated using a cross-modal interaction attention mechanism.
7. BGMICR3(SECIF): Text features from BGMIC3 with image features from ResNet101 are integrated using a cross-modal interaction attention mechanism, as proposed in this study.
In the text-only hybrid models, this paper introduces the multi-head attention mechanism, the Gather mechanism, and the improved capsule network in turn. To further enhance the data, we fuse the image information and compare the performance of different text-image fusion models. Table 3 presents the experimental results of the various models on the Weibo dataset.
[Figure omitted. See PDF.]
Firstly, in the single text models experiment, SVM achieves a 9.74% higher accuracy than the traditional sentiment lexicon, demonstrating the advantages of machine learning in handling new vocabulary and polysemy. Additionally, the average accuracy of BiLSTM and BERT shows an improvement of 8.58% and 19.16% over SVM and sentiment lexicon, respectively. Deep learning models can better comprehend complex contextual information through bidirectional encoding techniques. Among the models, BERT performs the best, achieving an accuracy of 84.19% by leveraging rich linguistic features from large-scale corpora.
Secondly, in the hybrid text models experiment, the hybrid text methods are superior to single text methods. BM achieves a 0.21% higher accuracy than BERT, demonstrating that the multi-head attention mechanism can capture and integrate diverse semantic information from the text. The accuracy of BGM improves by 1.58% over BM. BGM achieves a 1.8% improvement over BERT, indicating that the proposed GMHA mechanism more effectively aggregates critical details within the text. The average accuracy of BGMIC1, BGMIC2, and BGMIC3 reaches 86.98%, reflecting a 1.48% improvement over BGM. This result demonstrates the capsule network can better understand semantic information and improve feature extraction efficiency. BGMIC3 achieves a 2.77% higher accuracy than BGM, further demonstrating the proposed ICN module's effectiveness in capturing complex contextual dependencies and enhancing the capability of text feature representations. Additionally, BGMIC3’s accuracy improves by 4.63% over BERT, showing the advantage of the semantic enhancement module in enhancing contextual feature extraction, leading to better classification performance.
Thirdly, in the image-only models experiment, VGG16's accuracy improves by 15.64% over the traditional CNN, showing that increased convolutional layers enhance classification performance. The average accuracy of ResNet50 and ResNet101 achieves a 4.74% improvement over VGG16, demonstrating that introducing residual blocks enhances image classification. ResNet101 achieves a 1.10% higher accuracy than ResNet50, indicating that an increased number of residual blocks continues to improve classification performance.
In the comparative experiment between text and image models, the image models achieve an average accuracy of 49.83%. The text models achieve an average accuracy exceeding 80%, suggesting that text typically contains richer semantic information and emotional features, making it more suitable for analyzing sentiment on social media. The experiment demonstrates the efficacy of the strategy employed in this study, where text serves as the primary feature and images are used as auxiliary features.
Finally, in the text-image bimodal models experiment, simple concatenation methods, such as ElemAdd and Concat, result in lower accuracy than the text-only hybrid models due to interference from image noise. The accuracy of ElemMul is lower than that of BGMIC3, indicating that ElemMul, as a simple fusion method, lacks effective integration of text-image interactions. The average accuracy of BGMICR1, BGMICR2, and BGMICR3 improves by 1.20% over BGMR, demonstrating the advantage of the Capsule Network in enhancing the understanding of semantic information and feature extraction efficiency. BGMICR3 achieves a 2.20% higher accuracy than BGMR, indicating the efficacy of the proposed ICN module in capturing complex contextual dependencies and enhancing the capability of text feature representations. Additionally, the average accuracy of BGMICR1, BGMICR2, and BGMICR3 is 89.97%, representing improvements of 2.13% and 69.69% over the best-performing text model BGMIC3 and image model ResNet101, respectively. This improvement highlights the significant advantage of the cross-modal interaction fusion module in integrating text and image features.
In general, the proposed SECIF (BGMICR3) model outperforms other models and effectively improves the accuracy of sentiment analysis predictions on social media. Fig 2 visualizes a more intuitive comparison of the performance differences among the models.
[Figure omitted. See PDF.]
Visualization analysis
In this study, we utilize the confusion matrix to illustrate the classification performance of text-only and image-only models, as shown in Fig 3 and Fig 4. The visualization results provide intuitive representations of the models' performance in classifying positive, neutral, and negative sentiments, highlighting the models' strengths and weaknesses.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Confusion matrix of text-only models
Fig 3 shows the classification visualization of text-only models. In the single text models experiment, the sentiment lexicon, SVM, and Bi-directional Long Short-Term Memory (BiLSTM) perform better at classifying negative sentiment but struggle with neutral and positive sentiments. The result indicates a higher sensitivity to negative sentiment in social media under the balanced data condition. In contrast, BERT shows outstanding performance across positive, neutral, and negative sentiment classifications, illustrating the superior ability to capture contextual information and handle complex sentiment information within the text.
In the hybrid text models experiment, BM outperforms BERT in classifying negative and positive sentiments, significantly reducing misclassifications across categories. The improvement suggests that the multi-head attention mechanism effectively captures semantic information from the text. Furthermore, BGM increases the number of correctly classified samples compared to BM, particularly in neutral and positive sentiment classifications, indicating the advantages of the GMHA mechanism in information aggregation. Compared to BERT, BGMIC1, BGMIC2, and BGMIC3 significantly improve classifying positive and negative sentiments. Among these models, BGMIC3 achieves the highest accuracy in classifying positive sentiment by introducing the ICN module, leading to the best overall classification performance. This experiment further validates the effectiveness of the proposed semantic enhancement module.
Confusion matrix of image-only models
Fig 4 presents a visualization of the classification results of image-only models. The CNN model performs well in classifying negative sentiment but struggles with neutral sentiment. As the convolution layers increase, VGG16 improves negative sentiment classification compared to CNN. With the introduction of residual blocks, ResNet50 and ResNet101 enhance the accuracy of neutral sentiment classification and effectively reduce errors in negative sentiment classification. ResNet101, in particular, excels in neutral sentiment classification and further improves accuracy in identifying negative sentiment, demonstrating greater adaptability in sentiment analysis tasks.
Loss rate analysis
This study conducts a comparative analysis to evaluate the performance of various multimodal feature fusion strategies. Fig 5 shows the loss rate trends of the multimodal fusion models, highlighting differences in convergence speed and final performance.
[Figure omitted. See PDF.]
Concat initially starts with a lower loss rate but experiences a slower decline, ultimately reaching the highest loss rate among the models, indicating Concat has a weaker aggregation capability in feature fusion. ElemMul shows a high initial loss rate but converges faster and achieves a lower loss rate than ElemAdd and Concat, indicating strong convergence ability. ElemAdd’s overall loss rate is similar to ElemMul’s, but it converges more slowly than ElemMul in the early stages. BGMR achieves a lower loss rate than ElemMul, indicating the importance of effective feature fusion in multimodal sentiment analysis. BGMICR1, BGMICR2, and BGMICR3 consistently maintain lower loss rates than BGMR, demonstrating the crucial role of the Capsule Network in rapid convergence. BGMICR1, BGMICR2, and BGMICR3 perform exceptionally well in sentiment analysis. Their loss rates decrease rapidly in the initial stages, stabilize by the fourth iteration, and reach the lowest overall loss rates, demonstrating the effectiveness of the cross-modal interaction fusion module. Additionally, BGMICR3 performs the best among all the fusion models, demonstrating the efficacy of the ICN module in capturing complex contextual dependencies and rapid convergence.
Baseline models comparison experiment
Datasets.
We use two publicly available datasets in the experiment: MVSA-Single and MVSA-Multiple [60]. The MVSA datasets from Twitter consist of text-image pairs categorized as positive, neutral, and negative. The distributions of the MVSA datasets are shown in Table 4.
[Figure omitted. See PDF.]
Baseline models.
The following baseline models are employed for the comparative experiment to verify the superiority of the SECIF model in sentiment analysis.
1. MultiSentiNet [61]: Extract visual semantic information from images to identify key emotional words in the text and integrate features for prediction.
2. HSAN [62]: Use a hierarchical structure to represent tweet levels, extracting visual features from captions and employing a contextual attention mechanism for encoding.
3. Co-MN-Hop6 [63]: Iteratively capture and understand the relationship between text and images.
4. MGNNS [64]: Encode various modalities for hidden representations, utilize a multi-channel GNN for learning, and achieve multimodal fusion through an MHA.
5. CLMLF [65]: Use contrastive learning to align and fuse text and image features via a multi-layer fusion module.
6. MVCN [66]: Reduce redundant visual elements using a text-guided fusion module, maintain feature consistency with an emotion constraint, and address inconsistent labels with adaptive loss calibration.
7. CiteNet [67]: Employ contrastive learning for modality alignment and achieve multimodal fusion with an enhanced integration module.
8. MIGSIE [68]: Improve image-text interaction with a text-guided multi-channel module, use GNN for co-occurrence feature extraction, and integrate multimodal features through a multi-source representation module.
9. DTN [69]: Facilitate modality interaction using a deep cross-modal attention network and two-stage feature fusion with gating and attention to adjust weights dynamically.
10. SRC-Model [70]: Facilitate bidirectional image-text interaction using cross-modal attention, refine emotional representation with a gating module, and combine image and text features through an MHA.
Comparative experimental results.
In this study, a comparative analysis is conducted to evaluate the performance of the proposed model. The comparative experimental results are shown in Table 5.
[Figure omitted. See PDF.]
In the MVSA-Single dataset, the baseline models achieve a mean accuracy of 74.33% and an F1 score of 72.76%. The SECIF model improves accuracy by 6.42% and the F1 score by 8.21%. In the MVSA-Multiple dataset, the baseline models obtain a mean accuracy of 71.03% and an F1 score of 69.43%. The SECIF model demonstrates enhancements of 2.91% in accuracy and 4.84% in the F1 score. Experimental results show the significant performance of the SECIF model over the baseline models. The SECIF model introduces a semantic enhancement module for processing the initial features of the data, which effectively reduces the influence of noise and enhances the representation of text features. In addition, compared with the simple attention mechanism model, SECIF captures the relationship between text and image more accurately through the cross-modal interaction fusion module.
Case study
This study examines the SECIF cases to determine the reasons for the errors and to suggest improvements. Fig 6 shows examples of correct and incorrect predictions.
[Figure omitted. See PDF.]
In Fig 6(a), the text contains phrases such as “powerless” and “severe punishment quickly,” which express the blogger's negative sentiment. The male's serious and sorrowful facial expression also conveys the same emotion. After combining the text and image, the model's prediction matches the ground-truth label, confirming the accuracy of SECIF's prediction. In Fig 6(b), the phrase “so worth it” suggests a positive sentiment, further conveyed by the thumbs-up gesture and the smiling face. However, the real implication of the comment is that the eyebrow pencil is overpriced and does not match its actual value; the comment reflects the blogger's sarcastic tone and negativity. Without external knowledge, the model's understanding of the comment's context is restricted, leading to a deviation in the sentiment analysis. In Fig 6(c), the text discusses the cause, outcome, and recommendations of a public opinion event, suggesting a neutral sentiment. The spliced images are not processed individually during feature extraction, so some crucial details may be missing, resulting in the loss of image information.
Conclusions and future work
Conclusions
Public opinion plays a crucial role in maintaining social stability in social media. As a result, it is necessary to build an accurate sentiment prediction model for public opinion. This study proposes a model based on Semantic Enhancement and Cross-Modal Interactive Fusion (SECIF), which significantly improves the accuracy of sentiment analysis on social media. The conclusions are as follows.
Firstly, the proposed GMHA mechanism is effective in aggregating crucial semantic information. BGM achieves a 1.8% and 1.58% improvement over BERT and BM, respectively. These results show that the GMHA mechanism enables more accurate aggregation of vital semantic information, reducing interference from noise and improving sentiment classification accuracy.
Secondly, the proposed ICN module is effective in capturing complex contextual dependencies. In the text-only models experiment, the average accuracy of BGMIC1, BGMIC2, and BGMIC3 achieves a 1.48% improvement over BGM, and BGMIC3 achieves a 2.77% higher accuracy than BGM. Additionally, in the text-image bimodal models experiment, the average accuracy of BGMICR1, BGMICR2, and BGMICR3 improves by 1.20% over BGMR, and BGMICR3 achieves a 2.20% higher accuracy than BGMR. These results show the effectiveness of the ICN module in handling long-distance dependencies and enhancing the capability of text feature representations.
Thirdly, the authenticity of the social media data strengthens the findings. The model is validated using a collected and annotated dataset of opinion images and texts from Sina Weibo, confirming the reliability of the experimental results and the feasibility of practical applications. This solid foundation for performance evaluation enhances the real-world significance and reference value of the research outcomes.
Finally, our model demonstrates excellent performance on the two publicly available datasets. We compare the proposed model with ten baseline models. The experimental results indicate that the SECIF model improves accuracy by 4.70% and the F1 score by 6.56% on average compared to the baselines across the MVSA datasets.
In summary, the proposed semantic enhancement module significantly improves the model's capability to capture and understand sentiment expressions on social media. The proposed SECIF model elevates the performance of sentiment analysis, provides a powerful tool for public opinion management, and offers valuable insights for future research in this domain.
Future work
In future work, we plan to conduct additional research to achieve even more accurate sentiment analysis in social media. On the one hand, the proposed model fails to understand the context of the comments in recognizing sarcastic and implied emotions. We will further introduce external knowledge to identify complex emotions. On the other hand, some vital details in the clipping operation may be missing, which will cause the loss of image information and affect sentiment analysis ability. We will improve image processing techniques by integrating image segmentation, spatial attention, and channel attention. Moreover, we will add more datasets to strengthen the model's ability to capture and understand user emotions on different platforms.
References
1. Cambria E, Wang H, White B. Guest Editorial: Big Social Data Analysis. Knowledge-Based Systems. 2014;69:1–2.
2. Cortis K, Davis B. Over a decade of social opinion mining: a systematic review. Artif Intell Rev. 2021;54(7):4873–965. pmid:34188346
3. Yang J, Xiao Y, Du X. Multi-grained fusion network with self-distillation for aspect-based multimodal sentiment analysis. Knowledge-Based Systems. 2024;293:111724.
4. Hu R, Yi J, Chen A, Chen L. Multichannel cross-modal fusion network for multimodal sentiment analysis considering language information enhancement. IEEE Trans Ind Inf. 2024;20(7):9814–24.
5. Van Bavel JJ, Robertson CE, Del Rosario K, Rasmussen J, Rathje S. Social media and morality. Annu Rev Psychol. 2024;75:311–40. pmid:37906950
6. Mu G, Li J, Liao Z, Yang Z. An Enhanced IHHO-LSTM Model for Predicting Online Public Opinion Trends in Public Health Emergencies. Sage Open. 2024;14(2).
7. Almalis I, Kouloumpris E, Vlahavas I. Sector-level sentiment analysis with deep learning. Knowledge-Based Systems. 2022;258:109954.
8. Chutia T, Baruah N. A review on emotion detection by using deep learning techniques. Artif Intell Rev. 2024;57(8).
9. Singh U, Abhishek K, Azad HK. A Survey of Cutting-edge Multimodal Sentiment Analysis. ACM Comput Surv. 2024;56(9):1–38.
10. Li Z, Guo Q, Pan Y, Ding W, Yu J, Zhang Y, et al. Multi-level correlation mining framework with self-supervised label generation for multimodal sentiment analysis. Information Fusion. 2023;99:101891.
11. Wang Y, Huang G, Li M, Li Y, Zhang X, Li H. Automatically Constructing a Fine-Grained Sentiment Lexicon for Sentiment Analysis. Cogn Comput. 2022;15(1):254–71.
12. Nguyen H, Le T, Le H, Pham T. Domain specific sentiment dictionary for opinion mining of Vietnamese text. Multi-disciplinary trends in artificial intelligence. Springer International Publishing. 2014:136–48.
13. Bermudez-Gonzalez D, Miranda-Jiménez S, García-Moreno R-U, Calderón-Nepamuceno D. Generating a Spanish affective dictionary with supervised learning techniques. New perspectives on teaching and working with languages in the digital era. 2016:327–38.
14. Yan X, Huang T. Tibetan sentiment classification based on emotion dictionary. Journal of Chinese Information Processing. 2018;32(2):75–80.
15. Mu G, Dai L, Ju X, Chen Y, Huang X. MS-IHHO-LSTM: Carbon Price Prediction Model of Multi-Source Data Based on Improved Swarm Intelligence Algorithm and Deep Learning Method. IEEE Access. 2024;12:80754–69.
16. Jiang S, Yang Y, Liao J. Research of building Chinese musical emotional lexicon and emotional classification. Computer Engineering and Applications. 2013;50:118–21.
17. Liu Y, Lu X, Deng K, Ruan K, Liu J. Construction method of sentiment lexicon for photography reviews. Computer Engineering and Design. 2019;40(10):3037–42.
18. Zhao Y, Qin B, Shi Q, Liu T. Large-scale sentiment lexicon collection and its application in sentiment classification. Journal of Chinese Information Processing. 2017;31(2):187–93.
19. Wang Y, Li B. Sentiment Analysis for Social Media Images. 2015 IEEE International Conference on Data Mining Workshop (ICDMW). 2015:1584–91.
20. Gherkar Y, Gujar P, Gaziyani A, Kadu S. Sentiment Analysis of Images using Machine Learning Techniques. ITM Web Conf. 2022;44:03029.
21. Wang X, Wang Q, Cai H. Research on recognition model and recognition effect of network public opinion visual emotion under multi-dimensional attention mechanism. Data Analysis and Knowledge Discovery. 2024;8(1):156–67.
22. Fan T, Wang H, Lin K, Liu Y. A probe into netizen sentiment analysis in vision-based internet public opinion events. Information and Documentation Services. 2022;43:83–91.
23. Fu Y, Zhang Z, Yang R, Yao C. Hybrid cross-modal interaction learning for multimodal sentiment analysis. Neurocomputing. 2024;571:127201.
24. Tang Z, Xiao Q, Zhou X, Li Y, Chen C, Li K. Learning discriminative multi-relation representations for multimodal sentiment analysis. Information Sciences. 2023;641:119125.
25. Lin R, Hu H. Dynamically Shifting Multimodal Representations via Hybrid-Modal Attention for Multimodal Sentiment Analysis. IEEE Trans Multimedia. 2024;26:2740–55.
26. Hu Y, Huang X, Wang X, Lin H, Zhang R. Transformer-based adaptive contrastive learning for multimodal sentiment analysis. Multimed Tools Appl. 2024;84(3):1385–402.
27. Liu S, Gao P, Li Y, Fu W, Ding W. Multi-modal fusion network with complementarity and importance for emotion recognition. Information Sciences. 2023;619:679–94.
28. Zhang Y, Miao D, Zhang Z. Multi-granularity text sentiment classification model based on three-way decisions. Computer Science. 2017;44:188–93.
29. Tang L, Liu C. Extraction of feature and sentiment word pair based on conditional random fields and HITS algorithm. Computer Technology and Development. 2019;29(7):71–5.
30. Yang L, Wang M, Cheng Y. Microblog sentiment analysis of Jiangsu environmental public service based on LDA and XGBoost models. Journal of Nanjing University of Posts and Telecommunications (Social Science). 2019;21:23–39.
31. Zeng Z, Yang Q. Sentiment analysis for micro-blogs with LDA and AdaBoost. Data Analysis and Knowledge Discovery. 2018;2:51–9.
32. Wang X, Fan M, Kong M, Pei Z. Sentiment lexical strength enhanced self-supervised attention learning for sentiment analysis. Knowledge-Based Systems. 2022;252:109335.
33. Jelodar H, Wang Y, Orji R, Huang S. Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP Using LSTM Recurrent Neural Network Approach. IEEE J Biomed Health Inform. 2020;24(10):2733–42. pmid:32750931
34. Zeng Z, Yu P. Sentiment analysis of public safety events in micro-blog based on double-layered attention and bi-LSTM. Information Science. 2019;37:23–9.
35. Qian Y, Wang J, Li D, Zhang X. Interactive capsule network for implicit sentiment analysis. Appl Intell. 2022;53(3):3109–23.
36. Baltrusaitis T, Zadeh A, Lim YC, Morency L-P. OpenFace 2.0: Facial behavior analysis toolkit. 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). 2018:59–66.
37. Baltrusaitis T, Ahuja C, Morency L-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans Pattern Anal Mach Intell. 2019;41(2):423–43. pmid:29994351
38. Mai S, Zeng Y, Zheng S, Hu H. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Trans Affective Comput. 2023;14(3):2276–89.
39. Das R, Singh TD. Multimodal sentiment analysis: A survey of methods, trends, and challenges. ACM Comput Surv. 2023;55(13s):1–38.
40. Morency L-P, Mihalcea R, Doshi P. Towards multimodal sentiment analysis. Proceedings of the 13th International Conference on Multimodal Interfaces. 2011.
41. Hussain SM, Calvo RA, Aghaei Pour P. Hybrid fusion approach for detecting affects from multichannel physiology. Berlin, Heidelberg: Springer. 2011:568–77.
42. Majumder N, Hazarika D, Gelbukh A, Cambria E, Poria S. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowledge-Based Systems. 2018;161:124–133.
43. Zhou W, Dong S, Lei J, Yu L. MTANet: Multitask-aware network with hierarchical multimodal fusion for RGB-T urban scene understanding. IEEE Trans Intell Veh. 2023;8(1):48–58.
44. Zhu C, Yi B, Luo L. Base on contextual phrases with cross-correlation attention for aspect-level sentiment analysis. Expert Systems with Applications. 2024;241:122683.
45. Cheng H, Yang Z, Zhang X, Yang Y. Multimodal sentiment analysis based on attentional temporal convolutional network and multi-layer feature fusion. IEEE Trans Affective Comput. 2023;14(4):3149–63.
46. Hu H, Ding Z, Zhang Y, Liu M. Images-text sentiment analysis in social media based on joint and interactive attention. Journal of Beijing University of Aeronautics and Astronautics. 2013;38:1–11.
47. Li M, Zhang J, Zhang X, Liu L. Multimodal sentiment analysis based on cross-modal semantic information enhancement. Journal of Frontiers of Computer Science & Technology. 2023:1–13.
48. Luo Y, Wu R, Liu J, Tang X. Multimodal sentiment analysis method for sentimental semantic inconsistency. Journal of Computer Research and Development. 2024:1–12.
49. Mukherjee S, Ahmed N, Vasantha Ramachandran R, Bhat R, Kumar Saini D, Ghosh A. Anisotropy, topography and non-newtonian properties of cellular interiors probed by helical magnetic nanobots. J Microbio Robot. 2025;21(1):6. pmid:40061047
50. Wang D, Liu S, Wang Q, Tian Y, He L, Gao X. Cross-modal enhancement network for multimodal sentiment analysis. IEEE Trans Multimedia. 2023;25:4909–21.
51. Rahman W, Hasan MK, Lee S, Zadeh A, Mao C, Morency L-P, et al. Integrating multimodal information in large pretrained transformers. Proc Conf Assoc Comput Linguist Meet. 2020;2020:2359–69. pmid:33782629
52. Zhu C, Chen M, Zhang S, Sun C, Liang H, Liu Y, et al. SKEAFN: Sentiment knowledge enhanced attention fusion network for multimodal sentiment analysis. Information Fusion. 2023;100:101958.
53. Zeng Y, Li Z, Tang Z, Chen Z, Ma H. Heterogeneous graph convolution based on in-domain self-supervision for multimodal sentiment analysis. Expert Systems with Applications. 2023;213:119240.
54. Jia Y, Wu W, Yang C, Gu X, Yan X, Ma T. Multi-interactive attention aspect-level sentiment analysis based on graph convolution network. Computer Engineering and Design. 2023;44.
55. Lu N, Han Z, Han M, Qian J. Bi-stream graph learning based multimodal fusion for emotion recognition in conversation. Information Fusion. 2024;106:102272.
56. Yang J, Yu Y, Niu D, Guo W, Xu Y. ConFEDE: Contrastive Feature Decomposition for Multimodal Sentiment Analysis. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
57. Zeng Y, Li Z, Chen Z, Ma H. A feature-based restoration dynamic interaction network for multimodal sentiment analysis. Engineering Applications of Artificial Intelligence. 2024;127:107335.
58. Yanulevskaya V, van Gemert JC, Roth K, Herbold AK, Sebe N, Geusebroek JM. Emotional valence categorization using holistic image features. 2008 15th IEEE International Conference on Image Processing. 2008:101–4.
59. Yang R, Ma J. A feature-enhanced multi-modal emotion recognition model integrating knowledge and Res-ViT. Data Analysis and Knowledge Discovery. 2023;7.
60. Niu T, Zhu S, Pang L, El Saddik A. Sentiment Analysis on Multi-View Social Data. MultiMedia Modeling. Springer; 2016. pp. 15–27. https://doi.org/10.1007/978-3-319-27674-8_2
61. Xu N, Mao W. MultiSentiNet. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2017:2399–402.
62. Xu N. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). 2017. https://doi.org/10.1109/isi.2017.8004895
63. Xu N, Mao W, Chen G. A co-memory network for multimodal sentiment analysis. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 2018. https://doi.org/10.1145/3209978.3210093
64. Yang X, Feng S, Zhang Y, Wang D. Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
65. Li Z, Xu B, Zhu C, Zhao T. CLMLF: A contrastive learning and multi-layer fusion method for multimodal sentiment detection. Findings of the Association for Computational Linguistics: NAACL 2022. 2022.
66. Wei Y, Yuan S, Yang R, Shen L, Li Z, Wang L, et al. Tackling Modality Heterogeneity with Multi-View Calibration Network for Multimodal Sentiment Detection. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
67. Wang J, Yang Y, Liu K, Xie Z, Zhang F, Li T. CiteNet: Cross-modal incongruity perception network for multimodal sentiment prediction. Knowledge-Based Systems. 2024;295:111848.
68. Bu Y, Bu F, Zhang Z. Multimodal sentiment analysis of global semantic information enhancement under multi-channel interaction. Computer Engineering and Applications. 2024.
69. Yu B, Shi Z. Deep attention and two-stage fusion of image-text sentiment contrastive learning method. Computer Engineering and Applications. 2024.
* View Article
* Google Scholar
70. 70. Zhang S, Liu J, Jiao Y, Zhang Y, Li Z. Sentiment representation calibration-based model for image-text sentiment analysis. Js Beijing Univ Aeronaut Astronaut. 2024.
* View Article
* Google Scholar
Citation: Mu G, Chen Y, Li X, Dai L, Dai J (2025) Semantic enhancement and cross-modal interaction fusion for sentiment analysis in social media. PLoS One 20(4): e0321011. https://doi.org/10.1371/journal.pone.0321011
About the Authors:
Guangyu Mu
Roles: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Validation, Writing – review & editing
Affiliations: School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun, China, Key Laboratory of Financial Technology of Jilin Province, Changchun, China
Ying Chen
Roles: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
Affiliation: School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun, China
Xiurong Li
Roles: Resources
E-mail: [email protected]
Affiliation: Faculty of Information Technology, Beijing University of Technology, Beijing, China
ORCID: https://orcid.org/0000-0002-8508-2185
Li Dai
Roles: Investigation
Affiliation: School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun, China
Jiaxiu Dai
Roles: Investigation
Affiliation: School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun, China
© 2025 Mu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
The rapid development of social media has significantly advanced sentiment analysis, which is essential for understanding public opinion and predicting social trends. However, modality fusion can introduce substantial noise because semantic representations differ across modalities, ultimately degrading classification accuracy. This paper therefore presents a Semantic Enhancement and Cross-Modal Interaction Fusion (SECIF) model for sentiment analysis to address these issues. First, BERT and ResNet extract feature representations from text and images, respectively. Second, the GMHA mechanism is proposed to aggregate important semantic information and mitigate the influence of noise. Then, an ICN module is constructed to capture complex contextual dependencies and strengthen text feature representations. Finally, a cross-modal interaction fusion module is implemented in which text features serve as the primary modality and image features as the auxiliary one, enabling deep integration of textual and visual information. Model performance is optimized by combining cross-entropy and KL divergence losses. Experiments are conducted on a dataset collected from public opinion events on Sina Weibo, and the results demonstrate that the proposed model outperforms the comparison models: SECIF exceeds the average accuracy of the text-only, image-only, and multimodal models by 11.19%, 82.27%, and 4.83%, respectively. On two publicly available datasets, SECIF is compared with ten baseline models and improves accuracy by 4.70% and the F1 score by 6.56%. Through multimodal sentiment analysis, governments can better understand public emotions and opinion trends, facilitating more targeted and effective management strategies.
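To make the loss formulation mentioned in the abstract concrete, the following is a minimal, hypothetical sketch (in PyTorch) of how a cross-entropy term and a KL divergence term can be combined into a single training objective. The function name `combined_loss`, the weighting factor `kl_weight`, the `reference_probs` distribution, and all tensor shapes are illustrative assumptions and do not reproduce the authors' implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor,
                  targets: torch.Tensor,
                  reference_probs: torch.Tensor,
                  kl_weight: float = 0.5) -> torch.Tensor:
    """Illustrative combined objective (not the SECIF loss): cross-entropy
    against the sentiment labels plus a KL divergence term that pulls the
    predicted distribution toward a reference distribution, e.g. one produced
    by another branch of a fusion model."""
    ce = F.cross_entropy(logits, targets)                  # supervised term
    log_probs = F.log_softmax(logits, dim=-1)              # predicted log-distribution
    kl = F.kl_div(log_probs, reference_probs,              # alignment term
                  reduction="batchmean")
    return ce + kl_weight * kl                             # weight is an assumption

# Toy usage with made-up shapes: a batch of 8 samples and 3 sentiment classes.
logits = torch.randn(8, 3, requires_grad=True)
targets = torch.randint(0, 3, (8,))
reference_probs = F.softmax(torch.randn(8, 3), dim=-1)
loss = combined_loss(logits, targets, reference_probs)
loss.backward()
```

In sketches of this kind, the KL term is typically weighted so that it regularizes, rather than dominates, the supervised cross-entropy signal.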