Abstract

Multimodal sentiment analysis extracts sentiment from multiple modalities such as text, images, audio, and video. Most current sentiment classifiers rely on a single modality and on relatively simple architectures, which limits their effectiveness. This paper studies multimodal sentiment analysis by combining several deep learning text and image processing models. The fusion techniques pair RoBERTa with EfficientNet-B3, RoBERTa with ResNet50, and BERT with MobileNetV2. The paper focuses on improving sentiment analysis through the combination of text and image data. The performance of each fusion model is analyzed using accuracy, F1 score, confusion matrices, and ROC curves. The fusion techniques implemented in this study outperform the previous benchmark models; notably, the RoBERTa and EfficientNet-B3 combination achieves the highest accuracy (75%) and F1 score (74.9%). This research contributes to the field of sentiment analysis by demonstrating the potential of combining textual and visual data for more accurate sentiment prediction, laying the groundwork for future work on multimodal sentiment analysis.
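
The abstract does not specify how the text and image features are fused. A minimal PyTorch sketch of one of the named pairings (RoBERTa with EfficientNet-B3), assuming simple late fusion by concatenation followed by a small classification head, might look like the following; the class name, head dimensions, and dropout rate are hypothetical, not the authors' implementation.

import torch
import torch.nn as nn
from torchvision import models
from transformers import RobertaModel

class RobertaEfficientNetFusion(nn.Module):
    # Hypothetical late-fusion sentiment classifier: concatenates
    # RoBERTa text features with EfficientNet-B3 image features.
    def __init__(self, num_classes=3):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        effnet = models.efficientnet_b3(weights="IMAGENET1K_V1")
        # Drop the ImageNet head; keep the 1536-d pooled features.
        effnet.classifier = nn.Identity()
        self.image_encoder = effnet
        # 768 (roberta-base hidden size) + 1536 (EfficientNet-B3 features)
        self.classifier = nn.Sequential(
            nn.Linear(768 + 1536, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask, images):
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        # Embedding of the first (<s>) token serves as the text feature.
        text_feat = text_out.last_hidden_state[:, 0, :]   # (B, 768)
        img_feat = self.image_encoder(images)             # (B, 1536)
        fused = torch.cat([text_feat, img_feat], dim=-1)  # (B, 2304)
        return self.classifier(fused)

In use, input_ids and attention_mask would come from a RoBERTa tokenizer, and images would be 300x300 tensors normalized with ImageNet statistics (EfficientNet-B3's native input resolution).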

Details

Title
Multimodal Sentiment Analysis using Deep Learning Fusion Techniques and Transformers
Author
Publication year
2024
Publication date
2024
Publisher
Science and Information (SAI) Organization Limited
ISSN
2158-107X
e-ISSN
2156-5570
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3084414892
Copyright
© 2024. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.