Introduction
The large-scale revenues and rapid growth of e-commerce indicate that consumers have a huge demand for e-commerce services. With the development of the Internet and digital technology, information in various modalities, such as images, videos, audio, and text, has become increasingly abundant in the e-commerce industry. These multimodal data contain rich information and can comprehensively describe the characteristics of users and items. Introducing multimodal data into recommendation systems can therefore help them better understand users' interests and needs and provide more accurate and personalized recommendation services. How to introduce multimodal information into traditional search engines and recommendation systems is thus a key issue that merits extensive research.
Personalized recommendation systems have become indispensable tools for helping users find relevant information from the vast amount of irrelevant content on the Internet. For example, users on e-commerce platforms often select products based on their visual appearance and text descriptions, and both video content and text tags are important for users to find the products they are interested in [1].
Recommendation systems are an effective way to solve the problem of “information overload” [2]. In the 1990s, collaborative filtering was first proposed [3], marking significant progress in the field of personalized recommendation. Building on collaborative filtering, the association rules between users and items have been explored, and many excellent recommendation algorithms have emerged. Rendle et al. [4] proposed matrix factorization (MF) trained with the Bayesian personalized ranking (BPR) loss to directly learn association information between users and items for recommendation. However, it ignores the large amount of contextual information embedded in multimodal data such as images, texts, and videos, which restricts the ability of MF to capture users’ subtle preferences derived from rich multimodal content. Wang et al. [5] proposed the neural graph collaborative filtering (NGCF) algorithm, which introduces graph convolutional layers into the recommendation system; by stacking multiple convolutional layers, the high-order connectivity between users and items is captured. However, it still falls short in modeling complex multimodal interactions and fails to fully exploit the semantic and contextual cues present in multimodal data. He et al. [6] proposed the light graph convolutional network (LightGCN), arguing that the nonlinear activation function and feature transformation matrix in NGCF have no direct benefit for user-ID/item-ID embeddings and can be removed. This simplification reduces model complexity and improves recommendation performance, but it compromises the model’s ability to capture complex nonlinear relationships within and between modalities, which hinders highly accurate recommendations based on multimodal data. Wu et al. [7] proposed self-supervised graph learning (SGL), which adds an auxiliary self-supervised task to the classic supervised recommendation task to strengthen node representation learning and mine the relevance of long-tailed data. However, it cannot fully utilize the complementary information across modalities, which limits its recommendation performance. Deep learning can automatically learn features of multimodal data, thereby mapping different data to the same latent space and obtaining a unified representation [8]. A multimodal recommendation system generally consists of multimodal feature extraction, feature interaction, and recommendation [9]. Multimodal features include tabular features of items and users, descriptive images of items, and review text. Feature interaction fuses the representation vectors of the different modal features to obtain the representation vectors of items and users. After these representation vectors are obtained, the recommendation model computes the recommendation probability. Among these steps, the effective fusion of modal representations that lie in different semantic spaces is a key factor in improving recommendation accuracy. The feature interactions of multimodal data can be classified into three types: bridging, fusion, and filtering [9]. Bridging builds multimodal information transmission channels to enhance the interaction between users and items. Fusion focuses on the multimodal relationships within an item, combining different multimodal information to generate feature vectors.
Filtering involves discarding information that is irrelevant to user preferences and thus removes noisy data. These three methods are frequently used in combination. In addition, different modal representations of the same object contain both unique and shared semantic information, and distinguishing between them can improve recommendation capability. To address this feature-differentiation problem, some studies have proposed disentangled representation learning (DRL) and contrastive learning (CL). DRL is used to deconstruct entangled modal representations, as in DICER [10] and MacridVAE [11]. CL is used for modal alignment, deepening the feature differences between positive and negative samples, as in MICRO [12]. Zhang et al. [12] proposed the multimodal recommendation algorithm MICRO, which integrates item-item information graphs: in addition to user–item interactions, a graph between items is added to capture their semantic similarity, and a contrastive learning objective compares each modality’s feature representation with the fused representation so that the fused representation adaptively extracts information shared across modalities. The recommendation results showed a significant improvement, but we believe there is still room for improvement in modeling the interactions between multimodal items and users: MICRO mainly focuses on intra-modal and inter-modal feature fusion and ignores the deeper integration of user preferences with product attributes across modalities. He and McAuley [13] proposed visual Bayesian personalized ranking (VBPR), which uses a pre-trained model to extract feature representations of multimodal information and adds them to the BPR model to enhance recommendation capability. However, it fails to fully exploit the synergy between modalities, such as combining visual and textual information in a cohesive way, which limits its ability to provide comprehensive and accurate recommendations. Wei et al. [14] proposed a multimodal graph convolution network (MMGCN) that represents user-item interactions and multimodal information as graph structures and models them with graph convolutional networks. Although MMGCN strives to integrate multimodal information, it lacks a strong inter-modal fusion mechanism, so it performs poorly in capturing the complex relationships between modalities. Wei et al. [15] proposed a graph-refined convolutional network (GRCN) that adaptively refines the interaction graph, pruning noisy edges caused by false-positive interactions. However, it does not fully solve the problem of multimodal feature integration: GRCN cannot deeply fuse features of different modalities, which limits its effectiveness for highly personalized recommendations. Liao et al. [16] proposed a deep multimodal rank learning (DMRL) model that builds a time-aware user preference model and combines it with a ranking-based dynamic sampling strategy to improve recommendation accuracy. Liu et al. [17] proposed the PMGT framework, which performs unsupervised pre-training on a graph that encodes the multimodal side information of items and their relationships, thereby learning item representations. Jung et al. [18] proposed the HAT model, which achieves more accurate personalized clothing recommendations by incorporating consumers’ purchase histories. Khalafaoui et al.
[19] proposed the CADMR framework, which effectively integrates heterogeneous multimodal data by combining a cross-attention mechanism with disentangled learning. Wu et al. [20] proposed a multimodal news recommendation algorithm (MM-Rec), which extracts regions of interest (ROIs) from news images with a mask region-based convolutional neural network (Mask R-CNN) and then uses the pre-trained vision-and-language BERT (ViLBERT) model to encode the text and image ROIs and model their intrinsic correlation. Liu et al. [21] recently proposed an attention-based gated recurrent unit (AGRU) model; through the attention mechanism, it effectively captures the sequential characteristics and long-term dependencies of well-logging data and realizes flexible, context-dependent weighting of different parts of the sequence. Tao et al. [22] proposed a multimodal graph attention network (MGAT) that operates on a multimodal interaction graph to capture complex interaction patterns hidden in user behavior. Graph attention networks have high computational complexity, especially on large-scale graph structures, so MGAT may face efficiency challenges on large-scale datasets. Sun et al. [23] proposed a multimodal knowledge-graph self-attention network (MKGAT). Both MGAT and MKGAT depend heavily on large amounts of high-quality multimodal data; if the data are insufficient or noisy, model performance may suffer. Qi et al. [24] proposed a POI category recommendation model (PPCM) based on group preference, which protects user privacy and groups similar users through LSH, combines LSTM with an attention mechanism to capture user dependencies, and uses POI categories to mine user interests, thereby improving recommendation performance. Although MICRO accounts for the importance of each modality’s features and promotes modality integration through contrastive learning, we believe it still has room for further improvement in establishing the interaction between multimodal items and users and in multimodal feature interaction.
Based on previous studies, we believe that the main limitations of current methods are insufficient feature interaction between modalities and the inability to deeply integrate the fused multimodal information with user-item interactions. To address these challenges, we propose a cross-fusion activation deep modal integration method for multimedia recommender systems (CFMM). Our method aims to achieve a deeper fusion of product and user information by introducing a Cross-Attention module, which integrates item features from different modalities with user-item interactions through deep feature fusion. In addition, we design a fusion loss function to optimize the multimodal feature embedding process, ensuring that the fused representation better reflects user preferences and item features.
The main innovation of our work lies in the development of the Cross-Attention module and the fusion loss function, which together enhance the ability of recommender systems to deeply integrate multimodal information. By addressing the limitations of existing methods, our CFMM approach helps advance the state of the art in multimodal recommender system algorithms.
As shown in Fig 1, the CFMM algorithm focuses on improving the performance of multimedia recommendation systems by deeply processing the basic data of users and products, product text descriptions, and product images. First, the algorithm uses SentenceTransformer and a deep CNN to extract the text and image features of the products, respectively, and generates the corresponding embedding vectors. Then, the product review information is used to construct the user-item interaction adjacency matrix, capturing the complex relationships between users and products. The Cross-Attention module dynamically focuses on the most important parts of its two inputs by calculating attention weights, thereby achieving deep feature fusion. The module receives two inputs: the item embedding representation (I_G_Embeddings), which incorporates user features and user-item interactions, and the item embedding representation (H), which fuses text and image features. During training, the algorithm minimizes a composite loss comprising the BPR loss, Fusion loss, Regularization loss, and Contrastive loss to optimize the model parameters and improve recommendation accuracy. Finally, the CFMM algorithm outputs optimized user and product embedding representations and generates personalized recommendation lists from them.
[Figure omitted. See PDF.]
Multimodal information improves recommendation accuracy, but the fusion method needs to be improved. This study introduces a cross-fusion activated multimodal ensemble (CFMM) for recommendation. Main contributions: 1) The cross-attention fusion model deepens the interaction between multimodal items and users. 2) The fusion loss optimizes the multimodal feature embedding of items. 3) The CFMM integrates four modal loss functions.
The remainder of this paper is organized as follows: The feature enhanced multimodal recommendation method is introduced in Section 2. The experimental details and dataset preparation are presented in Section 3. The recommendation results of the different methods and ablation studies on Amazon public datasets are presented and analyzed in Section 4. The conclusions of this study are summarized in Section 5.
Methodology
Overall structure of CFMM method
A recommendation method that combines multimodal information for e-commerce product recommendations was proposed by Zhang et al. [12]. They calculated the similarity of text and image information between products, thereby constructing a product similarity graph, and combined it with the user-item interaction graph using different collaborative filtering models to complete product recommendations.
[Figure omitted. See PDF.]
This study adopts this idea and argues that there is still room for improvement in fusing the features of the different modalities with the user-item interaction information. Accordingly, a cross-fusion multimodal feature fusion algorithm is proposed for multimodal product recommendation systems.
The framework of the CFMM algorithm is shown in Fig 2, which presents the core components and processes of the algorithm. The Embedding Layer maps the original IDs of users and items, as well as the image and text features, into low-dimensional dense vectors. Graph Convolution propagates the text and image item embeddings over the text and image similarity graphs, respectively, and updates the embedding representations using the interactions between users and items. The Fusion part fuses the image and text embeddings by calculating attention weights, yielding the fused item embedding representation H. The Split part separates the adjacency matrix that carries the user-item interactions into an item embedding representation and a user embedding representation, each retaining the interaction information. The Cross-Attention part calculates the importance between the fused multimodal item features H and the interaction-aware item features (Items Embedding), and then updates the item embedding representation according to the calculated weights, thereby deciding whether to pay more attention to the user-item interactions or to the similarity between items. Finally, during network training, a composite loss function that incorporates the fused-feature (Fusion) loss is designed to better capture the complex relationships between users and items and the correlations among multimodal features, so that the learned products can be recommended to users.
[Figure omitted. See PDF.]
The CFMM method diagram presents a complete system that builds an adjacency matrix from product text and image features combined with user-item interaction information, uses the Cross-Attention module to achieve deep fusion of multimodal features, optimizes the model with the BPR, Fusion, Regularization, and Contrastive loss functions, and finally outputs optimized user and item embedding representations to improve the performance of the multimedia recommendation system.
Construction of item-item adjacency matrix
The cross-fusion-based multimodal recommendation method follows the concept of Zhang et al. [12]. The text and image information of the items, as well as the user-item interaction information, are embedded into representation vectors. To explore the correlation between items, the cosine similarity between the multimodal features of the items is calculated. A sparse graph adjacency matrix is constructed using k-nearest neighbors (kNN) [25], and the resulting similarity adjacency matrix is normalized [12]. The cosine similarity matrix is calculated as follows:
$S^m_{ij} = \frac{(e_i^m)^{\top} e_j^m}{\lVert e_i^m\rVert\,\lVert e_j^m\rVert}$ (1)

where $e_i^m$ denotes the raw feature vector of item $i$ in modality $m$. Applying kNN sparsification and normalization to $S^m$ yields the similarity adjacency matrix $\hat{S}^m$. The raw modality features are also transformed into high-dimensional features, and this transformation is included in network training so that the network learns to extract effective features autonomously. Repeating the above steps on the transformed features yields the learned similarity adjacency matrix $\tilde{S}^m$ of the items. However, Zhang et al. [12] noted that the similarity adjacency matrix constructed from the original features still carries rich information about the initial graph structure. Therefore, a skip connection is used to superimpose the original similarity adjacency matrix onto the learned adjacency matrix to stabilize training. The adjacency matrix of the multimodal information for an item is obtained using (2).
$A^m = \lambda \hat{S}^m + (1-\lambda)\tilde{S}^m$ (2)

where $\lambda$ is the proportional coefficient of the skip connection, a hyperparameter used to adjust the proportion of the original-feature graph, and $A^m$ is the final adjacency matrix of modality $m$. After obtaining the item adjacency matrix $A^m$, we followed Zhang et al. [12], Wu et al. [26], and He et al. [6] and stacked multiple graph convolutional layers to capture the high-order item-item relationships. In the $l$-th layer, the item embedding matrix is updated using (3).
$H^m_l = A^m H^m_{l-1}$ (3)

where $H^m_l$ is the $l$-th layer item-embedding matrix of modality $m$, whose $i$-th row is the embedding vector of item $i$. The initial input embedding matrix $H^m_0$ is the ID embedding matrix of the items, shared by all modalities. Thus, $H^m_l$ represents the high-order item-item embedding matrix of modality $m$ obtained after $l$ layers of propagation; these modality-specific embeddings are subsequently fused into the H output by the fusion embedding step in Fig 2.
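For illustration, the construction in (1)–(3) can be sketched in PyTorch as follows. The function and argument names (`build_modal_adjacency`, `propagate`, `lambda_skip`) are ours rather than the authors' released code, and dense matrices are used for clarity, whereas a practical implementation would keep the kNN graph sparse.

```python
import torch
import torch.nn.functional as F

def build_modal_adjacency(raw_feats, learned_feats, k=10, lambda_skip=0.8):
    """Sketch of Eqs. (1)-(2): cosine similarity -> kNN sparsification ->
    normalization -> skip connection with the original-feature graph."""
    def knn_normalized(feats):
        unit = F.normalize(feats, dim=1)
        sim = unit @ unit.T                                    # Eq. (1): cosine similarity
        topk, idx = torch.topk(sim, k, dim=1)                  # keep the k nearest neighbors
        adj = torch.zeros_like(sim).scatter_(1, idx, topk)
        deg = adj.sum(dim=1).clamp(min=1e-12)
        d_inv_sqrt = deg.pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)  # normalize

    s_orig = knn_normalized(raw_feats)         # graph from the raw modality features
    s_learned = knn_normalized(learned_feats)  # graph from the trained high-dimensional features
    return lambda_skip * s_orig + (1.0 - lambda_skip) * s_learned       # Eq. (2)


def propagate(adj, item_id_emb, n_layers=2):
    """Eq. (3): stacked graph convolution H_l = A^m H_{l-1} without transformations."""
    h = item_id_emb
    for _ in range(n_layers):
        h = adj @ h
    return h
```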
Cross-attention module
The Cross-Attention module is an attention mechanism that allows the model to dynamically focus on the most relevant parts of one sequence while processing another. In the recommendation system, it mainly serves to deeply integrate multimodal features. In addition to multimodal feature fusion, Cross-Attention also considers the interaction information between users and items: it calculates attention weights and dynamically focuses on the correlation between item features and user-item interaction features. By modeling the interaction between user features and the multimodal features of items, the model can capture users' preferences for different aspects of the items. This fusion considers not only the attributes of the item itself but also the interaction information between users and items, allowing the recommendation system to understand user interests and item characteristics more comprehensively and thus provide more accurate and personalized recommendation services. In this study, the Cross-Attention module fuses two types of information: an embedding representation containing the original item ID features and the user-item interaction relationships, and an embedding representation containing the item's text and image information and the correlations between items.
Input preparation.
As shown in Fig 3, the Cross-Attention module receives two tensors as inputs: I_G_Embeddings and H. I_G_Embeddings is the item embedding representation enhanced by multi-layer graph convolution; it contains not only the characteristics of the item itself but also information from the users and items that interact with it, and thus represents the item more comprehensively within the recommendation system. By contrast, H is the comprehensive representation obtained by fusing image and text features. First, the original image and text embedding vectors are reduced from their original dimensions (4096 and 1024, respectively) to 64 dimensions. For the reduced image and text features, k-nearest neighbor (kNN) [25] graphs are constructed to represent the similarity relationships among the image features and among the text features. Each learned kNN graph is then combined with the corresponding graph built from the original features through a weighted fusion controlled by lambda_coeff; this step combines the characteristics of the original graph and the learned graph to obtain more comprehensive similarity information. Next, the image and text kNN graphs are used to update the item embedding representations: through multi-layer graph convolution, the image and text similarity information is integrated into the item embeddings. The image and text item embeddings are then fused by computing, for each item, attention scores over the image and text modalities; the softmax function converts these scores into probability distributions that reflect the importance of the image and text information to each item. Finally, according to these attention weights, the image and text item embeddings are summed with weights to obtain the fused representation H, which takes both modalities into account and weights them by importance. Before H is passed to the Cross-Attention module, it is further normalized with the L2 operator so that all feature vectors have unit norm, which stabilizes the subsequent attention computation.
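A minimal sketch of this modality-fusion step is given below, assuming a shared linear scorer followed by a softmax over the two modalities; the class and attribute names (`ModalityFusion`, `scorer`) are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

class ModalityFusion(torch.nn.Module):
    """Fuse image- and text-propagated item embeddings into H with learned attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.scorer = torch.nn.Linear(dim, 1)  # shared modality scorer (an assumption)

    def forward(self, h_image, h_text):
        # h_image, h_text: (num_items, dim) item embeddings after graph convolution
        scores = torch.cat([self.scorer(h_image), self.scorer(h_text)], dim=1)  # (num_items, 2)
        weights = torch.softmax(scores, dim=1)        # importance of image vs. text per item
        h = weights[:, 0:1] * h_image + weights[:, 1:2] * h_text  # weighted sum
        return F.normalize(h, p=2, dim=1)             # L2-normalize H before Cross-Attention
```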
[Figure omitted. See PDF.]
Cross-attention module network structure and working principle used for product feature fusion, including input preparation, projection layer, scaled dot product attention, softmax normalization, and weighted sum of values.
Cross-attention for product information fusion.
The core formula of cross-attention is based on a standard attention mechanism [27]; here, however, it is applied to cross-modal feature fusion. As shown in Fig 3, given the interaction-aware embedding X, which contains user features and user-item interactions, and the item embedding H, which fuses image and text features, the cross-attention mechanism is calculated as follows:
$\mathrm{CrossAttn}(X, H) = \mathrm{softmax}\!\left(\frac{(XW_Q)(HW_K)^{\top}}{\sqrt{d_k}}\right)(HW_V)\,W_O$ (4)

where $X \in \mathbb{R}^{B \times L \times d}$ is the tensor containing user features and user-item interactions, $H \in \mathbb{R}^{B \times L \times d}$ is the fusion tensor of item text and image features, $B$ is the batch size, $L$ is the sequence length, $d$ is the feature dimension, and $d_k$ is the dimension of the key vectors, used to scale the dot product and prevent the gradients from vanishing. $W_Q$ is the query projection matrix, $W_K$ the key projection matrix, $W_V$ the value projection matrix, and $W_O$ the output projection matrix. In (4), $W_Q$ and $W_K$ project the input tensors $X$ and $H$ into the query and key spaces, respectively; the dot product of the query and the key is computed and scaled by $\sqrt{d_k}$; the softmax function converts the scaled scores into attention weights; and the values are summed, weighted by these attention weights. The result is the attention-weighted output of the Cross-Attention mechanism. The two sets of features are then combined according to the calculated weights to obtain the final fused item features Fused_items.
Projection Layer: The input tensors $X$ and $H$ are projected onto the query, key, and value spaces using separate linear layers. $H$ is projected as the key ($K$) and value ($V$), whereas $X$ is projected as the query ($Q$). The projected tensors are then reshaped.
Scaled Dot Product Attention: The attention weights are calculated by taking the scaled dot product between the query ($Q$) and the key ($K$).
Linear Layer Output Cross Weights: The cross-attention scores are further projected through another linear layer to obtain a cross-weighted representation for the recommendation prediction or ranking task.
Weighted Sum of Values: The cross-attention weights are then applied to the values ($V$), and the weighted sum of the two sets of product feature vectors gives the final output of the Cross-Attention module; the contribution of each set is determined by the cross-attention calculation (see the sketch after this list).
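The following single-head PyTorch sketch puts these four steps together, assuming, as described above, that I_G_Embeddings supplies the query and H supplies the key and value; the class name, the residual combination in the last line, and other details are illustrative choices rather than the authors' exact implementation.

```python
import math
import torch

class CrossAttention(torch.nn.Module):
    """Sketch of Eq. (4): fuse interaction-aware item embeddings with multimodal ones."""
    def __init__(self, dim=64):
        super().__init__()
        self.w_q = torch.nn.Linear(dim, dim, bias=False)  # query projection W_Q
        self.w_k = torch.nn.Linear(dim, dim, bias=False)  # key projection W_K
        self.w_v = torch.nn.Linear(dim, dim, bias=False)  # value projection W_V
        self.w_o = torch.nn.Linear(dim, dim, bias=False)  # output projection W_O

    def forward(self, i_g_embeddings, h):
        # i_g_embeddings: item embeddings carrying user-item interactions (query side)
        # h: item embeddings fused from text and image features (key/value side)
        q, k, v = self.w_q(i_g_embeddings), self.w_k(h), self.w_v(h)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # scaled dot product
        attn = torch.softmax(scores, dim=-1)                      # cross-attention weights
        fused = self.w_o(attn @ v)                                # weighted sum of values
        return i_g_embeddings + fused    # one plausible way to blend the two streams
```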
Loss function
In this study, to optimize the performance of the recommendation model, we designed a comprehensive loss function that integrates multiple loss terms, including BPR [4], fusion, regularization, and contrastive loss [12].
BPR Loss: The Bayesian personalized ranking loss [4] is a commonly used loss function in recommendation systems for measuring the prediction accuracy of the interaction strength between user and item embeddings. This is mathematically expressed in (5):
$\mathcal{L}_{BPR} = -\sum_{(u,i,j)\in\mathcal{B}} \ln \sigma\!\left(\hat{y}_{ui} - \hat{y}_{uj}\right)$ (5)

where $\mathcal{B}$ represents the training batch; $u$ represents the user index; $i$ and $j$ represent the indexes of the positive and negative items, respectively; $\sigma(\cdot)$ represents the sigmoid function; $e_u$, $e_i$, and $e_j$ represent the embedding vectors of user $u$, positive item $i$, and negative item $j$, respectively; $\hat{y}_{ui} = e_u^{\top} e_i$ represents the preference score of user $u$ for the positive sample item $i$; and $\hat{y}_{uj} = e_u^{\top} e_j$ represents the preference score of user $u$ for the negative sample item $j$.
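A compact PyTorch rendering of (5) might look as follows (function and variable names are ours):

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """Eq. (5): encourage higher scores for interacted (positive) items."""
    pos_scores = (user_emb * pos_item_emb).sum(dim=-1)   # \hat{y}_ui
    neg_scores = (user_emb * neg_item_emb).sum(dim=-1)   # \hat{y}_uj
    return -F.logsigmoid(pos_scores - neg_scores).mean()
```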
Fusion Loss: In recommendation systems, the quality of the embedding vectors (such as user and item embeddings) is crucial to model performance [28]. Although traditional matrix factorization methods are effective, they often fail to capture the complex relationships between embedding vectors and the effects of multimodal information fusion [29–31]. The Fusion Loss is inspired by the BPR Loss, which constructs the objective from the relative positions of items in the user's preference ranking, thereby avoiding the need for accurate item preference labels. This design bypasses the reliance on explicit labels and instead focuses on the relative ranking relationship between items, effectively optimizing the ranking performance of the recommendation system. Based on this idea, we propose the Fusion Loss, which also relies on relative ranking but focuses on the role of the fused features in product ranking. Specifically, the text and image features of the items are fused, and ranking feedback derived from user preferences guides the network to extract more expressive item features: the fused features of products that a consumer likes should be ranked ahead of those of products the consumer dislikes, so that the products preferred by the consumer can be identified from the fused product features.
In this study, the Fusion Loss was specifically designed for recommendation systems. It measures the contribution of the fused multimodal item information to recommendation accuracy. The Fusion Loss encourages a user and the items the user has interacted with to be clustered more closely in the embedding space, while keeping a certain distance from items the user has not interacted with. We call the interacted items positive samples and the non-interacted items negative samples. By increasing the distance between positive and negative samples, the accuracy of the multimodal fusion is improved, and thus the quality of the recommendations.
As shown in Fig 4, the Fusion Loss first receives, through the input layer, the fused product features and the user embedding vector carrying the user-item interactions. In the interaction calculation layer, the dot product of the user embedding vector with the positive/negative fused item embedding vector is calculated to obtain the positive/negative interaction score, which reflects the user's potential interest in different items. In the loss calculation layer, the difference between the positive and negative interaction scores is transformed with the LogSigmoid function, and the final Fusion Loss is obtained by taking the negative average. Minimizing the Fusion Loss both incorporates the influence of the fused product representation on the user's purchase intention and constrains the model to accurately extract the image and text features of the products, thereby improving the performance of the recommendation system.
[Figure omitted. See PDF.]
To optimize interaction prediction accuracy between users and multimodal fusion item information.
The core idea is to optimize the model by penalizing the difference in dot product scores between user and fused item embeddings. Specifically, we calculate the difference in dot product scores between user embedding and fused positive and negative item embeddings, and balance their contribution to the total loss by multiplying them by a small coefficient, which is mathematically expressed in (6):
$\mathcal{L}_{Fusion} = -\lambda_F \cdot \frac{1}{|\mathcal{B}|}\sum_{(u,i,j)\in\mathcal{B}} \ln \sigma\!\left(e_u^{\top} h_i - e_u^{\top} h_j\right)$ (6)

where $h_i$ represents the fusion embedding vector of item $i$, and $\lambda_F$ is the weight coefficient of the Fusion Loss, used to balance the impact of the different loss functions on network training.

In minimizing this loss function, the expectation is that the preference score of user $u$ for the positive sample item $i$, computed from the fused embedding, will be as high as possible compared with the preference score for the negative sample item $j$. This optimizes the embedding vectors of users and items, especially the fused item embedding vectors $h$, so that the recommendation system predicts user preferences more accurately.
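Under the formulation in (6), the Fusion Loss can be sketched as follows; `fused_pos` and `fused_neg` denote the fused multimodal embeddings of the positive and negative items, and `lambda_f` corresponds to λF (set to 1×10⁻⁵ in our experiments):

```python
import torch
import torch.nn.functional as F

def fusion_loss(user_emb, fused_pos, fused_neg, lambda_f=1e-5):
    """Eq. (6): BPR-style ranking loss computed on the fused multimodal item embeddings."""
    pos_scores = (user_emb * fused_pos).sum(dim=-1)
    neg_scores = (user_emb * fused_neg).sum(dim=-1)
    return -lambda_f * F.logsigmoid(pos_scores - neg_scores).mean()
```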
Regularization Loss: As a commonly used technique in recommender systems, the Regularization Loss prevents model overfitting and improves generalization ability [32]. Overfitting usually occurs when the model is so complex that it starts to learn the noise in the training data rather than the true data pattern. Regularization avoids overfitting by imposing a penalty term on the model parameters to confine the complexity of the model [33], which is mathematically expressed in (7):
$\mathcal{L}_{Reg} = \lambda_R \cdot \frac{1}{|\mathcal{B}|}\sum_{(u,i,j)\in\mathcal{B}} \left(\lVert e_u\rVert_2^2 + \lVert e_i\rVert_2^2 + \lVert e_j\rVert_2^2\right)$ (7)

where $\lVert\cdot\rVert_2$ represents the L2 norm; $\lambda_R$ is a decay factor used to control the weight of the regularization term; $\lVert e_u\rVert_2^2$, $\lVert e_i\rVert_2^2$, and $\lVert e_j\rVert_2^2$ are the squared L2 norms of the user, positive-item, and negative-item embedding vectors, respectively; and the batch average is multiplied by $\lambda_R$ to control the weight of the regularization term.
Contrastive Loss: A loss function proposed by Zhang et al. [12], contrastive loss constrains multimodal information fusion. This study applies contrastive loss to the recommendation system to improve the model’s ability to capture item features (such as images and text) and enhance the prediction accuracy of user-item interactions. This is mathematically expressed in (8):
$\mathcal{L}_{CL} = -\lambda_C \sum_{i\in\mathcal{B}} \log \frac{\exp\!\left(s(h_i, h_i^{+})/\tau\right)}{\sum_{j\in\mathcal{B}} \exp\!\left(s(h_i, h_j)/\tau\right)}$ (8)

where $i$ indexes the current sample in the batch; $h_i^{+}$ is the positive sample corresponding to $h_i$; $s(h_i, h_i^{+})$ is the similarity between the current sample $h_i$ and its positive sample $h_i^{+}$; $\tau$ is a temperature parameter used to control the smoothness of the similarity scores; and $\lambda_C$ is used to control the effect of the contrastive auxiliary task.
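The contrastive term follows Zhang et al. [12]; a sketch under the common in-batch-negatives reading of (8), with `modal_emb` a modality-specific item embedding and `fused_emb` the fused embedding, is:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(modal_emb, fused_emb, tau=0.5, lambda_c=0.03):
    """Eq. (8): pull each modality-specific embedding toward the fused embedding
    of the same item, pushing it away from other items in the batch."""
    a = F.normalize(modal_emb, dim=-1)
    b = F.normalize(fused_emb, dim=-1)
    logits = a @ b.T / tau                               # temperature-scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives sit on the diagonal
    return lambda_c * F.cross_entropy(logits, targets)
```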
The overall loss function used in this study is expressed as follows:

$\mathcal{L} = \lambda_B \mathcal{L}_{BPR} + \mathcal{L}_{Fusion} + \mathcal{L}_{Reg} + \mathcal{L}_{CL}$ (9)

where $\lambda_B$ is the weight coefficient of the BPR loss, and the Fusion, Regularization, and Contrastive terms already include their respective weight coefficients $\lambda_F$, $\lambda_R$, and $\lambda_C$ from (6)–(8).
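Putting the terms together as in (9), and reusing the loss sketches above, the overall objective can be assembled as follows (a sketch; the regularization term implements (7), and the default weights match the settings reported in the Implementation details):

```python
def total_loss(user_emb, pos_emb, neg_emb, fused_pos, fused_neg,
               modal_emb, fused_emb, lambda_b=1.0, lambda_r=1e-5):
    """Eq. (9): weighted combination of BPR, Fusion, Regularization, and Contrastive losses."""
    l_bpr = bpr_loss(user_emb, pos_emb, neg_emb)               # Eq. (5)
    l_fusion = fusion_loss(user_emb, fused_pos, fused_neg)     # Eq. (6), includes lambda_F
    l_reg = lambda_r * (user_emb.pow(2).sum(-1) + pos_emb.pow(2).sum(-1)
                        + neg_emb.pow(2).sum(-1)).mean()       # Eq. (7)
    l_cl = contrastive_loss(modal_emb, fused_emb)              # Eq. (8), includes lambda_C
    return lambda_b * l_bpr + l_fusion + l_reg + l_cl
```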
Experiments
Data preprocessing
This experiment uses the Amazon Product datasets, including (a) Clothing, Shoes and Jewelry, (b) Sports and Outdoors, and (c) Baby [12,34]. These datasets [35] include 4096-dimensional product image features, 1024-dimensional product text features, and user-product interaction information. The product text is constructed from the title, description, category, and brand fields.
First, we read the compressed JSON file and store the metadata and user-item interaction data in an uncompressed JSON file. We extract unique user and item IDs from the review data, assign different identifiers to them, and establish a mapping relationship between users, items, and their respective IDs. This mapping can be used to construct a user-item interaction dictionary to obtain a normalized adjacency matrix suitable for graph neural networks. The adjacency matrix clarifies the interaction relationship between users and items, providing key information for subsequent user behavior analysis and recommendation system modeling.
In addition, for each user we paired the products the user had reviewed (positive pairs) with randomly sampled items the user had not interacted with (negative pairs), and divided the interactions into an 80% training set, a 10% validation set, and a 10% test set. The SentenceTransformer library [36] with a pre-trained BERT model was used to efficiently extract feature vectors from the text data. We also performed data cleaning and read the pre-extracted product image features from binary files. If a product had no image features, the average of all product image features was used as a substitute, ensuring that every product has a corresponding image feature vector and maintaining data integrity and consistency across the dataset.
Amazon product data was ultimately converted into user-item interaction data, product text features, and product image features.
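A schematic of these preprocessing steps is shown below; the split helper, the checkpoint name passed to SentenceTransformer, and the function names are placeholders for illustration, not the exact preprocessing script.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def split_interactions(user_items, seed=42):
    """Split each user's interacted items into 80% train / 10% validation / 10% test."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for user, items in user_items.items():
        items = list(items)
        rng.shuffle(items)
        n = len(items)
        train += [(user, i) for i in items[: int(0.8 * n)]]
        val += [(user, i) for i in items[int(0.8 * n): int(0.9 * n)]]
        test += [(user, i) for i in items[int(0.9 * n):]]
    return train, val, test

def extract_text_features(item_texts, model_name="bert-base-nli-mean-tokens"):
    """Encode concatenated title/description/category/brand text into dense vectors."""
    model = SentenceTransformer(model_name)  # checkpoint name is a placeholder
    return model.encode(item_texts, convert_to_numpy=True)

def fill_missing_image_features(image_feats, has_image):
    """Replace missing 4096-d image features with the mean over items that have images."""
    mean_feat = image_feats[has_image].mean(axis=0)
    image_feats[~has_image] = mean_feat
    return image_feats
```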
Baselines
To evaluate the effectiveness of the proposed model, we compared it with multiple state-of-the-art recommendation models. These baseline models are divided into collaborative filtering methods (i.e., ItemKNN, MF, NGCF, LightGCN, and SGL) and deep content-aware recommendation models (i.e., VBPR, MMGCN, GRCN, and MICRO).
* ItemKNN [37]: An item-based collaborative filtering method that recommends items to users by calculating the similarity between items.
* MF [4]: A matrix factorization method that uses Bayesian personalized ranking (BPR) loss to optimize the user-item interaction matrix, utilizing only direct user-item interaction information.
* NGCF [5]: Neural graph collaborative filtering learns embedded representations of users and items by constructing a user-item bipartite graph and utilizing a graph convolutional network to capture the high-order connectivity between users and items.
* LightGCN [6]: A lightweight GCN simplifies the design of the GCN, removes feature conversion and nonlinear activation, and retains only two core components: graph convolution and layer combination, which effectively improves recommendation performance.
* SGL [7]: Self-supervised graph learning enhances node representation learning by generating multiple views of a node and maximizing the consistency between different views of the same node.
* VBPR [13]: Based on the BPR model, the visual features and ID embeddings of the items are combined and input into the matrix factorization framework as a representation of the items to improve recommendation performance.
* MMGCN [14]: A multimodal graph convolutional network that builds a specific user-item interaction graph for each modality, captures modality-specific user preferences and item representations through graph convolution operations, and finally aggregates all modality-specific representations to predict user and item interactions.
* GRCN [15]: Graph-refining convolutional network that refines the user-item interaction graph in multimodal recommendation systems by identifying pseudo-positive feedback and pruning the corresponding noisy edges.
* MICRO [12]: Latent structure mining and contrastive modality fusion method mines the latent item relations hidden in multimodal features and designs a multimodal contrast framework to promote fine-grained multimodal fusion, thus enhancing the recommendation effect.
Evaluation protocols
To rigorously evaluate the performance of our multimedia recommendation model, we adopt three main evaluation metrics: Recall@K [38,39], NDCG@K [40–42], and Precision@K [2,43] where K is set to 20. These metrics provide comprehensive insights into the recall, ranking accuracy, and precision of the model, respectively.
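For reference, Recall@K and NDCG@K for a single user can be computed as in the following sketch (Precision@K is analogous); `ranked_items` is the model's ranked recommendation list and `ground_truth` the user's held-out interactions:

```python
import math

def recall_at_k(ranked_items, ground_truth, k=20):
    """Fraction of the user's held-out items that appear in the top-K list."""
    hits = len(set(ranked_items[:k]) & set(ground_truth))
    return hits / max(len(ground_truth), 1)

def ndcg_at_k(ranked_items, ground_truth, k=20):
    """Discounted gain of hits in the top-K list, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in ground_truth)
    ideal_hits = min(len(ground_truth), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```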
The sampling process is designed to ensure the balance and representativeness of the evaluation data. Users are randomly sampled from the existing user base; for each sampled user, a positive item is randomly selected from the user's interaction history and a negative item is randomly selected from the items the user has not interacted with. Ensuring that each user contributes exactly one positive and one negative item is critical to maintaining the balance between positive and negative samples during evaluation.
By leveraging these evaluation metrics and the carefully designed sampling process, we are able to conduct a comprehensive and rigorous evaluation of the performance of our multimedia recommendation model.
Implementation details
The PyTorch framework [44] was used to build our model, and the embedding dimension of all compared models was set to 64. The models were trained using the AdamW optimizer [45] with a batch size of 1024. The Xavier initialization strategy [46] was used to initialize the model parameters and promote effective learning. To determine the optimal hyperparameter combination, a detailed grid search was performed on the validation set. The learning rate was set to 0.0005. In the kNN sparsification step, k was set to 10 to balance information retention and computational efficiency. In addition, the Lambda Coefficient λ of the skip connection was set to 0.8 to balance preserving the original graph structure information against introducing the learned graph structure information. The temperature parameter τ was set to 0.5 to adjust the similarity calculation in the contrastive loss. The weight coefficient of the BPR Loss was set to 1, that of the Regularization Loss to 1×10⁻⁵, that of the Contrastive Loss to 0.03, and that of the Fusion Loss to 1×10⁻⁵. To avoid overfitting during training, an early stopping strategy was adopted: if Recall@20 on the validation set failed to improve within 10 consecutive training epochs, training was terminated early. This strategy ensures the generalization ability of the model while reducing unnecessary consumption of computing resources.
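The optimizer and early-stopping settings described above can be summarized in code as follows (a configuration sketch, not the released training script):

```python
import torch

def make_optimizer(model, lr=5e-4):
    """AdamW optimizer with the learning rate selected by grid search."""
    return torch.optim.AdamW(model.parameters(), lr=lr)

class EarlyStopping:
    """Stop training if validation Recall@20 has not improved for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad_epochs = patience, float("-inf"), 0

    def step(self, recall_at_20):
        if recall_at_20 > self.best:
            self.best, self.bad_epochs = recall_at_20, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True -> terminate training
```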
Results
Performance comparison
In this section, we comprehensively compare the performance of the proposed CFMM model with those of the baseline MICRO and other advanced methods. Specifically, we focus on evaluating the overall performance under different categories to demonstrate the effectiveness and improvement provided by CFMM.
Table 1 presents the overall performance comparison results of the proposed CFMM model with various state-of-the-art multimedia recommendation methods and other collaborative filtering (CF) baselines on three real datasets.
[Figure omitted. See PDF.]
Performance comparison of different learning rates
In this section, we used three real-world datasets, Clothing, Sports, and Baby, to train the CFMM model, and set three different learning rates (0.005, 0.0005, and 0.00005) to observe the specific impact of the learning rate on the model performance.
Three learning rates (0.005, 0.0005, and 0.00005) were selected to explore the impact of the learning rate on convergence speed, training stability, and generalization ability. The largest learning rate (0.005) converges quickly but is prone to overfitting, the smallest (0.00005) converges slowly and learns poorly, and the intermediate value (0.0005) achieves the best balance between convergence speed and model performance, ensuring stable training and good generalization.
As shown in Fig 5, to comprehensively evaluate the model under different learning rates, we use the training loss and Recall@20 as the main evaluation indicators in this section. The training loss measures how well the model fits the data, and Recall@20 measures the proportion of products that users have actually interacted with among the top 20 recommended products.
[Figure omitted. See PDF.]
The performance of the CFMM method in terms of training loss and validation recall under three different learning rates.
Training loss analysis.
The following is a detailed analysis and conclusion summary of the training loss performance of CFMM in Fig 5 under three different learning rates.
Learning rate of 0.005: The training loss of the CFMM method on all three datasets began to decline rapidly within 20 epochs. Owing to the fast convergence, training finished within 100 epochs, and the final training loss was relatively small. However, despite the good performance at this learning rate, there may be a risk of overfitting, because excessively fast convergence can cause the model to fit the training data well but generalize poorly to the test data.
Learning rate of 0.0005: The training loss on all three datasets decreased steadily within 50 epochs and stabilized after 100 epochs, showing a stable convergence process. Compared with the other two learning rates, this setting produced the most training epochs, ensuring that the model had enough training time. The final training loss was similar to that obtained with a learning rate of 0.005, reaching a small value while being more stable. The performance at this learning rate is the most desirable: it maintains a fast convergence speed while ensuring sufficient training and a small, stable training loss, and the stable convergence process also helps improve the generalization ability of the model.
Learning rate of 0.00005: The training loss on the three datasets decreased very slowly, dropping by only about 20 within 100 epochs. Because of the slow convergence, training stopped at around 100 epochs. The final training loss was larger than with the other two learning rates, indicating that the model learns poorly at this rate.
In summary, with a learning rate of 0.0005, the CFMM method performs best on the three datasets: it maintains a fast convergence speed while ensuring sufficient training and a small training loss.
Recall analysis of validation set.
The following is a detailed analysis and conclusion summary of the validation set Recall performance of CFMM under three different learning rates in Fig 5.
Learning rate of 0.005: On the larger datasets (Clothing and Sports), the validation recall of the CFMM algorithm initially rose, then dropped sharply after reaching a peak, almost returning to its initial level. This indicates that the model quickly learned some features at the beginning of training but then overfitted, causing a sharp drop in validation performance. On the small dataset (Baby), the validation recall showed a downward trend throughout, indicating that this learning rate is too high for small datasets and prevents the model from learning effectively.
Learning rate of 0.0005: On all datasets (Clothing, Sports, and Baby), the validation recall of the CFMM algorithm is relatively stable and high. On the larger datasets, the validation recall quickly rebounds after an initial decline and reaches a stable state; on the smaller dataset, it rises slowly and then stabilizes. In all cases, this learning rate pushes the validation recall to a level not reached by the other learning rates, indicating better generalization ability.
Learning rate of 0.00005: On all datasets, the validation recall was initially low but rose rapidly within 50 epochs. After reaching its highest point, however, it began to decline slowly on the larger datasets and gradually leveled off on the smaller dataset. This is likely because the low learning rate caused the model to learn too slowly and prevented it from fully converging to a good solution during training.
In summary, when the learning rate is 0.0005, the CFMM algorithm has the best Recall performance on the validation set, is relatively stable and has good generalization ability.
Comparison with and without fusion loss
To verify the impact of Fusion Loss proposed in this study on CFMM performance, experiments were conducted on three widely used real-world datasets (Clothing, Sports, and Baby), and the performance of the model was tested with and without Fusion Loss. As a key innovation of CFMM, Fusion Loss aims to further improve the accuracy of recommendation systems by optimizing the multimodal feature fusion process. Its location is shown in Fig 6. The evaluation indicators used were Recall@K and NDCG@K, where K was set to 20 to evaluate the performance of the model at the top of the recommendation list.
[Figure omitted. See PDF.]
The red stars mark the specific locations of the Cross-Attention and Fusion Loss modules, and the yellow stars mark the specific locations of the four loss parameters and the Lambda coefficient.
The experimental results are shown in Table 2, which shows the Recall@20 and NDCG@20 scores obtained by the CFMM model with and without Fusion Loss on three datasets.
[Figure omitted. See PDF.]
The results demonstrate that introducing the Fusion Loss improves the performance of the CFMM model on all three datasets. Specifically, for the Clothing dataset, Recall@20 increased by 0.16% and NDCG@20 by 0.06% when the Fusion Loss was included. For the Sports dataset, Recall@20 increased by 0.05% and NDCG@20 by 0.13%. Similarly, for the Baby dataset, Recall@20 and NDCG@20 improved by 0.56% and 0.74%, respectively. These improvements suggest that the Fusion Loss effectively optimizes the multimodal feature fusion process, enabling the CFMM model to generate more accurate and relevant recommendations.
In summary, the experimental results provide evidence that the proposed Fusion Loss plays an important role in enhancing the performance of the CFMM model. On large-scale datasets, the Fusion Loss improves the quality of the fused features by optimizing the fusion process of the multimodal features: with more data, the model has more opportunities to learn effective ways of fusing the different modal features and thus produces more accurate and useful fused representations. On small-scale datasets, the model is prone to overfitting the limited data; by introducing relative ranking constraints on the fused features, the Fusion Loss helps prevent overfitting to the training data and improves generalization to unseen data. By integrating the Fusion Loss into the model, we obtain consistent improvements in both recall and ranking precision on multiple real-world datasets.
Comparison with and without cross-attention
In this section, we explore the performance differences of the CFMM model with and without the Cross Attention module to comprehensively evaluate the impact of cross-attention on model recommendation performance. Its location is shown in Fig 6. The experimental results are based on three widely used real-world datasets: Clothing, Sports, and Baby. Recall @K and NDCG@K were used as evaluation indicators, with K set to 20, to evaluate the model performance.
The Cross-Attention module captures the interaction between the multimodal features of items and the users, thereby achieving more effective feature fusion and recommendation, which has a positive impact on the performance of CFMM on the Clothing and Sports datasets. On large-scale datasets, product information usually covers multiple modalities and the information within each modality is rich; the Cross-Attention mechanism can make full use of these features and achieve deep feature fusion by calculating attention weights between the modal features, thereby capturing more complex and detailed user preferences and product characteristics. However, the performance on the Baby dataset suggests that, in practice, the model structure and parameter settings should be adjusted according to the characteristics of the dataset to achieve the best recommendation effect. The applicability of the two modules in the recommendation system lies in deepening the multimodal feature interaction and optimizing feature extraction, which is also demonstrated by the experimental results.
The experimental results are shown in Table 2. On the large-scale datasets (Clothing and Sports), the CFMM model with the cross-attention mechanism improves both Recall@20 and NDCG@20. For the Clothing dataset, adding Cross-Attention increased Recall@20 by 1.00% and NDCG@20 by 0.25%; for the Sports dataset, Recall@20 increased by 0.39% and NDCG@20 by 0.22%. This further verifies the effectiveness of cross-attention in processing complex datasets, especially in improving the accuracy and ranking capability of recommendation systems.
Notably, on the Baby dataset, the Cross-Attention module did not improve Recall@20 and NDCG@20; both decreased slightly. This may be due to the limited size of the Baby dataset, as well as its data characteristics and distribution. This result shows that the Cross-Attention module may perform differently on different types of datasets and needs to be adjusted and optimized according to the characteristics of the specific dataset.
In summary, the Cross-Attention module can significantly improve the recommendation performance of the CFMM model; however, the performance differs on different datasets. In the future, more in-depth research should be conducted on optimization and adjustments based on the characteristics of specific datasets to further improve the accuracy and efficiency of recommendation systems.
Comparison of different fusion loss parameters
To explore the specific impact of the Fusion Loss weight λF on the performance of the CFMM model, we conducted detailed experiments on the Clothing, Sports, and Baby datasets and tested the model under different Fusion Loss weights (1×10⁻³, 1×10⁻⁵, and 1×10⁻⁷), aiming to fully characterize the impact of this parameter on model performance and verify the effectiveness of the Fusion Loss under different settings. The location of this parameter is shown in Fig 6.
As shown in Table 2, the parameter tuning strategy focuses on the optimization of five core parameters. We set the weights of the BPR Loss, Regularization Loss, and Contrastive Loss based on previous studies and the characteristics of the model. To strengthen the influence of the original data during model updates, the Lambda Coefficient was set so that the original features and the new features are fused at a ratio of 8:2, ensuring stability during model updates. For the Fusion Loss weight, we tested different parameter ranges and finally selected 1×10⁻⁵, which performed best on most datasets, to effectively fuse the multimodal features.
The experimental results are shown in Table 2 and Fig 7, where the blue histogram in Fig 7 represents the Recall@20 indicator and the red histogram represents the NDCG@20 indicator.
[Figure omitted. See PDF.]
Impact of different Fusion Loss parameter settings on Recall@20 and NDCG@20 performance metrics on the Clothing, Sports, and Baby datasets.
Recall@20 comparison.
In terms of the Recall@20 indicator, the experimental results clearly show that the Fusion Loss parameter has a significant effect on the recall rate of the model.
Specifically, Clothing dataset: Recall@20 reached its highest value of 0.0796 when the Fusion Loss parameter was set to 1×10⁻⁵, an improvement over the 1×10⁻³ setting.
Sports dataset: Similarly, when the Fusion Loss parameter was 1×10⁻⁵, Recall@20 reached its maximum value of 0.1012.
Baby dataset: Although the Recall values under the different Fusion Loss parameters varied only slightly, the 1×10⁻⁵ setting still achieved the best performance (0.0897).
NDCG@20 analysis.
For the NDCG@20 metric, the experimental results show some interesting findings:
Clothing dataset: Unlike Recall@20, NDCG@20 did not reach its maximum at a Fusion Loss parameter of 1×10⁻⁵ but at 1×10⁻³ (0.0358), indicating that a larger Fusion Loss parameter may be more conducive to optimizing ranking accuracy on the Clothing dataset.
Sports dataset: When the Fusion Loss parameter is 1×10⁻⁵, NDCG@20 reaches its maximum value of 0.0465, indicating that this setting noticeably improves ranking accuracy on this dataset.
Baby dataset: When the Fusion Loss parameter is 1×10⁻⁵, the NDCG@20 value (0.0407) is the lowest among all tested parameters. This may be because the Baby dataset has a distinctive data distribution, so a given parameter setting is not always optimal for it. Compared with the Clothing and Sports datasets, the Baby dataset may have fewer features, so the Fusion Loss parameter needs to closely match the characteristics of the dataset to maximize ranking quality.
Comprehensive analysis.
The selection of the Fusion Loss parameter has a noticeable impact on model performance. In most cases, appropriately reducing the Fusion Loss parameter (e.g., to 1×10⁻⁵) improves the recall of the model (Recall@20); however, for ranking accuracy (NDCG@20), the optimal parameter may vary with the characteristics of the dataset.
Comparison of CFMM using different parameters
Through comparative experiments using different parameters on the CFMM model, the study found that parameters such as BPR Loss, Regularization Loss, Contrastive Loss, and the Lambda coefficient in the feature map skip connection have a significant impact on model performance, as shown in Table 2 and Fig 8. The location of these parameters is shown in Fig 6. Choosing appropriate parameter settings can optimize the recommendation performance of the model and achieve better performance on Recall, NDCG and other indicators. These findings provide valuable reference for parameter tuning of CFMM models in practical applications.
[Figure omitted. See PDF.]
This figure shows the Recall and NDCG performance indicators of the CFMM model on the Clothing (Blue), Sports (Green) and Baby (Red) datasets when using different hyperparameters of the BPR (a), Regularization (b), Contrastive (c) Loss function and different Lambda Coefficient (d). The solid line in the figure represents the Recall value using the current dataset, and the dotted line represents the NDCG value using the current dataset. The horizontal axis represents different parameters or coefficients, and the vertical axis corresponds to the Recall and NDCG values respectively.
Different BPR loss parameters.
It can be clearly seen from Table 2 and Fig 8 (a) that when the BPR Loss λB is set to 1 parameter, the Recall and NDCG values of the CFMM model on the three data sets reach the highest.
The experimental results show that a BPR Loss weight of 1 balances the recall and ranking capabilities of the model well. Too small a weight (such as 0.1) may cause the model to underfit the training data, while too large a weight (such as 10) may cause it to overfit and reduce generalization ability. A weight of 1 sits at a good balance point, ensuring that the model fits the data while avoiding the risk of overfitting. The three datasets have different distributions and characteristics, yet λB = 1 performs well on Clothing, Sports, and Baby alike, indicating that this setting has a degree of universality and adapts well across datasets.
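For readers unfamiliar with the term, a minimal PyTorch-style sketch of the BPR loss [4] is given below. The function signature and the scaling by λB are illustrative assumptions, not the exact CFMM implementation.

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, pos_item_emb, neg_item_emb, lambda_b=1.0):
    """Bayesian Personalized Ranking loss (illustrative form).

    Encourages each user's score for an interacted (positive) item to
    exceed the score for a sampled non-interacted (negative) item.
    """
    pos_scores = (user_emb * pos_item_emb).sum(dim=-1)  # [batch]
    neg_scores = (user_emb * neg_item_emb).sum(dim=-1)  # [batch]
    # -log(sigmoid(pos - neg)), averaged over the batch, scaled by lambda_b.
    loss = -F.logsigmoid(pos_scores - neg_scores).mean()
    return lambda_b * loss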
Different regularization loss parameters.
As Table 2 and Fig 8 (b) clearly show, the Recall and NDCG values of the CFMM model on the three datasets are highest when the Regularization Loss weight λR is set to 1×10−5.
The experiments show that this setting (1×10−5) more effectively optimizes the recommendation performance of the model and enables it to better capture the latent relationships between users and items. Too small a Regularization Loss weight (such as 1×10−7) may leave the model too complex and prone to overfitting the training data, while too large a weight (such as 1×10−3) may suppress the model's learning capacity and degrade performance. As a moderate value, 1×10−5 achieves a good balance between model complexity and performance.
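A hedged sketch of how such an L2 regularization term, scaled by λR, might be computed over the embeddings in a batch is shown below; the variable names and the batch-normalized form are assumptions for illustration rather than the exact CFMM penalty.

```python
import torch

def l2_regularization(user_emb, pos_item_emb, neg_item_emb, lambda_r=1e-5):
    """Illustrative L2 penalty on the embeddings used in a training batch,
    scaled by the Regularization Loss weight discussed above."""
    reg = (user_emb.pow(2).sum()
           + pos_item_emb.pow(2).sum()
           + neg_item_emb.pow(2).sum()) / user_emb.shape[0]
    return lambda_r * reg
```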
Different contrastive loss parameters.
As Table 2 and Fig 8 (c) clearly show, the Recall and NDCG values of the CFMM model on the three datasets reach their highest values when the Contrastive Loss weight λC is set to 0.03.
Contrastive Loss is a loss function used to enlarge the difference between positive and negative samples, and it is particularly useful when dealing with multimodal data. In the CFMM model, it constrains the fusion of multimodal information and thereby improves the performance of the recommendation system. Too large a weight (such as 0.3) makes the model focus excessively on separating positive and negative samples, leaving it overly sensitive during training and prone to overfitting, especially on small datasets. Too small a weight (such as 0.003) makes the model insensitive to the distinction between positive and negative samples, so the separation is not pronounced enough and recommendation performance suffers. At 0.03, the model maintains a moderate degree of separation, neither too strict nor too loose, which is conducive to learning better feature representations. A Contrastive Loss weight of 0.03 is therefore the more reasonable choice for CFMM.
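The sketch below shows one common InfoNCE-style way of constructing such a contrastive term between image and text embeddings of the same items, where matching pairs act as positives and all other in-batch pairs as negatives. The temperature value and the symmetric in-batch formulation are illustrative assumptions and may differ from the exact contrastive loss used in CFMM.

```python
import torch
import torch.nn.functional as F

def modality_contrastive_loss(img_emb, txt_emb, temperature=0.2, lambda_c=0.03):
    """Illustrative InfoNCE-style contrastive term between the image and
    text embeddings of the same batch of items."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                  # [batch, batch] similarities
    targets = torch.arange(img.size(0), device=img.device)  # diagonal = positives
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss = 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
    return lambda_c * loss
```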
Different lambda coefficient of skip connection.
The Lambda Coefficient is a hyperparameter that controls the mixing ratio between the original feature map and the newly generated feature map in the skip connection. It is used when updating the image and text feature maps: the original feature map and the feature map computed from the new similarity matrix are combined with this weight so that the feature maps are updated smoothly. When the Lambda Coefficient is close to 0, the newly computed feature map dominates the update and the original feature map has little influence; when it is close to 1, the original feature map dominates and the newly computed one has little influence. The experimental results are shown in Table 2 and Fig 8 (d).
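This weighted update can be summarized in a one-line sketch. The function and argument names below are illustrative (the arguments stand for item-item similarity/feature maps), but the mixing rule follows the description above.

```python
def update_feature_graph(original_map, learned_map, lam=0.8):
    """Weighted skip-connection update of a modality feature map:
    lam close to 1 keeps more of the original map, lam close to 0
    favours the newly learned map (illustrative sketch)."""
    return lam * original_map + (1.0 - lam) * learned_map
```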
In the CFMM model, the Lambda Coefficient (λ) is a key hyperparameter that balances the fusion ratio between the original feature map and the map obtained through training. Viewed per dataset, λ = 0.5 performs best on the Clothing and Sports datasets, while λ = 0.8 performs best on the Baby dataset. To select a single λ value suitable for all datasets, we must consider the overall generalization ability and stability of the model: a value that performs well across all datasets helps keep the model stable. Since λ = 0.8 shows smaller performance fluctuations across the three datasets and therefore higher stability, we consider λ = 0.8 the better overall choice when generalization and stability are weighed together.
Conclusion
The CFMM model significantly enhances recommendation performance by deeply integrating multimodal information and optimizing feature fusion. It introduces a Cross-Attention mechanism for deep multimodal feature fusion and a comprehensive loss function for training optimization. Extensive experiments on three datasets show CFMM outperforms baselines, verifying its effectiveness and generalization. Key component analysis further highlights the importance of Cross-Attention and the loss function design. Future work will explore CFMM’s applicability in more domains.
Although the CFMM algorithm shows potential in multimodal data processing and has moderate complexity, its performance depends heavily on data quality, is easily affected by noise and missing data, and varies slightly across datasets of different sizes. Training the algorithm also requires good hardware support and extensive experimentation. In addition, its internal mechanism is opaque and its interpretability is limited, which restricts its application in some scenarios. Future research should address these challenges.
Acknowledgments
The authors would like to acknowledge the instrumental and technical support of the School of Computer and Artificial Intelligence, Beijing Technology and Business University.
References
1. Zhang J, Liu Q, Wu S, Wang L. Mining Stable Preferences: Adaptive Modality Decorrelation for Multimedia Recommendation. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval; Taipei, Taiwan: Association for Computing Machinery; 2023. p. 443–52.
2. Adomavicius G, Tuzhilin A. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng. 2005;17(6):734–49.
3. Su X, Khoshgoftaar TM. A Survey of Collaborative Filtering Techniques. Advances in Artificial Intelligence. 2009;2009:1–19.
4. Rendle S, Freudenthaler C, Gantner Z, Schmidt-Thieme L. BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence; Montreal, Quebec, Canada; 2009. p. 452–61.
5. Wang X, He X, Wang M, Feng F, Chua T-S. Neural Graph Collaborative Filtering. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval; Paris, France: Association for Computing Machinery; 2019. p. 165–74.
6. He X, Deng K, Wang X, Li Y, Zhang Y, Wang M. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; Virtual Event, China: Association for Computing Machinery; 2020. p. 639–48.
7. Wu J, Wang X, Feng F, He X, Chen L, Lian J, et al. Self-supervised Graph Learning for Recommendation. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; Virtual Event, Canada: Association for Computing Machinery; 2021. p. 726–35.
8. Peng Y, Zhu W, Zhao Y, Xu C, Huang Q, Lu H, et al. Cross-media analysis and reasoning: advances and directions. Frontiers Inf Technol Electronic Eng. 2017;18(1):44–57.
9. Liu Q, Hu J, Xiao Y, Zhao X, Gao J, Wang W, et al. Multimodal Recommender Systems: A Survey. ACM Comput Surv. 2024;57(2):1–17.
10. Zhang Y, Zhu Z, He Y, Caverlee J. Content-Collaborative Disentanglement Representation Learning for Enhanced Recommendation. Proceedings of the 14th ACM Conference on Recommender Systems; Virtual Event, Brazil: Association for Computing Machinery; 2020. p. 43–52.
11. Ma J, Zhou C, Cui P, Yang H, Zhu W. Learning disentangled representations for recommendation. Proceedings of the 33rd International Conference on Neural Information Processing Systems: Curran Associates Inc.; 2019;513.
12. Zhang J, Zhu Y, Liu Q, Zhang M, Wu S, Wang L. Latent Structure Mining With Contrastive Modality Fusion for Multimedia Recommendation. IEEE Trans Knowl Data Eng. 2023;35(9):9154–67.
13. He R, McAuley J. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. AAAI. 2016;30(1).
14. Wei Y, Wang X, Nie L, He X, Hong R, Chua T-S. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. Proceedings of the 27th ACM International Conference on Multimedia; Nice, France: Association for Computing Machinery; 2019. p. 1437–45.
15. Wei Y, Wang X, Nie L, He X, Chua T-S. Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback. Proceedings of the 28th ACM International Conference on Multimedia; Seattle, WA, USA: Association for Computing Machinery; 2020. p. 3541–9.
16. Liao J, Liu T, Yin H, Chen T, Wang J, Wang Y. An integrated model based on deep multimodal and rank learning for point-of-interest recommendation. World Wide Web. 2021;24(2):631–55.
17. Liu Y, Yang S, Lei C, Wang G, Tang H, Zhang J, et al. Pre-training Graph Transformer with Multimodal Side Information for Recommendation. Proceedings of the 29th ACM International Conference on Multimedia; Virtual Event, China: Association for Computing Machinery; 2021. p. 2853–61.
18. Jung MC, Monteil J, Schulz P, Vaskovych V. Personalised Outfit Recommendation via History-aware Transformers. Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining; Hannover, Germany: Association for Computing Machinery; 2025. p. 633–41.
19. Khalafaoui Y, Lovisetto M, Matei B, Grozavu N. CADMR: Cross-Attention and Disentangled Learning for Multimodal Recommender Systems. 2024. arXiv:2412.02295 [Preprint]. https://arxiv.org/abs/2412.02295
20. Wu C, Wu F, Qi T, Zhang C, Huang Y, Xu T. MM-Rec: Visiolinguistic Model Empowered Multimodal News Recommendation. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval; Madrid, Spain: Association for Computing Machinery; 2022. p. 2560–4.
21. Liu Y, Zhang Y, Mao X, Zhou X, Chang J, Wang W, et al. Lithological Facies Classification Using Attention-Based Gated Recurrent Unit. Tsinghua Sci Technol. 2024;29(4):1206–18.
22. Tao Z, Wei Y, Wang X, He X, Huang X, Chua T-S. MGAT: Multimodal Graph Attention Network for Recommendation. Information Processing & Management. 2020;57(5):102277.
23. Sun R, Cao X, Zhao Y, Wan J, Zhou K, Zhang F, et al. Multi-modal Knowledge Graphs for Recommender Systems. Proceedings of the 29th ACM International Conference on Information & Knowledge Management; Virtual Event, Ireland: Association for Computing Machinery; 2020. p. 1405–14.
24. Qi L, Liu Y, Zhang Y, Xu X, Bilal M, Song H. Privacy-Aware Point-of-Interest Category Recommendation in Internet of Things. IEEE Internet Things J. 2022;9(21):21398–408.
25. Guo G, Wang H, Bell D, Bi Y, Greer K, editors. KNN Model-Based Approach in Classification. On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE; 2003; Berlin, Heidelberg: Springer Berlin Heidelberg.
26. Wu F, Souza A, Zhang T, Fifty C, Yu T, Weinberger K. Simplifying Graph Convolutional Networks. In: Kamalika C, Ruslan S, editors. Proceedings of the 36th International Conference on Machine Learning; Proceedings of Machine Learning Research: PMLR; 2019. p. 6861–71.
27. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. Computer Science. 2014.
28. He X, Liao L, Zhang H, Nie L, Hu X, Chua T-S. Neural Collaborative Filtering. Proceedings of the 26th International Conference on World Wide Web; Perth, Australia: International World Wide Web Conferences Steering Committee; 2017. p. 173–82.
29. Baltrusaitis T, Ahuja C, Morency L-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans Pattern Anal Mach Intell. 2019;41(2):423–43. pmid:29994351
30. Srivastava N, Salakhutdinov R. Multimodal learning with Deep Boltzmann Machines. Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2; Lake Tahoe, Nevada: Curran Associates Inc.; 2012. p. 2222–30.
31. Koren Y, Bell R, Volinsky C. Matrix Factorization Techniques for Recommender Systems. Computer. 2009;42(8):30–7.
32. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929–58.
33. Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In: Maria Florina B, Kilian QW, editors. Proceedings of The 33rd International Conference on Machine Learning; Proceedings of Machine Learning Research: PMLR; 2016. p. 1050–9.
34. McAuley J, Targett C, Shi Q, Hengel Avd. Image-Based Recommendations on Styles and Substitutes. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval; Santiago, Chile: Association for Computing Machinery; 2015. p. 43–52.
35. McAuley J. Amazon product data: UCSD; 2014. Available from: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html
36. Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 2019 November; Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410
37. Deshpande M, Karypis G. Item-based top-N recommendation algorithms. ACM Trans Inf Syst. 2004;22(1):143–77.
38. Zhang Y, Yang Q. A Survey on Multi-Task Learning. IEEE Trans Knowl Data Eng. 2022;34(12):5586–609.
39. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge: Cambridge University Press. 2008.
40. Wang Y, Wang L, Li Y, He D, Liu T-Y. A Theoretical Analysis of NDCG Type Ranking Measures. In: Shai S-S, Ingo S, editors. Proceedings of the 26th Annual Conference on Learning Theory; Proceedings of Machine Learning Research: PMLR; 2013. p. 25–54.
41. Järvelin K, Kekäläinen J. Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst. 2002;20(4):422–46.
42. Cremonesi P, Koren Y, Turrin R. Performance of recommender algorithms on top-n recommendation tasks. Proceedings of the fourth ACM conference on Recommender systems; Barcelona, Spain: Association for Computing Machinery; 2010. p. 39–46.
43. Ricci F, Rokach L, Shapira B, Kantor PB. Recommender systems handbook. Springer-Verlag. 2010.
44. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: an imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems: Curran Associates Inc.; 2019;721.
45. Kingma D, Ba J. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations. 2014.
46. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Yee Whye T, Mike T, editors. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics; Proceedings of Machine Learning Research: PMLR; 2010. p. 249–56.
Citation: Zhang C, Zhang Z (2025) Cross-fusion activates deep modal integration for multimedia recommendation. PLoS One 20(7): e0327663. https://doi.org/10.1371/journal.pone.0327663
About the Authors:
Chong Zhang
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Software, Supervision, Validation, Writing – original draft, Writing – review & editing
E-mail: [email protected]
Affiliation: School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing, China
ORCID: https://orcid.org/0000-0003-3906-3861
ZhiCai Zhang
Roles: Software, Validation, Visualization, Writing – original draft
Affiliation: School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing, China
© 2025 Zhang, Zhang. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
Recommendation systems play a significant role in information presentation and research. In particular, goods recommendations for consumers should match consumer psychology, speed up product search, and improve the efficiency of product transactions. Online platforms provide product information and interactive information between customers and products. However, the interactive modeling effect of the existing multimedia algorithms on this information must be improved, for instance, by deeply integrating product and interactive information. Accordingly, we propose a cross-fusion-activated multi-modal (CFMM) integration method for recommender systems to achieve deep fusion of product and user information. This method adds a cross-fusion module to fuse the features of different modalities through deep-feature fusion. A fusion loss function is further proposed to improve the recommendation performance of the network. Extensive experiments were conducted on three real-world datasets along with multiple ablation studies to illustrate the effects of the different modules. The experimental results show that the proposed method exhibits better recommendation performance, providing a maximum improvement of 3.8% in the recommendation performance metrics Recall@20, NDCG@20, and Precision@20 in comparisons with existing algorithms. This method realizes a deeper integration of multimodal information; however, the performance can be further improved by extending the multimodal information interaction algorithm to include product and user information.
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer