Content area
Purpose
Improving the diversity of recommended information has become one of the latest research hotspots for solving information cocoons. Aiming to achieve both high accuracy and diversity in recommender systems, a hybrid method is proposed in this paper. This study aims to discuss the aforementioned method.
Design/methodology/approach
This paper integrates the latent Dirichlet allocation (LDA) model and the locality-sensitive hashing (LSH) algorithm to design a topic recommendation system. To measure the effectiveness of the method, this paper builds three-level categories of journal paper abstracts from the Web of Science platform as experimental data.
Findings
(1) The results illustrate that the diversity of recommended items is significantly enhanced by leveraging a hashing function to overcome information cocoons. (2) By integrating a topic model and a hashing algorithm, the diversity of a recommender system can be improved without losing accuracy at a certain degree of refined topic levels.
Originality/value
The hybrid recommendation algorithm developed in this paper can overcome the dilemma between high accuracy and low diversity. The method could improve recommendation in business and service industries to address the problems of information overload and information cocoons.
1. Introduction
Recommender systems have been widely applied in social media, e-commerce, entertainment and other fields, owing to their ability to help users obtain desired information from massive amounts of data quickly and accurately, effectively solving the problem of information overload (Chen et al., 2021a). Different from search engines, which require users to clearly express their requirements, recommender systems actively propose suggestions as to what may interest a user based on the user's data, such as past searches, purchase records and interactive behavior. Examples include recommendations of goods based on past purchase records, of videos based on the types and duration of videos viewed and of song lists based on the type and frequency of music listened to. A recommender system can not only enable users to quickly find preferred products in a mass of data but also help merchants increase sales performance (Adomavicius and Tuzhilin, 2005). Therefore, recommender systems have become an indispensable part of the Internet environment for both business and personal use.
In early research on recommender systems, the main interest was accuracy, which measures the degree of matching between the query submitted by a user and the returned recommended products (Jiang et al., 2014). Traditionally, recommender systems improve accuracy by leveraging content-based, collaborative filtering (user-based CF, item-based CF) and hybrid recommendation methods (Sarwar et al., 2001; Linden et al., 2003; Mcmahan et al., 2013). In practice, recommender systems are widely used by Amazon, Taobao, JD and other well-known Internet companies to improve their sales performance. However, high accuracy alone may yield a homogeneous recommendation list, which greatly reduces user satisfaction (Wang et al., 2020). Especially in information acquisition, when accuracy is pursued, the news and videos browsed by users become concentrated in a certain type, giving rise to "information cocoons", which hinder users' absorption of complete content and information.
An excellent recommender system should not only return items that closely match users' needs but also identify users' personal preferences and recommend a diversified item list. In e-commerce, the "long tail distribution" on websites also verifies the importance of diversified sales of goods: the profits of many low-selling products account for more than 57% of total revenues, so managing products well is all the more meaningful in a growing product network (Oestreicher-Singer et al., 2013). In recent years, scholars have begun to focus on the diversity of recommender systems. Some have proposed diversity recommendation based on social networks, considering users' social attributes and personal preferences that change over time (Xu, 2018; Cui et al., 2020; Chen et al., 2021c). Meanwhile, several scholars try to improve the diversity and accuracy of recommender systems comprehensively through multi-objective optimization (Chai et al., 2019; Ma et al., 2020). These works illustrate that the accuracy and diversity of recommender systems can be well balanced by suitable techniques.
However, in document recommendation, the complexity of text is a crucial issue restricting both high accuracy and good diversity. Just as a coin has two sides, high recommendation accuracy tends to lead to poor diversity of recommendation results (Adomavicius and Kwon, 2012). Therefore, when recommending topics, such as the topics of journal papers, an attractive question is how to balance accuracy and diversity in practice. Natural language processing offers powerful tools for topic recommendation. Most existing research is based on text sentiment analysis, feature selection, semantic relationships or topic models (Bao et al., 2022). To realize topic diversity, this paper proposes a method based on the latent Dirichlet allocation (LDA) model that leverages the locality-sensitive hashing algorithm to improve the diversity of recommendation results. It adopts a hash function to process the relevance of the topics extracted by the LDA model and then calculates the similarity among the text distributions and the word distributions under the corresponding topics. Integrating the advantages of topic models and local similarity, a recommender system employing this algorithm can achieve both good accuracy and diversity.
The remainder of this paper is organized as follows. In Section 2, we summarize related work on recommender systems in recent years from the perspectives of accuracy and diversity, and the applications of topic models in multiple fields. In Section 3, we briefly describe the theories underlying the LDA topic model and locality-sensitive hashing (LSH), and then present the proposed algorithm in detail. To demonstrate the effectiveness of our method and its superiority over other methods, comprehensive experiments are conducted on journal abstracts from Web of Science in Section 4. Finally, in Section 5, we summarize the paper and point out directions for future improvement.
2. Literature review
In recent years, scholars have researched the accuracy and diversity of recommender systems' output quality to enrich recommendation results and increase user satisfaction. The application of topic models has also gradually diversified, involving sentiment analysis, text clustering and other fields. In this section, we review the relevant works from three aspects: recommendation methods to improve accuracy, applications of the topic model and recommendation methods to improve diversity.
2.1 Recommendation methods to improve accuracy
Research on the accuracy of recommender systems aims to improve the degree of matching between recommendation results and users' requirements. All the traditional recommendation methods, including content-based, collaborative filtering and hybrid methods, were initially designed to obtain high recommendation accuracy. Additionally, some scholars have studied accuracy from the perspective of matrix decomposition. Yang et al. (2020) replaced zero values in user preference information through a system of equations according to the law and constraints of multiplication convergence, alleviating the commodity attribute sparsity caused by the subjectivity of user comments; this brought meaningful values into the matrix and helped map the words of comments to corresponding topics. Shi et al. (2019) proposed a recommendation method called HERec that transforms nodes through a set of fusion functions and integrates them into an extended matrix factorization model. Within collaborative filtering, to address the difficulty that traditional web service recommendation methods have in handling large amounts of data, Zhang et al. (2018) explored a coverage algorithm, CA-QGS, based on Spark quotient space granularity analysis. In these works, complex algorithms were refined to enhance the accuracy of recommender systems, mostly in the field of business.
Meanwhile, some scholars have integrated different algorithms into recommender systems in the domain of text mining. Qi et al. (2021) proposed a recommendation method based on enhanced locality-sensitive hashing, which solves the problems of sparse recommendation data and inaccurate results caused by sensitive user information; by doing so, it supports recommendations for both new and existing users. Evidently, solving cold start and data sparseness is also a promising research direction in text mining. Networks formed by users or words have also been added to recommender systems to refine recommendation results. Kong et al. (2021) proposed a recommender system named VOPRec that uses text and network representation learning for unsupervised feature design and generates a recommendation list based on the similarity calculated from paper vectors. In particular, network analysis among papers provides additional structural information that yields more precise recommendations. Liu et al. (2020a) put forward a keyword-driven and popularity-aware paper recommendation method based on an undirected paper citation graph, which addresses the shortcoming that existing recommender systems ignore the associations between papers. It can be seen that research on the accuracy of recommender systems has been plentiful. However, whether by optimizing existing algorithms or introducing new ones to improve accuracy, it remains difficult to mine the topics that users may be interested in.
2.2 Application of topic model
A topic model is a statistical model that clusters the latent semantic structure of a corpus using unsupervised learning. Among topic models, latent Dirichlet allocation (LDA) is one of the most commonly used. In recent years, scholars have been investigating the improvement and application of topic models (Zhu et al., 2022; Ma et al., 2018). Liu et al. (2020b) generated a high-quality topic model, iLDA, through an interaction strategy that combines subjective knowledge proposed by human experts with objective knowledge integrated by topic models. The output of the LDA model applied to a document is a topic list; different documents can be transformed into different topic instances that constitute a topic matrix. By calculating the similarity between topics with Euclidean distance, cosine similarity or Jensen-Shannon divergence, one can obtain aggregated topic models to find similar topics and their relations among documents (Blair et al., 2020). Allahyari et al. (2019) combined the interrelationships between entities with topic models and proposed topic models with improved topic coherence.
Topic models can also be used for feature extraction and clustering of texts. After texts are transformed from words to topics, several clustered topics represent the features of the whole document. Curiskis et al. (2020) extracted four features from the term frequency-inverse document frequency matrix and word embedding models and applied them to short-text clustering in online social networks. The co-occurrence of topics in one document reveals hidden relationships from which topic networks can be built. Wang et al. (2021) proposed a supervised topic model that integrates prior information through a specific generation network. Topic network analysis leads to topic evolution rules and clustering communities. These findings can help to discover useful knowledge at the topic level rather than the lexical level.
Furthermore, some researchers combine topic models with sentiment analysis to discover user preferences expressed in text content. Osmani et al. (2020) applied the LDA topic model to pack-level sentiment analysis, enriching research on topic models in the field of sentiment analysis. Wangwatcharakul and Wongthanavasu (2020) studied a topic model based on a dynamic environment to capture user preferences along with the topic evolution of comment texts. By enriching topic models with emotional and dynamic information, user preferences can be discovered more accurately, which helps to improve recommendation results. In addition, topic models have been applied in different scenarios. Based on topic modeling and sentiment analysis, social media content and social networks have been used to seek product opportunities or find topic change trends for enterprises and individuals (Jeong et al., 2019; Ho and Do, 2018). Chen et al. (2021a) examined corpus metadata with discovered topics and proposed a systematic corpus topic analysis method for scientific literature. Therefore, topic models are important for extracting the features and relations among documents and giving precise recommendations.
2.3 Recommendation methods to improve diversity
Different from accuracy, the diversity of a recommender system helps users discover items in an environment of information overload by recommending different categories of products. Some scholars realize the diversity of recommender systems by tuning the recommendation algorithm. Social networks imply connections among people that can broaden personalized recommendation (Xu, 2018). Li et al. (2021) extended a heterogeneous network that includes papers, locations, authors, terms, users and the relationships between these entities and used this network for personalized paper recommendation. Machine learning methods such as clustering and association rules have been suggested for recommending items of various types, for example by clustering users with similar behavior or mining the relationships between different commodity collections. Wang and Chen (2021) developed an improved K-means algorithm for clustering mobile voice communication users to provide service recommendation. Chen et al. (2021b) adopted association rules between hotel recommendation functions to evaluate users' potential level, forming a two-stage personalized recommender system. With these methods, related communities of users and items can be recommended.
There are also studies that explore users' potential interests by considering their historical records, online comments and other attributes. Wang et al. (2020) described a diversified and efficient recommendation method based on historical usage records and locality-sensitive hashing. Locality-sensitive hashing has proved to be an effective algorithm for attaining diversity in recommender systems. Therefore, this paper extends this method by integrating a topic model to achieve both high accuracy and diversity. Similar to our conception, Chen et al. (2021c) integrated users' purchase sequences into a weighted attention-flow network and proposed an algorithm called AFNPR to enrich the diversity of recommendation. In their algorithm, the weights of edges (W), a good indication of all users' attention flow, were defined to reflect the correlation strength between the two ends of an edge in the attention-flow network. Other algorithms realize users' personalized recommendations with the help of user clustering or time correlation coefficients (Chen et al., 2020; Cui et al., 2020). These novel recommender systems strengthen global search ability and expand recommendation categories.
In addition, in recent years, the accuracy and diversity of recommender systems have been considered jointly to seek an optimal solution. Jiang et al. (2014), Chai et al. (2019) and Ma et al. (2020) respectively used nonlinear binary integer programming, a multi-level nested logit model, singular value decomposition, a multi-objective immune algorithm, kernel density estimation and a multi-objective optimization collaborative filtering algorithm to achieve both high accuracy and diversity. These works enlighten subsequent researchers to design hybrid methods that integrate two or more algorithms to balance the recommendation results.
3. Method and algorithm
This section first briefly introduces the latent Dirichlet allocation (LDA) topic model and locality-sensitive hashing (LSH) and then details the proposed text recommendation algorithm based on these two techniques. The LDA topic model extracts latent topics from a collection of documents; locality-sensitive hashing enables finding similar items in high-dimensional spaces. The proposed algorithm applies LDA to extract topics from the existing texts in the database and then generates a locality-sensitive hash code for each text based on its topic distribution. For a given input text, the algorithm applies LDA to obtain its topic distribution and hash code, then searches the database for texts with similar hash codes to recommend relevant texts. In this way, the algorithm leverages both topic information and locality-sensitive hashing to identify texts with similar topics and contents, enabling effective text recommendation.
3.1 LDA topic model
The LDA topic model is a probabilistic model that uses unsupervised learning and the bag-of-words approach to generate the probability of each topic and of the words associated with each topic. In this model, each document is composed of independent words, and each word comes from a particular topic. The topics of the document and the words within each topic follow specific probability distributions. Through Gibbs sampling, a Markov chain Monte Carlo technique, the joint distribution of document-topic and topic-word assignments is sampled to obtain the final distributions. The graphical structure of the LDA topic model, following Blei et al. (2003), is shown in Figure 1.
In Figure 1, $\vec{\alpha} \to \vec{\theta}_m \to z_{m,n}$ is the path of the topic that generates the n-th word of the m-th document, while $\vec{\beta} \to \vec{\varphi}_k \to w_{m,n}$ is the path that generates the n-th word of the m-th document based on the selected word topic $z_{m,n} = k$. Among them, $\vec{\theta}_m$ and $\vec{\varphi}_k$ follow the Dirichlet distribution, and $z_{m,n}$ and $w_{m,n}$ follow the multinomial distribution. It can be seen that $\vec{\alpha} \to \vec{\theta}_m \to \vec{z}_m$ is the path to the topics that generate all the words in document m; the probability of this process can be expressed as Formula (1). $\vec{\beta} \to \vec{\varphi}_k \to \vec{w}$ represents the generation of all words under all topics, with probability expressed as Formula (2). The joint probability of the whole process is expressed as Formula (3):

(1) $p(\vec{z}_m \mid \vec{\alpha}) = \int p(\vec{z}_m \mid \vec{\theta}_m)\, p(\vec{\theta}_m \mid \vec{\alpha})\, d\vec{\theta}_m$

(2) $p(\vec{w} \mid \vec{z}, \vec{\beta}) = \int p(\vec{w} \mid \vec{z}, \Phi)\, p(\Phi \mid \vec{\beta})\, d\Phi$

(3) $p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{z} \mid \vec{\alpha})\, p(\vec{w} \mid \vec{z}, \vec{\beta})$
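To make this sampling process concrete, the following is a minimal collapsed Gibbs sampler for LDA in pure Python. It is a pedagogical sketch, not the implementation used in this paper's experiments; the corpus, hyperparameters and function name are illustrative:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: resample each token's topic from its
    full conditional, then estimate the document-topic (theta) and
    topic-word (phi) distributions from the final counts."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * K for _ in docs]               # document-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K                                # tokens assigned to each topic
    z = []                                      # topic assignment per token
    for m, d in enumerate(docs):                # random initialization
        zs = []
        for w in d:
            k = rng.randrange(K)
            zs.append(k)
            ndk[m][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for m, d in enumerate(docs):
            for n, w in enumerate(d):
                k = z[m][n]                     # remove current assignment
                ndk[m][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional: p(z=j) ~ (ndk+alpha)*(nkw+beta)/(nk+V*beta)
                ps = [(ndk[m][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                      for j in range(K)]
                r = rng.random() * sum(ps)      # sample a new topic
                acc = 0.0
                for j, p in enumerate(ps):
                    acc += p
                    if r <= acc:
                        k = j
                        break
                z[m][n] = k
                ndk[m][k] += 1; nkw[k][w] += 1; nk[k] += 1
    theta = [[(ndk[m][j] + alpha) / (len(docs[m]) + K * alpha) for j in range(K)]
             for m in range(len(docs))]
    phi = [{w: (nkw[j][w] + beta) / (nk[j] + V * beta) for w in vocab}
           for j in range(K)]
    return theta, phi
```

Each row of `theta` is a document's topic distribution and each entry of `phi` is a topic's word distribution, i.e. the two outputs the recommendation algorithm below consumes.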
3.2 Locality-sensitive hashing (LSH)
Locality-sensitive hashing is a fast algorithm for approximate nearest neighbor searches on large-scale, high-dimensional data. It processes the data (either the raw data or extracted feature vectors) into a matrix and designs a hash function to map the high-dimensional data points. After mapping or projection transformation, a hash value for the data is obtained, also called a signature matrix. In the resulting hash table, two highly similar data points are likely to be hashed into the same bucket, while two dissimilar points are unlikely to fall into the same bucket. This allows efficient indexing and retrieval of approximate nearest neighbors from massive datasets.
The function of LSH needs to simultaneously comply with the conditions of Formulas (4) and (5), where x and y represent data objects with multi-dimensional attributes, d(x, y) represents the degree of dissimilarity between the two data objects (which can also be understood as the distance between them), h(x) and h(y) represent the hash values of the two data objects, and p1, p2 and d1, d2 represent probability thresholds and distance thresholds, respectively:

(4) if $d(x, y) \le d_1$, then $\Pr[h(x) = h(y)] \ge p_1$

(5) if $d(x, y) \ge d_2$, then $\Pr[h(x) = h(y)] \le p_2$

Formulas (4) and (5) mean that when the distance between two data objects is small enough, their hash values are equal with high probability; that is, the two objects are likely to be similar. In the hash table, similar data are placed in the same bucket to distinguish different categories of data.
Several distance metrics can be used to define hash functions, including the Jaccard distance, Hamming distance and cosine distance. Among these, the cosine distance is commonly used for locality-sensitive hashing. It measures the difference between two data points by computing the cosine of the angle between their vector representations (Adomavicius and Tuzhilin, 2005); the smaller the angle between two vectors, the more similar they are. In the topic matrix, the cosine distance is typically used as the similarity function between topic vectors. The text classification process based on LSH is shown in Figure 2. The numbers after the underscore represent the serial numbers of different elements in the element library; for example, "topic_n" represents the n-th topic of a document, "document_n" represents the n-th document in the document library and "bucket_n" represents the n-th bucket.
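As an illustration of cosine-based LSH, the sketch below hashes vectors with random hyperplanes; the vectors, bit count and seed are illustrative assumptions, not values from the paper. Because scaling a vector does not change which side of a hyperplane it lies on, a vector and its scaled copy always receive the same signature:

```python
import random

def make_planes(dim, n_bits, seed=0):
    """One random hyperplane (its normal vector) per signature bit."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per plane: 1 if vec lies on the non-negative side, else 0."""
    return tuple(1 if sum(v * x for v, x in zip(p, vec)) >= 0 else 0
                 for p in planes)

planes = make_planes(3, 8)
a = [0.2, 0.5, 0.3]        # e.g. a topic distribution
b = [2 * x for x in a]     # same direction, different magnitude
```

Vectors pointing in nearly the same direction agree on most bits, which is why hashing topic distributions this way groups topically similar documents into the same buckets with high probability.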
3.3 Topic recommendation algorithm based on LDA and LSH
The proposed topic recommendation method consists of two parts: topic modeling and hash processing. To extract topics, we first calculate the topic distribution of the text content and the word distribution under each topic; in this work, we identify the topics of a text using the LDA topic model. Then, we construct a topic-word co-occurrence matrix between the texts and calculate the similarity between the matrices using the locality-sensitive hashing (LSH) algorithm. Finally, similar topics are assigned to the same hash bucket according to their similarity probability. On this basis, a topic recommendation algorithm based on LDA and LSH is proposed to achieve accurate and diverse text recommendations. The symbols used in this paper and their meanings are shown in Table 1. The specific steps are as follows:
Step 1: Generate and merge topic
In this step, the LDA model is first used to perform topic analysis on each document doc_i in the corpus DOC to generate a topic result T_i. This result contains the topic distribution and the word distribution under each topic, where i represents the text number in the document library and j represents the number of topics generated. The number of topics j and the number of words n under each topic can be changed by setting the LDA hyperparameters; how to select the specific values is explained in detail in the experimental section. Each topic result t_{i,j} is a list of words and their probabilities under that topic.
For example, the topic analysis result of the third text in DOC is T_3 = {t_{3,1}, t_{3,2}}; that is, the text has two topics {t_{3,1}, t_{3,2}}. The result list for each topic takes the form t_{i,j} = {(w_1, p_1), (w_2, p_2), …, (w_n, p_n)}, where each pair (w, p) represents the probability p of word w under this topic. The specific display of t_{3,1} is shown in Formula (6). The number of topics j of each document and the number of words n under each topic can be modified by setting parameters:

(6) t_{3,1} = {(w_1, p_1), (w_2, p_2), …, (w_n, p_n)}

Secondly, the topic result of one document in DOC is merged with the result of each remaining document. Assuming that the number of topics in each document is j and the number of topic words is n, the j topics in T_3 and T_m (the topic result of the m-th document in DOC) are merged in pairs by sequence number.

When the number of document topics j is 2, the merge between the topic result T_3 of doc_3 and the topic result T_m of doc_m yields the merged topic-word distributions MT_3 and MT_m shown in Formulas (7)-(10), where the vocabulary in topic mt_{3,1} is the same as in mt_{m,1}, and the vocabulary in topic mt_{3,2} is the same as in mt_{m,2}. The pseudocode for generating and merging topics is shown in Figure 3, where x represents the number of topics defined for each document:

(7) mt_{3,1} = {(w_1, p_1), (w_2, p_2), …, (w_{2n-r}, p_{2n-r})}

(8) mt_{3,2} = {(w'_1, p'_1), (w'_2, p'_2), …, (w'_{2n-r'}, p'_{2n-r'})}

(9) mt_{m,1} = {(w_1, q_1), (w_2, q_2), …, (w_{2n-r}, q_{2n-r})}

(10) mt_{m,2} = {(w'_1, q'_1), (w'_2, q'_2), …, (w'_{2n-r'}, q'_{2n-r'})}
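The pairwise merge of two topics onto a shared word order can be sketched as follows, assuming each topic is represented as a word-to-probability mapping and words absent from a topic receive probability zero; the function name and example topics are illustrative, not the paper's implementation:

```python
def merge_pair(t_a, t_b):
    """Align two topics on their union vocabulary. The union has length
    2n - r, where n is the words per topic and r the number of shared
    words. Returns the common word order plus both probability vectors."""
    vocab = sorted(set(t_a) | set(t_b))
    return (vocab,
            [t_a.get(w, 0.0) for w in vocab],
            [t_b.get(w, 0.0) for w in vocab])

# two topics with n = 3 words each, sharing r = 1 word ("model")
t31 = {"topic": 0.5, "model": 0.3, "lda": 0.2}
tm1 = {"hash": 0.6, "bucket": 0.3, "model": 0.1}
vocab, p3, pm = merge_pair(t31, tm1)
```

After the merge, both probability vectors follow the same word order, so they can be fed directly to the hash function of Step 2.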
Step 2: Calculate and build the hash table
In this step, (1) based on the merging result from Step 1, the word probabilities of topics with the same sequence number in the merged topic distributions of the two documents are extracted as input data for the hash table calculation; (2) the high-dimensional probability data are reduced by constructing a hash function that uses randomly generated judgment vectors; (3) the hash tables of the documents' topics are calculated using the hash function family; and (4) similar documents are assigned to the same buckets based on the similarities of their hash tables.
Taking the merged topic-word distributions MT_3 and MT_m as an example, the calculation of Step 2 proceeds as follows. First, the word probabilities of a pair of merged topics mt_{3,1} and mt_{m,1} in MT_3 and MT_m are extracted as the input data of this step, where the words are arranged in the same order. The adjusted word orders of mt_{3,1} and mt_{m,1} are shown in Formulas (11)-(12), and the extracted probability data are shown in Formulas (13)-(14). Both are 2n-r in length, where r represents the number of words the two merged topics have in common:

(11) mt_{3,1} = {(w_1, p_1), (w_2, p_2), …, (w_{2n-r}, p_{2n-r})}

(12) mt_{m,1} = {(w_1, q_1), (w_2, q_2), …, (w_{2n-r}, q_{2n-r})}

(13) data_1 = (p_1, p_2, …, p_{2n-r})

(14) data_2 = (q_1, q_2, …, q_{2n-r})

Then, the hash function in Formula (15) is constructed to reduce the dimension of the probability data data_1 and data_2, where v = (v_1, v_2, …, v_k) is a judgment vector with the same length as data_1 and data_2, k equals the number of words 2n-r in the merged topic and each component of v is a number randomly generated from [-1, 1]. The physical meaning of Formula (15) is that v defines a hyperplane used for segmentation. When the calculation results for data_1 and data_2 are the same, that is, h(data_1) = h(data_2) = 1 or h(data_1) = h(data_2) = 0, the vectors data_1 and data_2 lie on the same side of the hyperplane v and can therefore be considered similar items. The symbol · represents the inner product of two vectors:

(15) h(x) = 1 if v · x ≥ 0, and h(x) = 0 if v · x < 0

(16) H(x) = (h_1(x), h_2(x), …, h_B(x))
The calculation results h(data_1) and h(data_2) are 0-1 binary values. Because the judgment vector used for the calculation is randomly generated, a single result may be contingent on the random values. To solve this problem, a family of functions H containing multiple functions h_i is constructed, as in Formula (16), to improve the robustness of the results. In H, each function h_i corresponds to a judgment vector v_i; B represents the number of functions in the family, and X represents the judgment vector family composed of the B judgment vectors v_i. Once the hash tables are obtained, the similarity between two hash tables is calculated to judge the similarity between two documents; similar documents are then placed in the same bucket. The pseudocode for calculating and building the LSH table is shown in Figure 4.
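The function family and the bucketing step might be sketched as below, under the simplifying assumption that documents whose B-bit hash tables are identical share a bucket; the names and sizes are illustrative:

```python
import random
from collections import defaultdict

def make_family(dim, B, seed=42):
    """B judgment vectors, each component drawn from [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(B)]

def hash_table(vec, family):
    """Apply every h_i in the family: one 0-1 bit per judgment vector."""
    return tuple(1 if sum(v * x for v, x in zip(jv, vec)) >= 0 else 0
                 for jv in family)

def build_buckets(doc_vecs, family):
    """Group documents whose hash tables are identical."""
    buckets = defaultdict(list)
    for doc_id, vec in doc_vecs.items():
        buckets[hash_table(vec, family)].append(doc_id)
    return buckets

family = make_family(dim=4, B=16)
docs = {"doc_1": [0.4, 0.3, 0.2, 0.1],
        "doc_2": [0.4, 0.3, 0.2, 0.1],   # identical vector -> same bucket
        "doc_3": [0.1, 0.2, 0.3, 0.4]}
buckets = build_buckets(docs, family)
```

Using B bits instead of one makes an accidental collision of dissimilar vectors exponentially less likely, which is the robustness argument made above.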
3.4 Metrics
In this section, we mainly introduce how to use this method on real-world data to calculate the similarity between hash tables. On this basis, we develop metrics for the accuracy and diversity of the recommender system to evaluate the performance of our proposed method. The framework of the experiments is shown in Figure 5. For data sources, we crawled a total of 4,000 abstracts of four categories from Web of Science to test and evaluate our method. First, the paper abstracts are cleaned to ensure data consistency, including deleting blank abstracts, garbled data and noise values. To ensure 1,000 abstracts in each category, we supplement the deleted quantity and keep 4,000 abstracts of the four types in the end. Then, at the text level, each abstract is further processed through text segmentation and stop-word removal. Text segmentation splits a sentence into words for subsequent operations; in this paper, the NLTK library in Python is used for word segmentation. On this basis, stop words, which are auxiliary words without specific meaning in a sentence, are removed using a stop-word list.
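The preprocessing above can be sketched as follows; to stay self-contained, this uses a regular-expression tokenizer and a small illustrative stop-word subset in place of NLTK's tokenizer and full stop-word list:

```python
import re

# illustrative subset of a stop-word list, not NLTK's full list
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for"}

def preprocess(abstract):
    """Lowercase, split into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return [t for t in tokens if t not in STOPWORDS]

preprocess("The diversity of the recommender system is improved.")
# -> ['diversity', 'recommender', 'system', 'improved']
```

The resulting token lists are what the LDA model in Section 3.1 takes as its bag-of-words input.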
3.4.1 Accuracy
As is known, similar topics are located in the same hash table based on the topic-word matrix after document merging. The hash table is a vector composed of 0-1 variables that is used to measure the similarity of recommendations. The cosine similarity in Formula (17) and the Pearson coefficient in Formula (18) are usually adopted to calculate the similarity between two vectors (Liu et al., 2014); the results of both lie in the range [-1, 1]. Because locality-sensitive hashing is a probability-based algorithm, the calculation results may be accidental. To minimize this error, our experiments require that the cosine similarity and the Pearson coefficient of the combined topic-word hash tables both be greater than 0.8:

(17) $\cos(A, B) = \dfrac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$

(18) $r(A, B) = \dfrac{\sum_i (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_i (A_i - \bar{A})^2}\,\sqrt{\sum_i (B_i - \bar{B})^2}}$
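A minimal sketch of Formulas (17) and (18) applied to 0-1 hash vectors, with the 0.8 double-threshold check used in the experiments (function names are illustrative):

```python
import math

def cosine(a, b):
    """Formula (17): cosine similarity of two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def pearson(a, b):
    """Formula (18): Pearson coefficient, i.e. the cosine of the
    mean-centred vectors (undefined for constant vectors)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return cosine([x - ma for x in a], [y - mb for y in b])

def similar(h1, h2, threshold=0.8):
    """Accept a pair only if both measures exceed the threshold."""
    return cosine(h1, h2) > threshold and pearson(h1, h2) > threshold
```

Requiring both measures to exceed 0.8 filters out pairs whose hash tables agree by chance in only a few positions.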
The hash table of the merged document topics obtained in Step 2 is the result of a single hash-function calculation. In this paper, Step 2 is repeated t times to decrease the error, and the similar set of a document is composed of the documents that occur more than t/2 times across the t repeated experiments. Formula (19) evaluates the accuracy of the algorithm. The accuracy of the recommendation system is obtained by (1) counting the number of similar item pairs in the top 50% most similar document sets, applying the floor function to ensure an integer count; (2) removing the minimum and the maximum of these counts; and (3) averaging by dividing the remaining sum by the total number of samples.(19)
3.4.2 Diversity
Recommendation diversity is gained from the similar document sets and the buckets produced by locality-sensitive hashing. If the corpus DOC contains more than one category, such as politics, economy, entertainment and science in Web of Science, then when finding documents similar to doc_i, some of the similar documents will differ from doc_i in category. Such a document is called a first-level diversity document: it differs from doc_i in category but is similar to doc_i under the subdivided topic. In the final result of locality-sensitive hashing, similar documents are enclosed in the same bucket. If doc_i and doc_k are in the same bucket but doc_k is not in the similar set of doc_i, then doc_k is a second-level diversity document of doc_i.
Based on the first-level and second-level diversity documents defined above, the diversity formula of the text recommendation system is given in Formula (20). To measure the diversity of the recommendations, we first calculate the number of diverse documents for each document in the sample, including first-level diversity documents (recommended directly) and second-level diversity documents (recommended through first-level documents). We then remove the maximum and minimum of these counts, sum the remaining counts and divide the sum by the total size of each document's similar document set. The quotient illustrates the overall diversity of the recommendation system.(20)
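Counting the two kinds of diversity documents for a single target document might be sketched as below; the data structures and names are illustrative assumptions, not the paper's implementation:

```python
def diversity_documents(doc, category, similar_sets, buckets):
    """category: doc -> first-level label; similar_sets: doc -> set of
    similar docs; buckets: doc -> set of docs sharing that doc's bucket."""
    # first-level: similar to `doc` but from a different first-level category
    first = {d for d in similar_sets[doc] if category[d] != category[doc]}
    # second-level: share a bucket with `doc` yet absent from its similar set
    second = {d for d in buckets[doc] if d != doc and d not in similar_sets[doc]}
    return first, second

category = {"d1": "science", "d2": "economy", "d3": "science", "d4": "politics"}
similar_sets = {"d1": {"d2", "d3"}}
buckets = {"d1": {"d1", "d2", "d3", "d4"}}
first, second = diversity_documents("d1", category, similar_sets, buckets)
```

Here "d2" is a first-level diversity document of "d1" (similar but from another category), while "d4" is second-level (same bucket, outside the similar set).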
Adomavicius and Kwon (2012) proposed an indicator called "diversity-in-top-N" to calculate the total number of different items in the recommendations for all users. When "diversity-in-top-N" is high, every user has received a unique recommendation list; when it is low, there is little difference in the recommendations received by users. In this paper, a category of documents is regarded as an aggregation. To judge whether the model's ability to recognize diverse recommendation documents differs across aggregations, this paper calculates the proportion of diverse documents of each aggregation in the total candidate documents, called "aggregate diversity", as defined in Formula (21). A high "aggregate diversity" means users have a high chance of receiving rich, diverse recommendations, while a low "aggregate diversity" means users have a lower chance of receiving diverse recommendations.(21)
4. Experiments and analysis
4.1 Parameter setting
In the LDA topic model, the number of topics contained in each document, the number of words under each topic and the number of iterations of word sampling are three variables that must be configured. Abstracts of journal papers, which are short and cover a small number of topics, are selected as experimental data in this study. Each abstract is set to contain at most three topics, corresponding to the background, methods and results of a typical paper. On this basis, experiments on the number of words and the number of iterations of LDA are conducted. To determine the most accurate configuration, a text recommendation accuracy experiment was performed on the abstracts of 1,000 documents from four subjects in the Web of Science database.
Since an abstract typically describes one or two topics, each abstract is set to contain at most three topics in this paper. For example, an abstract that examines medicine using computer technology contains two topics (medicine and computer). Because the accuracy of the LDA model varies across runs, we run the model with 100, 200, 300, 400 and 500 iterations and report the average accuracies in Figure 6. As the exact optimal number of iterations is hard to determine, we tried these five settings in our experiment; Figure 6 shows the best performance at 400 iterations. With an increase in the number of words and the number of iterations for a given topic, the accuracy rises and then falls to a stable level. The highest recommendation accuracy was obtained with 15 words and 400 iterations. Therefore, the model parameters are set in the "gensim" library in Python as follows: the number of topics is 3, the number of words is 15 and the number of iterations is 400.
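The parameter selection described above can be sketched as a grid search over (words, iterations) pairs; the accuracy values below are hypothetical stand-ins for the experimental results, not the numbers behind Figure 6.

```python
# Sketch: choosing the (number of words, iterations) pair with the
# highest recommendation accuracy. Accuracy values are hypothetical
# stand-ins for the experimental results.

accuracy = {
    (10, 300): 0.58, (10, 400): 0.60,
    (15, 300): 0.62, (15, 400): 0.66,  # best cell in this sketch
    (20, 300): 0.61, (20, 400): 0.63,
}

best = max(accuracy, key=accuracy.get)
print(best)  # (15, 400): 15 words per topic, 400 iterations

# The chosen parameters would then configure the model, e.g. with
# gensim: LdaModel(corpus, num_topics=3, iterations=400, ...) and
# show_topics(num_words=15) to inspect the top 15 words per topic.
```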
4.2 Analysis of experimental results
A specific text can be classified into different categories according to multiple domains such as politics, economy, sports, entertainment and technology. These categories are considered first-level categories in this study. Each first-level category has its own subdivisions; for example, sports can be subdivided into ball games, track and field, skill sports, chess and other fields. These sub-fields under the first-level categories constitute the second-level categories. Each second-level category, in turn, contains more specific categories; for instance, the field of ball games includes many sports, such as soccer, basketball, tennis and table tennis. These precise categories are considered third-level categories in this study. We collected text data for each category and used the model to conduct experiments to determine the recommendation performance for different categories of text.
Documents in first-level categories
In this study, 1,000 abstracts are randomly extracted from each of four first-level topics ("Infectious Diseases," "Agriculture," "Advertising," and "Chemistry") and used to determine the recommendation accuracy and diversity of the model for the first-level categories. As shown in Figure 7(a), the recommendation accuracy of the proposed model for the four selected first-level categories lies between 0.575 and 0.7, according to Formula (19). Of these categories, "Advertising" exhibits significantly greater recommendation accuracy than the other three topics, presumably because the number of sub-fields in "Advertising" is small (31 second-level topics) and the vocabulary related to "Advertising" is relatively concentrated (e.g. advertisement, customer satisfaction, targeted advertising, TV, promotion, brand, privacy, elections, collaborative filtering and big data). In contrast, the recommendation accuracy of "Infectious Diseases" is the lowest, as there are many types of infectious agents, such as bacteria and viruses, resulting in a large number of sub-fields (88 second-level topics) and scattered related words.
The abscissa in Figure 7(b) represents the level of mixing of the four different categories of documents, given by Formula (20). The accuracies are calculated from documents of a single type, whereas the diversity and aggregate diversity consider documents from all four subjects; this is why the x-axes differ among Figure 7(a), (b) and (c). The recommendation diversity of the proposed model for the selected first-level texts fluctuates between 0.70 and 0.76. The best performing topic in the accuracy experiment, "Advertising," has relatively low diversity, whereas the worst performing topic, "Infectious Diseases," has relatively high diversity. As such, the recommendation accuracy of first-level topics under this model shows a weak negative correlation (correlation coefficient r = −0.2857) with diversity. In fact, increasing the number of experimental samples has little effect on the accuracy and diversity of the categories, implying that there is no clear relationship between the number of topics and recommendation performance in the first-level category.
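Correlations of the kind reported here can be computed as follows; the article does not state which correlation measure was used, so this sketch uses the Pearson coefficient, and the per-topic accuracy and diversity values are hypothetical.

```python
import math

# Sketch: Pearson correlation between per-topic accuracy and
# diversity. The values below are hypothetical; the article does not
# specify which correlation measure produced r = -0.2857.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

acc = [0.70, 0.62, 0.58, 0.60]  # hypothetical per-topic accuracy
div = [0.70, 0.74, 0.76, 0.73]  # hypothetical per-topic diversity
print(round(pearson(acc, div), 4))  # negative: accuracy up, diversity down
```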
As illustrated in Figure 7(c), the aggregate diversity of the model ranges between 0.42 and 0.5 for the selected first-level topics; the ordering of the different topic curves and their trends with increasing text data volume are similar to those in Figure 7(a). The results show that, at the first level, although the proportion of diversified text identified by the model in the total dataset is low, accounting for only 40–50% of the total text, the proportion of diversified text in the similar list recommended to users is considerable, accounting for 70–80%.
Documents in second-level categories
In this section, "Advertising," which exhibits the highest accuracy above, is selected as the parent category of the second-level topics. It is classified into the following four research fields in the Web of Science: Business and Economy, Computer Science, Law and Social Science and Medicine. The abstracts of 1,000 papers are extracted from each research field as experimental data to determine the recommendation accuracy and diversity of the model at the second level.
As shown in Figure 8(a), the text recommendation accuracy of the model for the four second-level topics under "Advertising" ranges between 0.62 and 0.74. The text in the field of computer science research related to advertising exhibits the lowest recommendation accuracy, whereas the abstracts in the other three research fields show higher accuracy in the model. These results may be related to the many types of computer technology related to advertising, such as data mining, machine learning, cloud computing, big data, blockchain, deep learning, artificial intelligence and targeted advertising, resulting in a dispersion of the identified topics and vocabulary. In contrast, the applications in the other three fields may be more biased toward the study of advertising effectiveness, resulting in a more concentrated list of text topics and vocabulary, e.g. the commercial benefits associated with different advertisements and the relationship between medical-related advertisements and public health.
The model's recommendation diversity for the selected second-level categories is between 0.70 and 0.77, as shown in Figure 8(b). The advertising texts related to the medical field have lower recommendation diversity than the other three fields in the model. Evidently, the four result curves of recommendation diversity are arranged in opposite order to that of the accuracy curves, implying that there is a negative correlation (correlation coefficient r = −0.5714) between them.
As depicted in Figure 8(c), the aggregate diversity of the selected second-level topics in the model ranges between 0.46 and 0.55, which is higher than that at the first level. This shows that as the scope of the text category narrows, the number of diverse texts that the model can identify gradually increases. Among them, the text in the field of computer science related to advertising has the lowest aggregate diversity, whereas the abstracts of the other three research fields exhibit higher aggregate diversity in the model.
Documents in third-level categories
This study realizes the precise division of text categories through keyword searches in the Web of Science. Another 1,000 abstracts were randomly extracted for each of four third-level topics, namely, "blockchain," "COVID-19," "deep neural network," and "sentiment analysis," and used to determine the recommendation accuracy and diversity of the model for third-level topics.
As shown in Figure 9(a), the text recommendation accuracy of the model for the four third-level categories lies between 0.8 and 0.86. Clearly, the accuracy at the third level is better than that at the first and second levels. In detail, the topics of "COVID-19" and "blockchain" show relatively low accuracy, whereas the topics of "deep neural network" and "sentiment analysis" exhibit high accuracy in the model. It is speculated that "deep neural network" is a clear research direction in the computer field, mainly including six basic models, namely, stacked autoencoders, convolutional neural networks, deep learning on graphs, deep probabilistic neural networks, deep fuzzy neural networks and generative adversarial networks. Meanwhile, there are also a limited number of research methods under "sentiment analysis," mainly frequency-based and syntax-based methods along with supervised machine learning, unsupervised machine learning and hybrid approaches (Nazir et al., 2022). However, the first two categories contain numerous research directions and technologies with sparse topic distribution. For example, "COVID-19" research includes its structure, treatment methods, mutation directions, comparisons with other viruses and other pathology-related directions, as well as its impact on the economy, politics and society.
In Figure 9(b), the recommendation diversity of the third-level category texts in the model is shown to be 0.64–0.7. The topic "sentiment analysis" has the lowest diversity, whereas "COVID-19," "blockchain," and "deep neural network" exhibit higher recommendation diversity in the model. Different from the first-level and second-level categories, the correlation (correlation coefficient r = 0.0357) between the recommendation diversity and accuracy of the third-level category texts is not significant. Meanwhile, the increase in the number of third-level category texts has no impact on the recommendation accuracy but has a positive correlation (correlation coefficient r = 0.8929) with the improvement in recommendation diversity.
As exhibited in Figure 9(c), the aggregate diversity of the selected third-level topics in the proposed model ranges between 0.52 and 0.6, which is higher than that of the second-level topics, further indicating that as the scope of the text category narrows, the number of diverse texts that the model can identify increases. Of these categories, the documents pertaining to “deep neural network” exhibit the highest aggregate diversity, whereas text involving “COVID-19,” “blockchain,” and “sentiment analysis” has lower aggregate diversity. Meanwhile, there is a significant positive correlation (correlation coefficient r = 0.8929) between the quantity of text data and the aggregate diversity.
Comparing the above experimental results shows that the accuracy of text recommendation does not fluctuate greatly with the increase in experimental data across the three categories of texts. However, as the dataset grows, the diversity and aggregate diversity of text recommendations exhibit increasingly similar trends. For the first level, there is no correlation between diversity and data volume; for the second level, there is a weak positive correlation (r = 0.0357) between them; and for the third level, there is a significant positive correlation (r = 0.9370) between the recommendation diversity and data volume.
In addition, the model's recommendation accuracy for first-level, second-level and third-level category texts lies in (0.575, 0.7), (0.62, 0.74) and (0.8, 0.86); its recommendation diversity in (0.7, 0.76), (0.7, 0.77) and (0.64, 0.7); and its aggregate diversity in (0.42, 0.5), (0.46, 0.55) and (0.5, 0.6), respectively. Clearly, as the scope of the text category narrows, the model's recommendation accuracy and aggregate diversity slightly increase, while the recommendation diversity gradually decreases. This phenomenon indicates that the number of diverse texts identified by the model does not decrease as the scope of topics narrows; rather, its growth rate is lower than that of similar texts. Figure 10 compares the accuracy, diversity and aggregate diversity of each category of texts in the model for 1,000 texts. The horizontal line in the accuracy panel represents the recommendation result of the LDA method using cosine similarity.
By comparison, it is found that, for first-level topics, the model proposed in this study does not substantially differ in recommendation accuracy from the average of current content-based recommendation methods (CBR, average accuracy 0.61) (Adilaksa et al., 2020; Yao et al., 2015). The CBR benchmark methods use cosine similarity and weighted cosine similarity to recommend topics after applying TF-IDF and the LDA model. However, for more refined topics, our method can identify more similar topics and vocabulary, resulting in higher recommendation accuracy: the recommendation accuracy of the model for second-level and third-level category texts is approximately 10 and 20% higher, respectively, than that of the content-based recommendation method. Furthermore, an important contribution of our method is realizing diversity in recommender systems. It is concluded that the model proposed in this study has better recommendation accuracy and diversity for the four different types of topics (Infectious Diseases, Agriculture, Advertising and Chemistry) covered in this study.
5. Conclusions and future work
In this study, aiming to achieve both high accuracy and diversity in recommender systems, a hybrid method has been proposed by integrating the latent Dirichlet allocation model and a locality-sensitive hashing algorithm. The LDA model is first used to extract the topic distribution of the text and word distribution under the topic. The vocabulary probability between two topic vectors is merged to construct a data matrix. Then, the similarities within and between hash tables transformed from the matrix are calculated by using the locality-sensitive hashing algorithm to measure the diversity recommendation of the topics. The paper abstracts of four subjects in the Web of Science are crawled, yielding experimental data used to evaluate the effectiveness of our method. In particular, the paper categories are divided into three levels according to the degree of topic subdivision. Comparing the results of accuracy and diversity of recommender systems based on the metric functions applied in this study, we illustrate the superiority of our method and conclude several findings as follows.
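The hashing step summarized above can be sketched with random-projection locality-sensitive hashing: each hash function is a random hyperplane, a document's signature is the tuple of sign bits, and documents sharing a signature fall into the same bucket. The topic vectors and projection vectors below are randomly generated stand-ins, not the article's actual LSH family or data.

```python
import random

# Sketch: random-projection LSH over document topic vectors.
# Documents with the same sign-bit signature share a bucket;
# all vectors here are random stand-ins.

random.seed(0)
DIM, N_FUNCS = 8, 4

# One random projection vector (hyperplane normal) per hash function
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_FUNCS)]

def signature(vec):
    """Sign bit of the dot product with each hyperplane."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

# Hypothetical topic vectors for five documents
docs = {f"d{i}": [random.random() for _ in range(DIM)] for i in range(5)}

buckets = {}
for name, vec in docs.items():
    buckets.setdefault(signature(vec), []).append(name)

for sig, members in buckets.items():
    print(sig, members)  # documents in the same bucket are candidates
```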
First, the accuracy of recommender systems can be enhanced using our method when the topics are meticulously partitioned. The experimental results demonstrate that, in the first-level category of topics, the model's recommendation accuracy is not significantly different from the recommendation accuracy obtained using only the topic model. As the topics become more precise, the recommendation accuracies in the second-level and third-level categories increase by approximately 10 and 20%, respectively, over that of the benchmark method. This shows that the hybrid method explored in this study has advantages in handling big data.
Second, the diversity of recommender systems can be achieved without losing accuracy. At the three levels of topics, the diversities of recommender systems are found to be in the ranges (0.7, 0.76), (0.7, 0.77) and (0.64, 0.7), respectively. Clearly, the recommendation diversity remains at a high level after employing our method. In addition, the recommendation accuracy increases with the subdivision of topics across the three category levels, demonstrating the effectiveness of integrating the latent Dirichlet allocation model and the locality-sensitive hashing algorithm to achieve both high accuracy and diversity.
Third, recommendation diversity is shown to be more difficult to realize as topics become more detailed. For more refined category topics, the method does not achieve higher diversity recognition than for broader categories in the experimental texts from the Web of Science, and the effects of diversity recognition differ across topic types. The experimental results for the third-level category show that topic diversity exhibits a downward trend as the scope of the text category narrows. Therefore, the diversity and high accuracy of recommender systems should be balanced within a certain range of topics.
The main novelty of this study is the integration of the hashing method with a topic model to improve the topic diversity of recommender systems. Current approaches only focus on the accuracy of recommendation results, overlooking diversity in some fields that have high demand for topic recommendations (Adomavicius and Kwon, 2012; Jelodar et al., 2021; Fang et al., 2018). Locality-sensitive hashing algorithms are useful for finding similar data from massive high-dimensional datasets. This is the reason that we applied this method to upgrade the LDA model, which is known to augment the topic diversity of recommender systems in theory. The hybrid recommendation algorithm proposed in this study can overcome the dilemma of high accuracy and poor diversity in practice. It is used in our experiment to recommend diverse topics of journal abstracts in four subjects from the Web of Science, avoiding recommending topics only in the same subject. Furthermore, this method could improve the recommendations in business and service industries to address the problems of information overload and information cocooning.
In the future, three improvements are suggested as follows: (1) other dimensionality reduction algorithms should be considered to realize the processing of the text-topic–vocabulary merge matrix; (2) other text forms of longer length (e.g. news, novels, poems) should be used as experimental data; (3) deep learning algorithms, powerful tools to solve computational model complexity, should be adopted for greater amounts of text data.
This work is supported by the National Natural Science Foundation of China [Grant No. 71871053] and the Fundamental Research Funds for the Central Universities. The authors thank the anonymous referees for their constructive and valuable comments, which substantially helped improve the quality of this paper.
Figure 1
Graph structure of LDA based on the work of Blei et al. (2003)
[Figure omitted. See PDF]
Figure 2
LSH-based text categorization process
[Figure omitted. See PDF]
Figure 3
Pseudocode for generating and merging topics
[Figure omitted. See PDF]
Figure 4
Pseudocode for calculating and building LSH table
[Figure omitted. See PDF]
Figure 5
Framework of calculating text similarity with our method
[Figure omitted. See PDF]
Figure 6
Results of recommendation accuracy with different iterations
[Figure omitted. See PDF]
Figure 7
Accuracy and diversity of recommendation of first-level document
[Figure omitted. See PDF]
Figure 8
Accuracy and diversity of recommendations in second-level documents
[Figure omitted. See PDF]
Figure 9
Accuracy and diversity of recommendations of third-level documents
[Figure omitted. See PDF]
Figure 10
Accuracy and diversity of texts in different categories
[Figure omitted. See PDF]
Table 1
Definitions of all variables in this paper
| Symbol | Definition |
|---|---|
| DOC | A collection of all documents |
| | The i-th document in DOC |
| | Distribution of topics in the |
| | Distribution of the topic in the |
| | The r-th topic vocabulary of the |
| | The probability of the |
| | Distribution of combined with |
| | Distribution of the j-th topic in the |
| | A LSH table of DOC |
| H(•) | The LSH function family |
| | A LSH function in H(•) |
| | The vector family |
| | The vector used for judgment in each function |
| | The number of documents in DOC |
| | The number of topics in |
| | The number of functions in H(•) |
Source(s): Table by authors
© Emerald Publishing Limited.
