Introduction
Tensors are multidimensional arrays that naturally represent multi-mode or multi-dimensional information. Tensor decomposition gives us a structured way to analyse this information: it minimizes the reconstruction error of the observed tensor values to estimate a set of latent factors. From these factors, we can discover hidden patterns in data; at the same time, the factors can serve as informative features in downstream tasks. Tensor decomposition is therefore popular in many practical applications, such as weather prediction, treatment effect prediction, and recommendation systems. For example, we can build a three-mode tensor from people’s fitness records of (user, sport type, altitude), or a four-mode tensor from patients’ vaccination status, such as (area, patient, brand, time). More traditionally, we could take user ratings from an online platform and build tensor data, ranging from movies (MovieLens) to games (Steam video games).
CANDECOMP/PARAFAC (CP) is a widely used tensor decomposition method [1]. It splits a high-dimensional tensor into a sum of R rank-1 tensors, each formed as an outer product of vectors, so that every tensor node has the same number R of potential factors; the r-th rank-1 component is obtained from the r-th latent factor of each involved node. In this way, we can significantly reduce the number of parameters. Two other commonly used decomposition methods are Tucker decomposition [2] and block term decomposition (BTD) [3]. The former is a more elaborate version of CP and can be seen as a higher-order analogue of principal component analysis (PCA): it decomposes a tensor into a core tensor multiplied by a factor matrix along each mode. The difference between CP and Tucker lies in the size of the core tensor: the Tucker core has size R1 × R2 × R3 × … × Rn, while the CP core has size R × R × R × … × R and is superdiagonal. BTD can be seen as a further generalization of both CP and Tucker decomposition: a rank-(Lr, Mr, Nr) BTD decomposes a tensor into R block terms, each of which is itself a rank-(Lr, Mr, Nr) Tucker decomposition. When R = 1, there is only one block term and the decomposition reduces to a Tucker decomposition; when each block term is rank-1, it reduces to a CP decomposition.
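To make the structural contrast concrete, the following numpy sketch reconstructs a tensor from CP factors and from a Tucker core. The sizes and arrays are purely illustrative assumptions, not values from the paper:

```python
import numpy as np

# A hypothetical 3-mode tensor of size 4 x 5 x 6 with CP rank R = 3.
d1, d2, d3, R = 4, 5, 6, 3
A = np.random.rand(d1, R)   # factor matrix for mode 1
B = np.random.rand(d2, R)   # factor matrix for mode 2
C = np.random.rand(d3, R)   # factor matrix for mode 3

# CP: the tensor is the sum over r of rank-1 outer products a_r * b_r * c_r.
X_cp = np.einsum('ir,jr,kr->ijk', A, B, C)

# Tucker, by contrast, keeps a dense R1 x R2 x R3 core tensor G.
R1, R2, R3 = 2, 3, 4
G = np.random.rand(R1, R2, R3)
A_t, B_t, C_t = np.random.rand(d1, R1), np.random.rand(d2, R2), np.random.rand(d3, R3)
X_tucker = np.einsum('pqr,ip,jq,kr->ijk', G, A_t, B_t, C_t)

# CP is the special case where the core is superdiagonal (R1 = R2 = R3 = R,
# G[r, r, r] = 1), which is why CP has far fewer parameters.
```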
As S1(a) Fig shows, CP decomposition represents the original tensor by a set of factor vectors alone, while Tucker decomposition in S1(b) Fig also retains a core tensor with the same number of modes as the original tensor; the size of this core grows with the dimensionality of the original tensor, making the decomposed factors more complex. Despite the generalization ability that their higher complexity brings, the relatively large core tensors in Tucker and BTD make model estimation much more expensive and cumbersome than CP, so CP decomposition is the more economical and fast solution. In addition, when a batch of new entries is received, re-estimating the core tensor is very costly, let alone a rather large and indeterminate one.
Classical tensor decomposition methods use alternating least squares (ALS) [4] or gradient-based optimization [5] to update the embeddings. However, these methods only work on static data, and they need the whole dataset to perform each update. Processing tensor data in a streaming fashion has recently become a trend due to its high time efficiency [6, 7]. In this scenario, the goal is to compute the posterior distribution of the embeddings as the data arrive. There are several ways of computing posterior distributions, such as Markov chain Monte Carlo (MCMC) [8], Laplace’s method [9], and variational Bayes (or variational inference) [10]. Assumed density filtering (ADF) is a sequential, one-pass method for calculating an approximate posterior distribution [11]. Its key advantage is that it does not require linearization of the system model, which makes it suitable for highly nonlinear systems.
However, because it requires integrals over high-dimensional spaces, ADF can be difficult to apply to high-dimensional systems or under limited computational resources. In addition, due to its sequential nature, information discarded early may become important later; ADF is also sensitive to the ordering of observations, which is undesirable in a batch processing context. Finally, with limited computing resources, it can be difficult for ADF to achieve both fast convergence and high accuracy.
In summary, approaches that use tensor factorization in recommendation algorithms face several problems:
1. Facing today’s real-time streaming data on the Internet, methods based on static data usually take a long time to retrain or update the model and cannot respond in time. For example, the product recommendation system of an e-commerce platform needs to provide personalized recommendations based on users’ real-time browsing and purchase behavior. If the recommendation system is built only on a user’s past purchase history, it may not reflect the user’s latest interests and needs, which affects the user experience and the platform’s sales performance. Similarly, companies need to monitor changes in sentiment on social media in order to respond to public opinion in a timely manner when conducting brand management. Sentiment analysis models based on static datasets may not capture mood swings caused by unexpected events or hot topics, leaving enterprises unable to adjust market strategies in time.
2. Methods such as deep learning lack interpretability and thus cannot explain their recommendations to the user. For example, in the medical field, if a model does not explain why an image is identified as a tumor, doctors may be skeptical of its predictions, and this opacity may lead them to distrust AI-assisted systems, thereby affecting patient care. Likewise, a user may be recommended a completely unrelated item after browsing fitness products, and if the user does not understand the logic behind the recommendation, they may doubt the effectiveness of the recommendation system.
3. The accuracy of the algorithms needs to improve. Tensor factorization models may be sensitive to noise in the data, which can lead to inaccurate recommendations; moreover, the accuracy of the underlying numerical method itself must improve to match increasingly diverse user needs. For example, if the dataset contains false or unrepresentative ratings (noisy data), such as an extreme rating given by a user because of a personal mood swing, a tensor decomposition model may incorporate these extreme ratings into the analysis, capture the wrong user preferences, and thus recommend irrelevant movies. As for algorithmic accuracy, an e-commerce platform provides users with personalized product recommendations through its recommendation system; if the algorithm has low precision and cannot accurately predict user preferences, users cannot quickly find the goods they need, degrading the user experience and affecting sales and customer satisfaction.
To address these problems, we develop BS-CP, a Bayesian streaming CP decomposition approach. We use the moment matching method of [12] to handle the intractable moment calculations in the ADF framework, and develop a one-shot incremental posterior update for the latent factors upon receiving each streaming batch. Unlike black-box deep learning, our approach is based on Bayesian probabilistic inference, so the algorithm is interpretable; an e-commerce recommendation system built on it, for example, can explain the reason for a recommendation to the user. The method is computationally efficient, approximating the state estimate by matching only the first- and second-order moments, and the state estimation of ADF can be further improved by matching higher-order moments. The structural diagram of the BS-CP algorithm is presented in S2 Fig. We name the straightforward implementation BS-CP1. Furthermore, noticing that an exponential integral must be evaluated in each posterior update, we use Gauss-Laguerre quadrature [13] to simplify the computation. Gauss-Laguerre quadrature uses specific weights and nodes, pre-calculated to minimize integration error, and therefore achieves high integration accuracy with little computation, making the integration simpler and more stable. For integrands involving exponential functions, Gauss-Laguerre quadrature converges quickly, so it is usually faster than other numerical integration methods on such integrals. We summarize this variant as BS-CP2.
To evaluate our method, we used four real-world datasets to examine BS-CP, including both small and large tensors. For comparison, we chose the state-of-the-art streaming decomposition algorithms POST [6] and BASS-Tucker [14], and the continuous-time streaming algorithm CT-CP [15]. In terms of both running time and final predictive performance, our approach consistently outperforms the competing methods. In addition, in terms of accuracy, the BS-CP algorithm is more stable than the competing methods.
Related work
Tensor decomposition
Many methods have been devoted to computing tensor decompositions. As a classic factorization technique, CP decomposition [1] has been studied from many angles, such as CP-ALS [16–18], MAST [19], and POST [6]. Tucker decomposition is also widely used, for example Tucker-ALS [4], pTucker [20], InfTucker [21, 22], BPTD [23], TGP [24], and BASS-Tucker [14]. There are other decomposition methods as well, such as singular value decomposition (SVD) [25], tensor train decomposition (TTD) [26], and tensor ring decomposition (TRD) [27]. Although many methods have been used for tensor decomposition, CP and Tucker are the most widely adopted, and between the two, CP decomposition is used more frequently due to its elegant structure and good interpretability. Some works aim at solving the reconstruction problem of the CP decomposition model; for example, Bai et al. [28] developed a new relaxed augmented Lagrangian method for solving a series of convex optimization problems subject to equality or inequality constraints, and the augmented Lagrangian method or the splitting method involved can also solve the reformulated CP tensor decomposition problem. The aforementioned works include both static [4, 17–21, 23] and streaming [6, 7, 26] methods. Streaming data processing is more practical in current recommendation systems because of its timeliness. For example, apps like Instagram do not store data after it has been accessed, which means we have no access to any data other than what is available at this moment and must update the embeddings and the model immediately from the currently received data. It is therefore necessary to construct a good streaming data processing method that can be applied to a wider range of recommendation systems. We list the published information on tensor decomposition algorithms in Table 1.
[Table 1 omitted. See PDF.]
Bayes streaming algorithm
Streaming algorithms [29] are a class of data processing methods that emphasize the real-time, continuous nature of data: data are continuously fed into the system as a stream and processed immediately. This approach suits large-scale, high-speed data flows and scenarios requiring real-time analysis and feedback. Bayesian streaming data processing [30] is a statistical learning technique for real-time data streams that combines the principles of Bayesian inference and online (i.e., real-time) learning. It involves a range of related work, such as the application of Bayesian networks in real-time systems, the development of online learning algorithms, and time series analysis for streaming data. Several recent papers add a continuous-time factor to streaming Bayesian data processing: Zhang et al. model the CP coefficients as functions of time using polynomial splines, yielding the dynamic tensor recommender system (CT-CP) [15]; BCTT [7], proposed by Fang et al., conducts continuous-time Tucker decomposition and models the tensor core as a function of time. Adding continuous time can capture complex temporal relationships between tensor nodes, but the accuracy of these methods is not good enough; we later compare the representative algorithm CT-CP with our model. The concept of Bayesian streaming data and the corresponding processing techniques have been discussed in many articles, such as Mardani et al. [31], Zhou et al. [32], POST, and BASS. To minimize the KL divergence between the true and estimated posteriors, following Thomas P. Minka [33], there are several solutions: Laplace’s method [9], Markov chain Monte Carlo (MCMC) [8], variational Bayes (or variational inference) [10] (streaming variational Bayes (SVB) in the streaming setting [34]), and assumed density filtering (ADF) [11]. In ADF, observations are processed one by one, and the posterior distribution is updated to an approximation before the next observation is processed. Minka proposed expectation propagation (EP) [33], an extension of ADF that removes its sequential nature. For the moment calculations, the moment matching method [35] was proposed. However, moments can sometimes be difficult to compute; to solve this problem, Wang and Zhe [36] proposed conditional EP (CEP), using conditional moment matching and Taylor approximation. Similarly, our algorithm improves on moment matching by using Gauss-Laguerre quadrature [13] to compute the hard-to-evaluate normalizer that appears in the moment match; the orthogonality underlying Gauss-Laguerre quadrature simplifies the exponential integral and improves the convergence speed and accuracy of the whole model.
BS-CP compared to other methods
Most of the above methods use CP or Tucker decomposition, along with methods based on SVD, tensor train, and tensor ring decomposition (STTD [26] and STRD [27]). Notably, CP and Tucker decomposition are popular due to their intuitive mathematical formulations and stable numerical behavior. Specifically, CP decomposition represents a tensor as the sum of multiple rank-1 tensors, providing a more streamlined model and better processing efficiency; our BS-CP method therefore utilizes CP decomposition. By contrast, the Tucker-ALS, pTucker, InfTucker, BPTD, TGP, and BASS-Tucker algorithms use the Tucker framework and require more complex post-decomposition processing to suit static analysis and streaming applications. An online e-commerce recommendation system must process real-time dynamic data, rather than only static datasets (Tucker-ALS, CP-ALS, pTucker, CP-WOPT, InfTucker, BPTD, TGP, MAST) that require a large amount of storage capacity. BS-CP therefore adopts a streaming data processing approach to effectively address contemporary needs.
Compared with traditional batch learning, our BS-CP method uses Bayesian online data processing, which helps it adapt to new information in real time. Compared with other online learning methods, Bayesian methods provide probabilistic interpretations of model parameters while quantifying the uncertainty of predictions. Frequentist statistical methods often lack the ability to integrate prior knowledge and empirical data at the same time; the advantage of Bayesian methods is that they naturally blend the two. In addition, compared to deep learning frameworks with opaque internal mechanisms, Bayesian online strategies provide clear probabilistic insight into the behavior of the parameters.
The ADF method we employ allows greater flexibility across different posterior shapes than the Laplace approximation, and it avoids the iterative sampling that MCMC requires for posterior approximation. This makes ADF particularly suitable for nonparametric models, whereas VB/VI tends to favor parametric structures. Moreover, compared with SVB methods, ADF facilitates more seamless online learning through continuous model updates as new data become available, a process that is complicated in SVB frameworks such as the POST algorithm. Extensions such as EP and CEP [36] build on ADF; although CEP employs moment matching and Taylor approximation techniques similar to ours, the mathematical complexity of the latter makes implementation challenging, while BS-CP uses Gauss-Laguerre orthogonality to simplify the exponential calculation.
Recently developed methods that integrate temporal information into tensor decomposition (e.g., CT-CP [15] and BCTT [7]) have demonstrated improved capability for capturing trends and patterns in time-varying datasets, as illustrated by applications such as stock market analysis or financial forecasting, where temporal dynamics play a crucial role. However, in scenarios where time is not the most important factor, or where the dataset has no temporal feature, the relevance of time considerations can be significantly reduced, depending on the particular application context and the underlying data attributes.
CP tensor decomposition model
Consider a K-mode tensor $\mathcal{Y} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$. We aim to estimate K factor matrices, where each object j in mode k is represented by an R-dimensional embedding vector $\mathbf{u}^k_j$. A $d_k \times R$ embedding matrix $\mathbf{U}^k$ can then be built by stacking all the embedding vectors in mode k. Suppose we have collected a dataset $\mathcal{D} = \{(\mathbf{i}_n, y_n)\}_{n=1}^N$ of real-valued tensor entry values, where each $\mathbf{i}_n = (i_{n1}, \ldots, i_{nK})$. From the vectorized form we can easily obtain, for each observed entry $\mathbf{i}_n$,
$$y_n \approx \mathbf{1}^\top \left( \mathbf{u}^1_{i_{n1}} \circ \mathbf{u}^2_{i_{n2}} \circ \cdots \circ \mathbf{u}^K_{i_{nK}} \right), \tag{1}$$
where $\mathbf{1}$ is the vector full of ones and $\circ$ performs element-wise multiplication.
Based on CANDECOMP/PARAFAC (CP) decomposition, we use a Bayesian model to generate the continuous observed tensor entries. The Bayesian network diagram of CP decomposition is shown in S4 Fig. We place a standard Gaussian prior over the embeddings,
$$p(\mathcal{U}) = \prod_{k=1}^{K} \prod_{j=1}^{d_k} \mathcal{N}\!\left(\mathbf{u}^k_j \mid \mathbf{0}, \mathbf{I}\right), \tag{2}$$
where $\mathcal{U} = \{\mathbf{U}^1, \ldots, \mathbf{U}^K\}$. Each embedding vector is first sampled from this Gaussian prior. We model the likelihood for each data point n with
$$p(y_n \mid \mathcal{U}, \tau) = \mathcal{N}\!\left(y_n \mid \mathbf{1}^\top \left( \mathbf{u}^1_{i_{n1}} \circ \cdots \circ \mathbf{u}^K_{i_{nK}} \right), \tau^{-1}\right), \tag{3}$$
where τ is the inverse variance. We further place a Gamma prior over τ,
$$p(\tau) = \operatorname{Gam}(\tau \mid a_0, b_0), \tag{4}$$
where we can set $a_0 = b_0 = 10^{-3}$ to obtain a diffuse prior. The joint probability distribution is then
$$p(\mathcal{U}, \tau, \mathbf{y}) = p(\tau)\, p(\mathcal{U}) \prod_{n=1}^{N} p(y_n \mid \mathcal{U}, \tau). \tag{5}$$
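A minimal numpy sketch of this generative model, assuming illustrative sizes; the diffuse Gamma prior (4) is noted, but τ is held at 1 here, as in the BS-CP1 setting later:

```python
import numpy as np

rng = np.random.default_rng(0)
dims, R = [4, 5, 6], 3           # illustrative sizes, not from the paper
a0 = b0 = 1e-3                   # diffuse Gamma prior (4) over tau

# (2): standard Gaussian prior over every embedding vector u_j^k.
U = [rng.standard_normal((d, R)) for d in dims]

def cp_value(U, idx):
    """Eq. (1): 1^T (u_{i1}^1 o ... o u_{iK}^K), o = element-wise product."""
    prod = np.ones(R)
    for k, i in enumerate(idx):
        prod *= U[k][i]
    return prod.sum()

# (3): a noisy observation of entry i_n = (1, 2, 3), noise variance 1/tau.
tau = 1.0
y_n = cp_value(U, (1, 2, 3)) + rng.normal(scale=tau ** -0.5)
```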
Online Bayesian streaming algorithm
Streaming structure
Now consider data produced in a streaming fashion, as shown in S3 Fig. Denote by $\mathcal{D}_{n-1}$ the data accumulated up to step $n-1$, and by $q(\mathcal{U}, \tau \mid \mathcal{D}_{n-1})$ the posterior estimated by our model. For now, assume that at step n we receive a single entry $(\mathbf{i}_n, y_n)$ (we will extend this to a batch of entries later). Our goal is then to update the posterior, which according to Bayes’ rule is
$$p(\mathcal{U}, \tau \mid \mathcal{D}_n) \propto q(\mathcal{U}, \tau \mid \mathcal{D}_{n-1})\, p(y_n \mid \mathcal{U}, \tau). \tag{6}$$
To enable a tractable posterior update, we introduce a fully factorized approximation to the current posterior, consistent with the prior,
$$q(\mathcal{U}, \tau) = \operatorname{Gam}(\tau \mid a, b) \prod_{k=1}^{K} \prod_{j=1}^{d_k} \prod_{r=1}^{R} \mathcal{N}\!\left(u^k_{jr} \mid \mu^k_{jr}, v^k_{jr}\right). \tag{7}$$
Obviously, when n = 1 we have $a = a_0$, $b = b_0$, $\mu^k_{jr} = 0$, and $v^k_{jr} = 1$. Now we can write the RHS of (6),
$$q(\mathcal{U}, \tau)\, \mathcal{N}\!\left(y_n \mid \mathbf{1}^\top \left( \mathbf{u}^1_{i_{n1}} \circ \cdots \circ \mathbf{u}^K_{i_{nK}} \right), \tau^{-1}\right). \tag{8}$$
Note that only the factors involved in the observed entry $\mathbf{i}_n$ get their posterior updated; for the other factors, the running posterior remains the same. We use the assumed density filtering framework: we project the RHS of (8) into the exponential family to obtain an approximation of the posterior with the same factorized form as (7). Following Minka’s Ph.D. thesis [12], what we need to do is compute the normalizer,
$$Z_n = \int q(\mathcal{U}, \tau)\, p(y_n \mid \mathcal{U}, \tau)\, \mathrm{d}\mathcal{U}\, \mathrm{d}\tau, \tag{9}$$
and then apply the following formulas to each involved Gaussian factor with mean μ and variance v,
$$\mu^{*} = \mu + v\, \frac{\partial \log Z_n}{\partial \mu}, \qquad v^{*} = v - v^2 \left[ \left( \frac{\partial \log Z_n}{\partial \mu} \right)^{2} - 2\, \frac{\partial \log Z_n}{\partial v} \right]. \tag{10}$$
We then set the above as the updated mean and variance of the running posterior, namely,
$$q^{*}\!\left(u^k_{jr}\right) = \mathcal{N}\!\left(u^k_{jr} \mid \mu^{*}, v^{*}\right). \tag{11}$$
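The sketch below illustrates this projection step (9)–(11) for a single one-dimensional Gaussian factor. It moment-matches the tilted distribution numerically on a grid rather than through the derivative formulas in (10); this is our own illustrative implementation, not the paper’s code:

```python
import numpy as np

def adf_update_1d(m, v, loglik, radius=8.0, n_grid=2001):
    """Project q(theta) * lik(theta) back to a Gaussian by matching
    its first and second moments; the grid sum stands in for (9)."""
    s = np.sqrt(v)
    theta = np.linspace(m - radius * s, m + radius * s, n_grid)
    dx = theta[1] - theta[0]
    log_q = -0.5 * np.log(2 * np.pi * v) - 0.5 * (theta - m) ** 2 / v
    tilted = np.exp(log_q + loglik(theta))
    Z = tilted.sum() * dx                      # normalizer, cf. Eq. (9)
    m_new = (theta * tilted).sum() * dx / Z    # matched first moment
    e2 = (theta ** 2 * tilted).sum() * dx / Z  # matched second moment
    return m_new, e2 - m_new ** 2, Z

# Conjugate sanity check: prior N(0, 1), Gaussian likelihood with y = 1.2
# and unit noise variance; the exact posterior is N(0.6, 0.5).
m_new, v_new, Z = adf_update_1d(0.0, 1.0, lambda t: -0.5 * (1.2 - t) ** 2)
```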
The posterior of τ is
$$q^{*}(\tau) = \operatorname{Gam}(\tau \mid a^{*}, b^{*}). \tag{12}$$
To obtain it, we need the first and second moments of τ, with which we match a new Gamma distribution. Specifically, given the tilted distribution (8), we marginalize out the latent factors and compute $\mathbb{E}[\tau]$ and $\mathbb{E}[\tau^2]$. Then, since a $\operatorname{Gam}(a, b)$ distribution has mean $a/b$ and variance $a/b^2$, we obtain
$$a^{*} = \frac{\mathbb{E}[\tau]^2}{\mathbb{E}[\tau^2] - \mathbb{E}[\tau]^2}, \qquad b^{*} = \frac{\mathbb{E}[\tau]}{\mathbb{E}[\tau^2] - \mathbb{E}[\tau]^2}. \tag{13}$$
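A small sketch of the Gamma moment match in (13); this helper and its shape/rate parameterization are our own illustration:

```python
def match_gamma(m1, m2):
    """Eq. (13): fit Gam(a, b) (shape a, rate b) to E[tau] = m1 and
    E[tau^2] = m2, using mean = a/b and variance = a/b^2."""
    var = m2 - m1 ** 2
    return m1 ** 2 / var, m1 / var

# Sanity check: Gam(3, 2) has E[tau] = 1.5 and E[tau^2] = 3.0,
# and match_gamma(1.5, 3.0) indeed returns (3.0, 2.0).
```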
The normalizer (9), however, does not have a closed form. To address this problem, we use moment matching to construct a Gaussian posterior approximation of (14).
That is, we match the first and second moments, where (15) defines the expectation operator. To compute the first- and second-order moments, we first compute the first-order moments (16).
Next we compute the second-order moments (17), where tr denotes the trace operation.
Let us first fix τ and marginalize out the Gamma factor; the normalizer (9) then turns into (18).
Since the marginal distribution of $y_n$ equals (18), and at the same time this marginal is the integral of the joint density over the latent factors, we obtain (19).
In this way, we can update the posterior in a simple and efficient fashion, without the complex Kullback-Leibler (KL) divergence minimization of the variational inference framework [37] used by POST [6]. The online Bayesian streaming method with τ fixed (BS-CP1) is shown in Algorithm 1.
Algorithm 1 Streaming algorithm BS-CP1
1: while A new set of tensor entries arrives do
2: For each brand-new object n at each mode k, initialize .
3: repeat
4: Update each associated with , i.e., j ∈ [1, dk], r ∈ [1, R] and k ∈ [1, K], first update using (19), then use Eqs (10), (11) and (15).
5: until Convergence has been reached, or the maximum number of iterations has been reached.
6: Set each for 1 ≤ j ≤ dk, 1 ≤ k ≤ K.
7: end while
8: return The posteriors of the embedding vectors for each object that have been seen, 1 ≤ k ≤ K.
Gauss-Laguerre quadrature method
Now, let us take the Gamma factor into consideration. To compute the normalizer (9), we have (20).
Even though we use moment matching, the above Eq (20) is still tricky to obtain: it involves an integral containing exponential terms, which requires complex computation. Gauss-Laguerre quadrature [13] is a Gaussian quadrature method for numerical integration, a generalization of Gaussian quadrature specialized to integrals over the semi-infinite interval [0, +∞). It uses the orthogonality of the Laguerre polynomials to improve the accuracy and efficiency of integration. For these reasons, we use Gauss-Laguerre quadrature to accelerate convergence and improve calculation accuracy. We first transform the equation into the form (21) of an integral of $e^{-s} f(s)$ over $[0, +\infty)$, which we can compute with Gauss-Laguerre quadrature,
$$\int_0^{\infty} e^{-s} f(s)\, \mathrm{d}s \approx \sum_{j=1}^{n} w_j\, f(s_j), \tag{22}$$
where $w_j$ and $s_j$ are the quadrature weights and nodes.
This raises two questions: how to select the nodes and how to set the corresponding weights. Choosing the nodes and weights of the Gauss-Laguerre rule is tricky; the number of nodes n can be chosen via Laguerre’s formula [13], but this choice is not always optimal and requires some trial. Fortunately, the commonly used Gauss-Laguerre quadrature nodes and weights are tabulated [38]. We try values from n = 1 up to n = 5. The larger n is, the higher the accuracy; but if f(s) is not very complicated, as in our case, a larger n only increases the computation with almost no improvement in accuracy.
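A short sketch of the quadrature step (22), using the pre-computed nodes and weights shipped with numpy; relying on numpy here is our own tooling choice, whereas the paper cites the tabulated values in [38]:

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

def gl_integral(f, n=5):
    """Approximate int_0^inf e^{-s} f(s) ds with n-point
    Gauss-Laguerre quadrature, as in Eq. (22)."""
    s, w = laggauss(n)      # nodes s_j and weights w_j
    return np.dot(w, f(s))

# Sanity checks: int_0^inf e^{-s} ds = 1 and int_0^inf s e^{-s} ds = 1.
print(gl_integral(lambda s: np.ones_like(s)))  # ~1.0
print(gl_integral(lambda s: s))                # ~1.0
```

An n-point rule integrates polynomials exactly up to degree 2n − 1, which is consistent with small n sufficing for a simple f(s).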
To ensure numerical stability, we observe (23).
Now we can compute (20) in a simple and stable way.
Returning to the posterior of τ, we also need the moments $\mathbb{E}[\tau]$ and $\mathbb{E}[\tau^2]$. To obtain them, all we need to do is run GL quadrature twice more (24); using the same change of variables, we obtain (25).
We can use the same computational trick as above to compute these two moments, and get (26).
Now, we can use (11) to update the posterior and the factors associated with the entry.
By using the Gauss-Laguerre quadrature method, the accuracy is improved and the model converges more quickly; it is more efficient and reliable, a further improvement on our original method. The Gauss-Laguerre-quadrature-fashioned online Bayesian streaming structure (BS-CP2) is summarized in Algorithm 2.
Our algorithm is a model-based recommendation algorithm that tries to quantify how much a user will like an item they have not encountered before. To achieve this, we use tensor decomposition and Bayesian posterior inference to train embedding vectors for items (for a specific user) and then build a model to predict the user’s score for a new item. The principle of the CP tensor decomposition algorithm is clear and intuitive, and Bayesian probabilistic inference models are familiar from daily life, so BS-CP is easy to explain to users and has good interpretability.
Algorithm 2 Streaming algorithm BS-CP2
1: Initialize p(τ) with an uninformative Gamma distribution, p(τ) = Gam(τ|10−3, 10−3).
2: while A new set of tensor entries arrives do
3: For each brand-new object n at each mode k, initialize .
4: repeat
5: Update each associated with , i.e., j ∈ [1, dk], r ∈ [1, R] and k ∈ [1, K], first update using (21), (22), (23), then use Eqs (10), (11) and (15).
6: Update q*(τ) using (12), (13), (26).
7: until Convergence has been reached, or the maximum number of iterations has been reached.
8: Set each for 1 ≤ j ≤ dk, 1 ≤ k ≤ K. Set q(τ) = q*(τ).
9: end while
10: return The posteriors of the embedding vectors for each object that have been seen, 1 ≤ k ≤ K.
Complexity
The per-batch time complexity of BS-CP1 depends on the rank r; its space complexity is that required to store the posterior means and covariances of the latent factors. The time complexity of BS-CP2 additionally depends on n, the number of nodes of the Gauss-Laguerre quadrature, which is fixed in each update step. Its space complexity covers the embedding posteriors, the data batch, the Gauss-Laguerre nodes, and the posterior q(τ). It follows that the time complexity of BS-CP2 is worse than that of BS-CP1, but its space complexity is better, owing to the Gauss-Laguerre quadrature approximation of the integral.
Experiment
Model comparison
The methods we chose for comparison are the latest algorithms with structures similar to our method; we exclude non-universal methods designed for specific situations or specific classes of problems (such as time parameters participating in the tensor decomposition, or sparse-data settings). Our method, based on Bayesian streaming data processing, can be used in most application scenarios (except extreme conditions), so we also choose the latest SOTA methods applicable to most scenarios for comparison. We compare our approach to the following three categories of methods:
1. the state-of-the-art (SOTA) multilinear streaming tensor decomposition algorithm:
* POST [6], which is based on SVB and uses CP decomposition.
* BASS-Tucker [14], which uses Tucker decomposition, places a spike-and-slab prior over the core tensor elements, and builds the model on ADF and moment matching.
2. SOTA static multilinear decomposition methods:
1. CP-WOPT [5], a CP decomposition via the conjugate gradient descent method.
2. CP-ALS [16], CP decomposition via alternating least squares updates, which is widely used.
3. Tucker-ALS [4], Tucker decomposition via alternating least squares updates.
3. SOTA continuous-time multilinear tensor decomposition algorithm:
1. CT-CP [15], in which a time tensor value function integrates time and context information, and a time-varying coefficient model of the time tensor decomposition is built by polynomial spline approximation.
Datasets
We evaluated BS-CP1 and BS-CP2 in four real-world applications. The information of our dataset is summarized in Table 2.
1. Beijing-15k [39], which includes six pollutants from 12 state-controlled air quality monitoring sites from 2013 to 2017; the air quality data were obtained from the Beijing Municipal Environmental Monitoring Center. We randomly selected 15K continuous measurements and their time stamps for training and testing. It is a real-valued four-way tensor of size 12 × 6 × 12 × 1461.
2. Beijing-20k [39], which, compared with Beijing-15k, drops one mode and simplifies the dataset. For training and testing, we randomly selected 20K continuous measurements. It is a real-valued three-way tensor of size 12 × 6 × 315.
3. MovieLens-1m [40], a two-mode continuous tensor of size 6040 × 3706, consisting of (user, movie) ratings. The number of observed elements is 1,000,209.
4. Fit Record [41], an EndoMondo log of users’ outdoor exercise activities. We extracted the interactions between 500 users, 20 exercise types, and 50 altitudes, giving a tensor of size 500 × 20 × 50 × 101. The observed value is the user’s heart rate during exercise, measured over 4,000 to 6,000 seconds. We collected 49,978 heart rates and time stamps in total.
[Table 2 omitted. See PDF.]
Settings
We implemented BS-CP1 and BS-CP2 in PyTorch [42] on Windows, with the following hardware: Intel(R) Core(TM) i9-14900HX 2.20 GHz CPU, 16 GB RAM, and an NVIDIA GeForce RTX 4060 GPU.
We first set a = 10−3 and b = 10−3, and initialized each embedding mean with a random number between 0 and 1. We used the original implementations of all comparison methods with their default settings (such as the maximum number of iterations). As mentioned above, we fixed τ to 1 in the BS-CP1 method; the remaining parameters of BS-CP1 and BS-CP2 follow the initialization in Algorithms 1 and 2, respectively.
We evaluate the final predictive performance after all stream entries have been processed. To this end, the 15,000 observations of Beijing-15k were randomly split into 75 percent for training and 25 percent for testing; the 20,000 observations of Beijing-20k were randomly split 4:1 into training and test sets; because MovieLens-1M is large, 90 percent of its 1,000,209 entries were used for training and 10 percent for testing; and the 49,978 observations of the Fit Record dataset were randomly split 4:1 into training and test sets. The static tensor decomposition baselines perform iterative decomposition that requires repeated access to the entire data. We fixed the batch size to 1, repeated each experiment 5 times, and report the mean root mean square error (RMSE) on these real-valued datasets. We report the final predictive performance of each method at tensor decomposition ranks R = 3, 5, 8, 10. We then give the total time for BS-CP1 and BS-CP2 to complete one pass under different training/test ratios (0.25 and 0.1) for the different values of R. In addition, we explored the relationship between the choice of Gauss-Laguerre nodes and RMSE: we selected n = 1, 2, 3, 4, 5 nodes, with the corresponding weights obtained from the Gauss-Laguerre nodes and weights table [38]. This setup evaluates the relationship between the rank of the decomposition and the accuracy of the calculation.
For all algorithms in the experiments, we used the original implementations of the competing methods with their default settings (for example, the maximum number of iterations for static decomposition). We evaluate the final prediction performance once all stream entries have been processed. For POST, BASS-Tucker, CT-CP, and our BS-CP1 and BS-CP2, we randomly divide the training entries into mini-batch streams; the other approaches perform iterative static decomposition, requiring repeated access to the entire data.
Metric
We selected the root mean squared error (RMSE) as the evaluation metric. RMSE is a commonly used performance metric in machine learning; it is the square root of the mean squared error (MSE) and reflects the average difference between predicted and true values. Its formula is
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2}, \tag{27}$$
where n is the number of samples, $Y_i$ is the i-th true value, and $\hat{Y}_i$ is the i-th predicted value. RMSE reflects the standard deviation of the true error, facilitates error analysis, and is more sensitive to outliers.
Meanwhile, we examine the standard deviation (STD) of the RMSE to indicate the stability of our model. The standard deviation is calculated as
$$\sigma = \sqrt{\frac{1}{a} \sum_{j=1}^{a} \left(x_j - \mu\right)^2}, \tag{28}$$
where a is the number of runs of the algorithm, $x_j$ is the result of the j-th run, and μ is the average of the results over all runs (the sum of the results divided by a). The standard deviation is a very effective indicator of the stability of the algorithm.
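Both metrics reduce to a few lines of numpy; a sketch with hypothetical values, not results from the paper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (27)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def run_std(results):
    """Standard deviation over a repeated runs, Eq. (28)."""
    results = np.asarray(results)
    return np.sqrt(np.mean((results - results.mean()) ** 2))

# Hypothetical RMSE values from five repeated runs:
print(run_std([0.53, 0.55, 0.54, 0.52, 0.56]))
```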
Results
RMSE of different methods
We show line plots of the RMSE values and standard deviations (STD) of all comparison methods and our two methods for R = 3, 5, 8, 10 in S5 Fig. It can be seen that, except for being slightly worse than POST at some values of R on the Beijing-20k dataset and worse than CT-CP at R = 10 on the Fit Record dataset, BS-CP1 and BS-CP2 outperform the comparison methods in all other cases. This demonstrates the superiority of the structure of our BS-CP algorithm.
Observing POST on the four datasets, however, its performance varies from dataset to dataset. This may be because POST uses SVB for approximate Bayesian inference and falls into local optima, making it sensitive to some datasets (Beijing-20k) and insensitive to others (MovieLens-1m). Our method, by contrast, performs well on all datasets, which shows a certain universality. The reason CT-CP outperforms our method at R = 10 may be that the Fit Record dataset itself is sensitive to temporal information, and CT-CP embeds temporal information into the tensor decomposition, giving it an advantage in time-dependent prediction. Although our method does not focus on the time factor, its performance is not worse than that of CT-CP, which specifically builds a time-varying coefficient model of the time tensor decomposition; this is enough to show the wide applicability and excellent performance of our method. Meanwhile, in all cases, BS-CP1 and BS-CP2 outperform the SOTA static decomposition methods.
More importantly, it can be seen from S5(c) and S5(d) Fig that our BS-CP structure yields a larger improvement than most other methods. We put the RMSE values of all the streaming methods together for comparison in S6 Fig. As can be seen there, the red and purple columns on the right (BS-CP1 and BS-CP2) have better RMSE than most of the columns on the left (the streaming methods used for comparison). On the last two datasets, comparing our method BS-CP1 with the SOTA streaming method BASS-Tucker, we achieve 31.8% and 33.3% RMSE improvements. Considering that MovieLens-1m and Fit Record are relatively large datasets (1,000,209 and 49,978 elements), these results further illustrate the effectiveness of our method on large real datasets.
As the numerical results in Table 3 show, BS-CP is superior to the comparison methods in most cases, with the most obvious improvement on MovieLens-1m. At the same time, the STD values of our BS-CP methods in Table 3 are smaller than those of most methods, which shows that the stability of BS-CP is generally better than that of the other methods. Moreover, the STD of BS-CP2 is smaller than that of BS-CP1, indicating the advantage of the Gauss-Laguerre quadrature method. These results demonstrate the advantage of BS-CP in prediction accuracy.
[Table 3 omitted. See PDF.]
Final prediction errors with R = 3; BS-CP1 and BS-CP2 were run with R = 3 and n = 5 nodes.
Stability and saliency of the model
Table 3 and S5 Fig also reflect the effect of dataset type on the standard deviation. In S5 Fig, the standard deviation on the larger datasets (MovieLens-1m and Fit Record) is smaller, because the stability of the algorithm improves with more training data. The largest dataset, MovieLens-1m, has the smallest standard deviation, meaning the best stability. Meanwhile, comparing the STD values of BS-CP1 and BS-CP2 with those of the other methods in S5 Fig shows that our methods’ STD values are better, indicating that our methods have good stability.
To establish the significance of the advantages of our method, BS-CP1 and BS-CP2 are compared with the other methods by Student’s t test, following the formula
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}}, \tag{29}$$
where $\bar{x}_1$ and $\bar{x}_2$ are the averages of the first and second sets of data, $s_1$ and $s_2$ are their STDs, and $N_1$ and $N_2$ are the numbers of samples in the first and second sets.
In this way we can obtain the t value of each compared method against BS-CP1 and BS-CP2 on all datasets. By looking up the critical values of the t-distribution, we can obtain the range of p-values corresponding to the t values. The conclusion is that, except for POST, BS-CP1 and BS-CP2 show a significant improvement (p < 0.05) over all other comparison methods, demonstrating the significance and effectiveness of our approach.
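A sketch of this test computed from summary statistics, with made-up numbers; scipy’s Welch variant matches the unpooled form of (29):

```python
import numpy as np
from scipy import stats

def t_value(mean1, std1, n1, mean2, std2, n2):
    """Two-sample t statistic of Eq. (29) from summary statistics."""
    return (mean1 - mean2) / np.sqrt(std1 ** 2 / n1 + std2 ** 2 / n2)

# Hypothetical mean/STD of RMSE over 5 runs: a baseline vs. BS-CP.
t = t_value(0.62, 0.02, 5, 0.53, 0.01, 5)

# scipy yields the p-value directly from the same summaries.
res = stats.ttest_ind_from_stats(0.62, 0.02, 5, 0.53, 0.01, 5,
                                 equal_var=False)
print(t, res.pvalue)  # p < 0.05 would indicate a significant difference
```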
Training time
In S7 Fig, we show the running time of BS-CP1 and BS-CP2 on the different datasets for R = 3, 5, 8, 10. We can see intuitively that BS-CP2 takes more time than BS-CP1, which is the price of BS-CP2’s higher stability. The overall training time of BS-CP2 shows a downward trend as R grows, as does that of BS-CP1 on Beijing-15k and MovieLens-1m. This may be attributed to the Gauss-Laguerre quadrature in BS-CP2, which is based on the orthogonality of the Laguerre polynomials and helps simplify the integral calculation; it is not constrained by the value of R, and in the more complicated cases of exponential integration it may even converge faster. This further indicates the advantage of BS-CP2. On the Beijing-20k and Fit Record datasets, BS-CP1 fluctuates up and down with an increasing or unclear trend. On the whole, the running time of BS-CP1 and BS-CP2 tends to decrease as R increases, which shows that our method has a degree of scalability and that the scalability of BS-CP2 is better than that of BS-CP1.
S8 Fig shows the running time of the BS-CP2 algorithm for different numbers of nodes n. The smaller n is, the shorter the training time, and the training time is proportional to n, showing that BS-CP2 has good linear scalability.
Effect of the number of nodes on RMSE
We show the relationship between RMSE and the number of nodes n in S9 Fig. As can be seen, RMSE is not simply a monotone function of n. In S9(a) Fig, the RMSE increases slightly (by more than 0.001) as the number of nodes increases. Because the increase in S9(b) Fig is too small (less than 0.0001), we plot the RMSE only at R = 3 for comparison, and it can be seen that the RMSE increases slightly with the number of nodes. S9(c) Fig behaves the same: RMSE increases as n increases. In S9(d) Fig, however, a NaN value occurs at R = 3, n = 1, while at R = 8, n = 1 the RMSE attains its minimum value of 0.529605, much better than the rest of the methods; the next best case without a NaN value occurs at n = 5.
This is because Gauss-Laguerre quadrature converges quickly: although increasing the number of nodes increases the training time, the integration accuracy improves rapidly as the number of nodes and weights grows.
Conclusion
We propose BS-CP, a streaming Bayesian CP tensor decomposition algorithm that can effectively process streaming data and provide posterior updates. On this framework we build two algorithms, BS-CP1 and BS-CP2: BS-CP1 is the straightforward implementation of BS-CP, while BS-CP2 uses Gauss-Laguerre quadrature to reduce the error of the posterior update parameters and further improve accuracy. We compared these two algorithms with static and streaming SOTA tensor decomposition recommendation algorithms on four real-world datasets of different sizes. In terms of RMSE, our two algorithms are overwhelmingly superior to the selected methods, including the advanced streaming SOTA methods, with the greatest improvements on the large datasets, which demonstrates the advantages of our approach.
In addition, we explored the actual training time of BS-CP1 and BS-CP2 and its relationship with the choice of R. On the whole, the running time of BS-CP2 decreases as R increases, while BS-CP1 fluctuates. We also explored the influence of the node count n of BS-CP2 on the running time, and then on the RMSE: except for some abrupt points in the last dataset, the RMSE mostly increased with n. These experiments demonstrate the stability and scalability of our algorithm.
However, our method also has certain limitations. If the data contain many zeros or missing values, CP decomposition may not accurately capture the data structure, and additional techniques (such as regularization or low-rank approximation) are required to handle sparse data. In non-random noise environments, CP decomposition may be affected by the noise, producing inaccurate decompositions. If the moment matching method does not select enough moments, or selects them unreasonably, it may also be affected by noise, leading to an inaccurate approximate probability density function. The accuracy of the Gauss-Laguerre quadrature may likewise be compromised in high-noise environments, especially when the noise distribution is unknown or does not match the integrand. In future work, we will adapt our method to extreme environments (sparse data or high noise), for example through regularization, low-rank approximation, or data smoothing, so that our algorithm can be applied more widely. Our algorithm also does not consider the importance of time variables; next, time parameters will be introduced into the tensor decomposition to make our algorithm more suitable for time-sensitive data.
Supporting information
S1 Fig. CP decomposition and Tucker decomposition when the tensor is three mode.
https://doi.org/10.1371/journal.pone.0312723.s001
(TIF)
S2 Fig. The structure diagram of the BS-CP algorithm.
https://doi.org/10.1371/journal.pone.0312723.s002
(TIF)
S3 Fig. Online Bayesian Streaming Structure.
https://doi.org/10.1371/journal.pone.0312723.s003
(TIF)
S4 Fig. CP decomposition Bayes network.
https://doi.org/10.1371/journal.pone.0312723.s004
(TIF)
S5 Fig. Predictive performance with different numbers of factors.
The results are averaged over 5 runs. The performance of several baselines is incomplete because they are far worse than all other methods and hence not included.
https://doi.org/10.1371/journal.pone.0312723.s005
(ZIP)
S6 Fig. Streaming method RMSE compared to BS-CP1 and BS-CP2.
https://doi.org/10.1371/journal.pone.0312723.s006
(TIF)
S7 Fig. The total time taken to complete one pass over each dataset, for BS-CP1 and BS-CP2 at R = 3, 5, 8, 10 and n = 5.
https://doi.org/10.1371/journal.pone.0312723.s007
(TIF)
S8 Fig. The relationship between the number of nodes and the running time of the BS-CP2 algorithm, shown on the Beijing-15k dataset for R = 3, 5, 8, 10.
https://doi.org/10.1371/journal.pone.0312723.s008
(TIF)
S9 Fig. The relationship between RMSE and the number of nodes n (1, 2, 3, 4, 5) on the four datasets above, shown in (a), (b), (c), (d) for different values of R.
Note that some points have no value because the value is NaN. (b) and (c) show the RMSE at R = 3 only; because the differences between the RMSE values are too small, we chose a clearer way of plotting.
https://doi.org/10.1371/journal.pone.0312723.s009
(TIF)
References
1. Harshman RA. Foundations of the PARAFAC procedure: Model and conditions for an "explanatory" multi-mode factor analysis. UCLA Working Papers in Phonetics. 1970.
2. Tucker L. Some mathematical notes on three-mode factor analysis. Psychometrika. 1966;31(3):279–311. pmid:5221127
3. De Lathauwer L. Decompositions of a higher-order tensor in block terms—Part II: Definitions and uniqueness. SIAM Journal on Matrix Analysis & Applications. 2011;30(3):1033–1066.
4. Bader BW, Kolda TG. Efficient MATLAB Computations with Sparse and Factored Tensors. SIAM Journal on Scientific Computing. 2008;30(1):205–231.
5. Acar E, Dunlavy DM, Kolda TG, Mørup M. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems. 2011;106(1):41–56.
6. Du Y, Zheng Y, Lee Kc, Zhe S. Probabilistic Streaming Tensor Decomposition. In: 2018 IEEE International Conference on Data Mining (ICDM); 2018. p. 99–108.
7. Fang S, Narayan A, Kirby R, Zhe S. Bayesian Continuous-Time Tucker Decomposition. In: Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, editors. Proceedings of the 39th International Conference on Machine Learning. vol. 162 of Proceedings of Machine Learning Research. PMLR; 2022. p. 6235–6245. Available from: https://proceedings.mlr.press/v162/fang22b.html.
8. Neal RM. Probabilistic inference methods using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto; 2011.
9. Kass RE, Raftery AE. Bayes Factors and Model Uncertainty. Journal of the American Statistical Association. 1995;90.
10. Attias H. Inferring Parameters and Structure of Latent Variable Models by Variational Bayes. In: Conference on Uncertainty in Artificial Intelligence; 1999. Available from: https://api.semanticscholar.org/CorpusID:13371224.
11. Boyen X, Koller D. Tractable Inference for Complex Stochastic Processes. Computer Science. 2013; p. 33–42.
12. Picard R, Smith AC, Minka TP. A family of algorithms for approximate Bayesian inference. Thesis, Massachusetts Institute of Technology. 2001.
13. Koekoek R, Meijer HG. A Generalization of Laguerre Polynomials. SIAM Journal on Mathematical Analysis. 1993;24(3):768–782.
14. Fang S, Kirby RM, Zhe S. Bayesian streaming sparse Tucker decomposition. In: Conference on Uncertainty in Artificial Intelligence; 2021. Available from: https://api.semanticscholar.org/CorpusID:237511829.
15. Zhang Y, Bi X, Tang N, Qu A. Dynamic tensor recommender systems. J Mach Learn Res. 2021;22(1).
16. Kolda TG, Bader BW. Tensor Decompositions and Applications. SIAM Review. 2009;51(3):455–500.
17. Symeonidis P. Matrix and Tensor Decomposition in Recommender Systems. In: ACM Conference on Recommender Systems. New York, NY, USA: Association for Computing Machinery; 2016. Available from: https://doi.org/10.1145/2959100.2959195.
18. Park N, Oh S, Kang U. Fast and Scalable Distributed Boolean Tensor Factorization. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE); 2017. p. 1071–1082.
19. Song Q, Huang X, Ge H, Caverlee J, Hu X. Multi-Aspect Streaming Tensor Completion. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery; 2017. Available from: https://doi.org/10.1145/3097983.3098007.
20. Chu W, Ghahramani Z. Probabilistic Models for Incomplete Multi-dimensional Arrays. In: International Conference on Artificial Intelligence and Statistics; 2009. Available from: https://api.semanticscholar.org/CorpusID:16124749.
21. Xu Z, Yan F, Qi Y. Infinite Tucker decomposition: nonparametric Bayesian models for multiway data analysis. In: Proceedings of the 29th International Conference on Machine Learning. Omnipress; 2012. p. 1675–1682.
22. Hoff PD. Equivariant and scale-free Tucker decomposition models; 2013. Available from: https://arxiv.org/abs/1312.6397.
23. Schein A, Zhou M, Blei DM, Wallach HM. Bayesian Poisson Tucker Decomposition for Learning the Structure of International Relations. In: International Conference on Machine Learning; 2016. Available from: https://api.semanticscholar.org/CorpusID:8500579.
24. Kim H, Lu X, Flaxman S, Teh YW. Tucker Gaussian Process for Regression and Collaborative Filtering. ArXiv. 2016;abs/1605.07025.
25. Kilmer ME, Martin CD. Factorization strategies for third-order tensors. Linear Algebra and its Applications. 2011;435(3):641–658.
26. Huang Y, Feng Y, Liao Q. Streaming probabilistic tensor train decomposition. ArXiv. 2023.
27. Yuan L, Li C, Mandic D, Cao J, Zhao Q. Tensor ring decomposition with rank minimization on latent space: an efficient approach for tensor completion. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence. AAAI Press; 2019. Available from: https://doi.org/10.1609/aaai.v33i01.33019151.
28. Bai J, Jia L, Peng Z. A New Insight on Augmented Lagrangian Method with Applications in Machine Learning. Journal of Scientific Computing. Available from: https://doi.org/10.1007/s10915-024-02518-0.
29. McCutchen RM, Khuller S. Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity. In: International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX) / International Workshop on Randomization and Computation (RANDOM); 2008. Available from: https://api.semanticscholar.org/CorpusID:1708448.
30. Nion D, Sidiropoulos ND. Adaptive Algorithms to Track the PARAFAC Decomposition of a Third-Order Tensor. IEEE Transactions on Signal Processing. 2009;57(6):2299–2310.
31. Mardani M, Mateos G, Giannakis GB. Subspace Learning and Imputation for Streaming Big Data Matrices and Tensors. IEEE Transactions on Signal Processing. 2015;63(10):2663–2677.
32. Zhou S, Vinh NX, Bailey J, Jia Y, Davidson I. Accelerating Online CP Decompositions for Higher Order Tensors. In: ACM SIGKDD International Conference. New York, NY, USA: Association for Computing Machinery; 2016. Available from: https://doi.org/10.1145/2939672.2939763.
33. Minka TP. Expectation Propagation for approximate Bayesian inference. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.; 2001. p. 362–369.
34. Broderick T, Boyd N, Wibisono A, Wilson AC, Jordan MI. Streaming Variational Bayes. In: Proceedings of the 26th International Conference on Neural Information Processing Systems—Volume 2. Red Hook, NY, USA: Curran Associates Inc.; 2013. p. 1727–1735.
35. Tuenter HJH. An algorithm to determine the parameters of SU-curves in the Johnson system of probability distributions by moment matching. Journal of Statistical Computation and Simulation. 2001;70(4):325–347.
36. Wang Z, Zhe S. Conditional Expectation Propagation. ArXiv. 2019;abs/1910.12360.
37. Wainwright MJ, Jordan MI. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning. 2008;1(1–2):1–305. https://doi.org/10.1561/2200000001
38. Süli E, Mayers DF. An Introduction to Numerical Analysis. Cambridge University Press; 2003.
39. Chen S. Beijing Multi-Site Air Quality; 2019. UCI Machine Learning Repository.
40. Harper FM, Konstan JA. The MovieLens Datasets: History and Context. ACM Trans Interact Intell Syst. 2015;5(4).
41. European Organization For Nuclear Research, OpenAIRE. Zenodo; 2013. Available from: https://www.zenodo.org/.
42. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. ArXiv. 2019;abs/1912.01703.
Citation: Liu J, Wu Q, Jiang L, Zhan R, Zhao X, Wu H, et al. (2024) BS-CP: Efficient streaming Bayesian tensor decomposition method via assumed density filtering. PLoS ONE 19(12): e0312723. https://doi.org/10.1371/journal.pone.0312723
About the Authors:
Jiaqi Liu
Contributed equally to this work with: Jiaqi Liu, Qiwu Wu
Roles: Conceptualization
Affiliation: School of Information Engineering, Engineering University of People’s Armed Police of China, Xi’an, China;
ORCID: https://orcid.org/0009-0006-3834-4775
Qiwu Wu
Contributed equally to this work with: Jiaqi Liu, Qiwu Wu
Roles: Funding acquisition
E-mail: [email protected]
Affiliation: Engineering University of People’s Armed Police of China, Xi’an, China
ORCID: https://orcid.org/0009-0002-7706-7521
Lingzhi Jiang
Roles: Data curation
¶‡ LJ, RZ, XZ, HW and WT also contributed equally to this work.
Affiliation: School of Information Engineering, Engineering University of People’s Armed Police of China, Xi’an, China;
Renjun Zhan
Roles: Formal analysis
¶‡ LJ, RZ, XZ, HW and WT also contributed equally to this work.
Affiliation: Engineering University of People’s Armed Police of China, Xi’an, China
Xiaochuan Zhao
Roles: Investigation
¶‡ LJ, RZ, XZ, HW and WT also contributed equally to this work.
Affiliations: Engineering University of People’s Armed Police of China, Xi’an, China, Institute of Computer Application Technology, China North Industries Group, Beijing, China
Husheng Wu
Roles: Project administration
¶‡ LJ, RZ, XZ, HW and WT also contributed equally to this work.
Affiliation: Engineering University of People’s Armed Police of China, Xi’an, China
Weicong Tan
Roles: Validation
¶‡ LJ, RZ, XZ, HW and WT also contributed equally to this work.
Affiliation: School of Information Engineering, Engineering University of People’s Armed Police of China, Xi’an, China;
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
1. Harshman RA. Foundations of the PARAFAC procedure: Model and conditions for an”explanatory”multi-mode factor analysis. Ucla Working Papers in Phonetics. 1970;.
2. Tucker L. Some mathematical notes on three-mode factor analysis. Psychometrika. 1966;31(3):279–311. pmid:5221127
3. De Lathauwer L. Decompositions of a higher-order tensor in block terms—Part II: Definitions and uniqueness. Siam Journal on Matrix Analysis & Applications. 2011;30(3):1033–1066.
4. Bader BW, Kolda TG. Efficient MATLAB Computations with Sparse and Factored Tensors. SIAM Journal on Scientific Computing. 2008;30(1):205–231.
5. Acar E, Dunlavy DM, Kolda TG, Mørup M. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems. 2011;106(1):41–56.
6. Du Y, Zheng Y, Lee Kc, Zhe S. Probabilistic Streaming Tensor Decomposition. In: 2018 IEEE International Conference on Data Mining (ICDM); 2018. p. 99–108.
7. Fang S, Narayan A, Kirby R, Zhe S. Bayesian Continuous-Time Tucker Decomposition. In: Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, editors. Proceedings of the 39th International Conference on Machine Learning. vol. 162 of Proceedings of Machine Learning Research. PMLR; 2022. p. 6235–6245. Available from: https://proceedings.mlr.press/v162/fang22b.html.
8. Neal RM. Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto; 1993.
9. Kass RE, Raftery AE. Bayes Factors and Model Uncertainty. Journal of the American Statistical Association. 1995;90.
10. Attias H. Inferring Parameters and Structure of Latent Variable Models by Variational Bayes. In: Conference on Uncertainty in Artificial Intelligence; 1999. Available from: https://api.semanticscholar.org/CorpusID:13371224.
11. Boyen X, Koller D. Tractable Inference for Complex Stochastic Processes. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence; 1998. p. 33–42.
12. Minka TP. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology; 2001.
13. Koekoek R, Meijer HG. A Generalization of Laguerre Polynomials. SIAM Journal on Mathematical Analysis. 1993;24(3):768–782.
14. Fang S, Kirby RM, Zhe S. Bayesian streaming sparse Tucker decomposition. In: Conference on Uncertainty in Artificial Intelligence; 2021. Available from: https://api.semanticscholar.org/CorpusID:237511829.
15. Zhang Y, Bi X, Tang N, Qu A. Dynamic tensor recommender systems. J Mach Learn Res. 2021;22(1).
16. Kolda TG, Bader BW. Tensor Decompositions and Applications. SIAM Review. 2009;51(3):455–500.
17. Symeonidis P. Matrix and Tensor Decomposition in Recommender Systems. In: ACM Conference on Recommender Systems. New York, NY, USA: Association for Computing Machinery; 2016. Available from: https://doi.org/10.1145/2959100.2959195.
18. Park N, Oh S, Kang U. Fast and Scalable Distributed Boolean Tensor Factorization. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE); 2017. p. 1071–1082.
19. Song Q, Huang X, Ge H, Caverlee J, Hu X. Multi-Aspect Streaming Tensor Completion. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery; 2017. Available from: https://doi.org/10.1145/3097983.3098007.
20. Chu W, Ghahramani Z. Probabilistic Models for Incomplete Multi-dimensional Arrays. In: International Conference on Artificial Intelligence and Statistics; 2009. Available from: https://api.semanticscholar.org/CorpusID:16124749.
21. Xu Z, Yan F, Qi Y. Infinite Tucker decomposition: nonparametric Bayesian models for multiway data analysis. In: Proceedings of the 29th International Conference on Machine Learning. Omnipress; 2012. p. 1675–1682.
22. Hoff PD. Equivariant and scale-free Tucker decomposition models; 2013. Available from: https://arxiv.org/abs/1312.6397.
23. Schein A, Zhou M, Blei DM, Wallach HM. Bayesian Poisson Tucker Decomposition for Learning the Structure of International Relations. In: International Conference on Machine Learning; 2016. Available from: https://api.semanticscholar.org/CorpusID:8500579.
24. Kim H, Lu X, Flaxman S, Teh YW. Tucker Gaussian Process for Regression and Collaborative Filtering. ArXiv. 2016;abs/1605.07025.
25. Kilmer ME, Martin CD. Factorization strategies for third-order tensors. Linear Algebra and its Applications. 2011;435(3):641–658.
26. Huang Y, Feng Y, Liao Q. Streaming probabilistic tensor train decomposition. ArXiv. 2023.
27. Yuan L, Li C, Mandic D, Cao J, Zhao Q. Tensor ring decomposition with rank minimization on latent space: an efficient approach for tensor completion. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence. AAAI Press; 2019. Available from: https://doi.org/10.1609/aaai.v33i01.33019151.
28. Bai J, Jia L, Peng Z. A New Insight on Augmented Lagrangian Method with Applications in Machine Learning. Journal of Scientific Computing. 2024. Available from: https://doi.org/10.1007/s10915-024-02518-0.
29. McCutchen RM, Khuller S. Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity. In: Approximation, Randomization and Combinatorial Optimization (APPROX/RANDOM); 2008. Available from: https://api.semanticscholar.org/CorpusID:1708448.
30. Nion D, Sidiropoulos ND. Adaptive Algorithms to Track the PARAFAC Decomposition of a Third-Order Tensor. IEEE Transactions on Signal Processing. 2009;57(6):2299–2310.
31. Mardani M, Mateos G, Giannakis GB. Subspace Learning and Imputation for Streaming Big Data Matrices and Tensors. IEEE Transactions on Signal Processing. 2015;63(10):2663–2677.
32. Zhou S, Vinh NX, Bailey J, Jia Y, Davidson I. Accelerating Online CP Decompositions for Higher Order Tensors. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery; 2016. Available from: https://doi.org/10.1145/2939672.2939763.
33. Minka TP. Expectation Propagation for approximate Bayesian inference. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.; 2001. p. 362–369.
34. Broderick T, Boyd N, Wibisono A, Wilson AC, Jordan MI. Streaming Variational Bayes. In: Proceedings of the 26th International Conference on Neural Information Processing Systems—Volume 2. Red Hook, NY, USA: Curran Associates Inc.; 2013. p. 1727–1735.
35. Tuenter HJH. An algorithm to determine the parameters of SU-curves in the Johnson system of probability distributions by moment matching. Journal of Statistical Computation and Simulation. 2001;70(4):325–347.
36. Wang Z, Zhe S. Conditional Expectation Propagation. ArXiv. 2019;abs/1910.12360.
37. Wainwright MJ, Jordan MI. Graphical Models, Exponential Families, and Variational Inference. Foundations & Trends in Machine Learning. 2008;1(1–2):1–305. https://doi.org/10.1561/2200000001
38. Süli E, Mayers DF. An Introduction to Numerical Analysis. Cambridge University Press; 2003.
39. Chen S. Beijing Multi-Site Air Quality; 2019. UCI Machine Learning Repository.
40. Harper FM, Konstan JA. The MovieLens Datasets: History and Context. ACM Trans Interact Intell Syst. 2015;5(4).
41. European Organization For Nuclear Research, OpenAIRE. Zenodo; 2013. Available from: https://www.zenodo.org/.
42. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. ArXiv. 2019;abs/1912.01703.
© 2024 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
Tensor data is common in real-world applications such as recommender systems and air quality monitoring, but it is often sparse, noisy, and rapidly produced. CANDECOMP/PARAFAC (CP) is a popular tensor decomposition model that is both theoretically advantageous and numerically stable. However, learning the CP model in a Bayesian framework, though promising for handling data sparsity and noise, is computationally challenging, especially on fast data streams. The fundamental problem addressed by this paper is the efficient processing of streaming tensor data. We propose BS-CP, a fast and accurate framework that dynamically updates the posterior of the latent factors whenever a new tensor observation is received. We first present the BS-CP1 algorithm, an efficient implementation based on assumed density filtering (ADF). We then propose the BS-CP2 algorithm, which uses Gauss–Laguerre quadrature to integrate out the noise effect and yields better empirical results. We evaluated BS-CP1 and BS-CP2 on real recommendation datasets, including Beijing-15k, Beijing-20k, MovieLens-1m, and Fit Record. Compared with state-of-the-art methods, BS-CP1 achieves 31.8% and 33.3% RMSE improvements on the last two datasets, with a similar trend observed for BS-CP2. This evidence indicates that our algorithms perform better on large datasets and are well suited to real-world scenarios. Compared with most baseline methods, our approach demonstrates an improvement of over 10% and exhibits superior stability.
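The abstract only names the quadrature tool, so a brief illustration may help. The sketch below, which assumes Python with NumPy, demonstrates the Gauss–Laguerre rule itself: a weighted sum over precomputed nodes that approximates integrals of the form ∫₀^∞ f(x)e^(−x) dx. The helper name `gl_expectation` and the polynomial integrands are illustrative placeholders, not the authors' implementation; in BS-CP2 the integrand would instead be the likelihood term involving the noise variable.
```python
import numpy as np

def gl_expectation(f, deg=30):
    """Approximate the integral of f(x) * exp(-x) over [0, inf)
    with a deg-point Gauss-Laguerre rule: sum_i w_i * f(x_i).
    Illustrative helper, not the BS-CP2 implementation."""
    nodes, weights = np.polynomial.laguerre.laggauss(deg)
    return float(np.sum(weights * f(nodes)))

# Sanity checks against the closed-form Gamma integrals Gamma(n+1) = n!:
print(gl_expectation(lambda x: x))     # ~1.0, since the integral of x e^{-x} is 1! = 1
print(gl_expectation(lambda x: x**2))  # ~2.0, since the integral of x^2 e^{-x} is 2! = 2
```
The appeal of such a rule in a streaming setting is that the nodes and weights are computed once and reused for every update, so integrating over the noise term adds only a fixed, small number of integrand evaluations per incoming observation.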