Online learning is a branch of machine learning and has been studied extensively in recent years.[1–10] It aims to solve a sequence of consecutive prediction tasks by leveraging the knowledge gained from previous tasks. The goal is to make accurate predictions about the sequence. This is different from traditional offline machine learning methods that aim to learn a prediction model from the entire training set. Online learning plays an important role in practical applications involving continuous streaming of data, such as ad click-through rate prediction,[11] online collaborative filtering,[12] online malicious URL detection,[13] and online portfolio selection.[14]
In this article, we focus on the problem of online classification for online learning. Many online algorithms have been proposed to deal with successive binary classification tasks. A typical online algorithm employs a binary discriminant based on a linear hypothesis to make the prediction and updates the model for future tasks. For example, the well-known Perceptron algorithm performs a mistake-driven update to the hypothesis.[15–17] The online gradient descent (OGD) algorithm updates the hypothesis along the direction of the current gradient.[18] Crammer et al.[19] presented the passive-aggressive (PA) algorithm, which enables a loss-driven update by utilizing the notion of margin, as well as two variants of PA that further take the presence of noise into consideration. On top of the PA algorithm, the confidence-weighted (CW) and linearized confidence-weighted (LCW) algorithms introduce the notion of weight confidence through probability distributions.[20–22] Soft confidence-weighted (SCW) learning introduces a loss function into the CW algorithm so that the update reacts less aggressively to noisy examples.[23,24] When it comes to handling successive multiclass classification tasks, most research has focused primarily on extending online algorithms for binary problems to multiclass problems. This has resulted in a family of multiclass-oriented online algorithms that employ a multiclass discriminant based on multiple linear hypotheses: MIRA,[25] multiclass PA,[19,26] multiclass CW,[27] and so on. However, the prediction performance may be limited by the linearity, and thus complex problems require superior feature extraction to achieve satisfactory performance.
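To make the contrast between mistake-driven and loss-driven updates concrete, here is a minimal single-round sketch (our own illustration, not pseudocode from the cited papers) of the Perceptron update and the PA update for a linear hypothesis:

```python
def perceptron_update(w, x, y):
    """Mistake-driven: change w only when the prediction is wrong."""
    yhat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
    if yhat != y:
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def pa_update(w, x, y):
    """Loss-driven: update whenever the margin y*(w . x) falls below 1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)
    tau = loss / sum(xi * xi for xi in x)  # closed-form step size (no-slack PA)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

The Perceptron changes nothing on a correct prediction, whereas PA updates whenever the margin is below 1, using a closed-form step size.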
The kernel trick in machine learning provides a way to bypass this linearity.[28] It kernelizes a learning algorithm by substituting all inner products in the algorithm with kernel functions, e.g., the Gaussian kernel.[29] Since many online algorithms can be kernelized, they are able to employ a discriminant based on a nonlinear hypothesis (or multiple nonlinear hypotheses) composed of kernel functions to reach satisfactory prediction performance. However, a kernel-based online algorithm must store support vectors (SVs) and the corresponding combination coefficients to construct a hypothesis. This causes the kernelized online algorithm to suffer from unlimited growth in runtime and memory consumption as more and more tasks are completed. This phenomenon is known as the curse of kernelization and may render a kernel-based online algorithm inoperable. Some works stop the growth by setting a limit on the maximum number of SVs. For example, the budget Perceptron algorithm controls the number of SVs by removing the support element (SE) whose removal gives the maximum margin.[30] The Forgetron algorithm shrinks the hypothesis and then discards the oldest SE.[31] The randomized budget Perceptron (RBP) algorithm randomly selects an SE for removal.[32] The Projectron and Projectron++ algorithms bound the growth by projecting the kernel function of the current instance onto the subspace spanned by the kernel functions centered on SVs.[33] In contrast, both the Fourier OGD (FOGD) and Nyström OGD (NOGD) algorithms transform a nonlinear hypothesis into a linear hypothesis by a kernel-induced feature approximation and then apply the OGD algorithm.[34,35] Some kernel-based online algorithms have been further extended to solve multiclass problems, such as multiclass Projectron++,[33] MFOGD, and MNOGD.[34,35]
Among state-of-the-art online algorithms, both the CW and LCW algorithms introduce weight confidence by using Gaussian distributions. They determine the update through a constrained optimization problem; the only difference between them is the optimization constraint that ensures a sufficiently large margin. Compared with the Perceptron and PA algorithms, the weight confidence allows each weight to be updated at a different scale and thus yields better prediction performance.[22] The CW algorithm serves as the basis for the SCW-I and SCW-II algorithms.[23] Both CW and LCW algorithms have been extended to multiclass-oriented algorithms for solving multiclass problems.[24,27] Moreover, they can be kernelized to employ a nonlinear hypothesis for better prediction performance, but they remain subject to the curse of kernelization, which makes them difficult to operate in practical applications where limited computing power is available. However, to the best of our knowledge, no research work has aimed at breaking the curse for the CW or LCW algorithm. It is worth studying how to overcome this issue so that a nonlinear hypothesis can be employed safely in practical applications. In addition, such a study may give a clue about how to break the curse for CW- or LCW-based algorithms, such as SCW-I and SCW-II.
In this article, we try to break the curse of kernelization for the kernel-based LCW algorithm, which is the LCW algorithm kernelized by the kernel trick. We propose to control the growth by putting a limit on the maximum number of SVs through the budget, a predefined value. Since controlling the growth entails some sacrifice in prediction performance, the challenge is how to maintain the performance by reducing the information loss in the hypothesis. To deal with the challenge, we propose the budgeted LCW (BLCW) algorithm, which controls the growth by the budget and utilizes projections to reduce the information loss. Specifically, the following contributions are made in this article. 1) The resource perspective is introduced to the kernel-based LCW algorithm. From the resource perspective, we treat each received instance as a possible resource and the kernel-based LCW algorithm as the resource manager that exploits available resources to build the hypothesis for prediction. It gives an alternative and equivalent interpretation of the kernel-based LCW algorithm. 2) Based on the resource perspective, we propose the BLCW algorithm that approximates the kernel-based LCW algorithm by exploiting a limited number of available resources. The BLCW algorithm reduces the information loss in the hypothesis by projections, which preserve the information of the discarded resource. 3) We study four kinds of removal strategies for maintaining the budget and suggest the mean removal strategy as the default for the proposed BLCW algorithm due to its good and stable performance. 4) We verify the proposed BLCW algorithm and the removal strategies by conducting various numerical experiments on real datasets.
For streaming applications that receive data as a sequence of examples, we often assume that the targets are generated according to a specific function or distribution. This assumption may not hold all the time, and the problem of concept drift results in a dynamic setting in which the target function or distribution changes over time.[36,37] Many methods have been proposed to solve this problem. For example, leveraging bagging,[38] adaptive random forests,[39] and the Kappa updated ensemble[40] are ensemble learning-based methods. In addition to ensemble learning techniques, EAL-MAB[41] and MD-OAL[42] further extend to the realm of active learning under a limited labeling budget. In addition, the PA algorithm with the unlearning framework has been studied for adaptation to concept drift over time.[43] Since the LCW algorithm is based on the PA algorithm, it can be applied to address this issue. The objective of this study is to develop a budgeted online algorithm that implements the LCW algorithm while operating under the constraint of limited computational power. The budget used in this study corresponds to the maximum number of instances that can be used to represent the prediction model, while we always query the labels of the encountered instances.
The rest of the article is organized as follows. Section 2 reviews the CW and LCW algorithms, and the kernelization of LCW. In Section 3, the resource perspective is introduced to the kernel-based LCW algorithm, followed by the proposed BLCW algorithm and four removal strategies for budget maintenance. Numerical experiments are conducted in Section 4. We give the conclusions in Section 5, followed by discussions of limitations and possible future directions.
Problem Setting and Background
In this section, we review the CW and LCW algorithms and discuss their difference. Then, we introduce the kernelization of LCW. We consider the problem of online binary classification. It is performed in a sequence of consecutive rounds. Figure 1 depicts the typical flow of online learning in a round. On round t, the learner first receives an instance x_t from a predefined instance domain X. Based on some prediction rule, the learner predicts the class label and later receives the correct class label y_t. Then, the learner updates the prediction rule for future rounds by using the current example (x_t, y_t). The goal of online learning is to make as many correct predictions as possible for the given sequence.
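The round structure just described can be sketched as a short loop; a plain Perceptron fills the learner slot purely for illustration:

```python
def online_learning(stream):
    """One pass over a stream of (x, y) examples, counting mistakes.
    A plain Perceptron fills the learner slot; any online rule would fit."""
    w, mistakes = None, 0
    for x, y in stream:                      # 1) receive an instance x_t
        if w is None:
            w = [0.0] * len(x)
        score = sum(wi * xi for wi, xi in zip(w, x))
        yhat = 1 if score >= 0 else -1       # 2) predict the class label
        if yhat != y:                        # 3) receive y_t and evaluate
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]  # 4) update the rule
    return w, mistakes
```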
[IMAGE OMITTED. SEE PDF]
Let us first delve into the prediction part of online learning. On round t, an online algorithm employs a binary discriminant based on a linear hypothesis composed of an inner product between the weight vector w_t and the feature vector of the instance,[21] where the feature extraction maps instances into the desired feature space. The sign of the inner product predicts the class label of the instance x_t, and the zero level set of the hypothesis defines a linear decision boundary for instances in the feature space. The magnitude of the inner product is interpreted as the confidence in this prediction.
Based on the aforementioned setting for the linear hypothesis, both CW and LCW algorithms further introduce a Gaussian distribution over the weight vector,[21] with mean vector μ and covariance matrix Σ, where Σ belongs to the set of symmetric positive definite matrices.[44] Entries of μ represent the knowledge of features, and diagonal entries of Σ stand for the confidence in the weights: the larger a diagonal entry, the less confident the corresponding weight, and vice versa. Off-diagonal entries record the knowledge of interactions among features. The Gaussian distribution captures the idea of confidence in weights, and hence both CW and LCW algorithms carry the adjective "confidence-weighted".
With the adoption of the Gaussian distribution, both CW and LCW algorithms predict the class label of the instance x_t by using the mean value of the margin, where, for simplicity of notation, we denote by z_t the feature vector of x_t from now on. The variance of the margin, z_t^T Σ_t z_t, can be used to describe the confidence in this prediction: the smaller the variance, the more confidence we have in this prediction, and vice versa. Once the correct class label y_t is received, the prediction performance is evaluated by computing the corresponding prediction mistake.
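Concretely, the quantities involved in the prediction can be sketched as follows, with a dense covariance matrix stored as nested lists (an illustration of the description above, not code from the article):

```python
def predict_with_confidence(mu, sigma, x):
    """The margin w . x is Gaussian with mean m = mu . x and variance
    v = x^T Sigma x; the sign of m is the prediction, v the (un)confidence."""
    n = len(x)
    m = sum(mi * xi for mi, xi in zip(mu, x))
    v = sum(x[i] * sigma[i][j] * x[j] for i in range(n) for j in range(n))
    yhat = 1 if m >= 0 else -1
    return yhat, m, v  # smaller v means higher confidence in the prediction
```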
Before going to the next round, both CW and LCW algorithms determine the update of the linear hypothesis for future rounds by solving a constrained optimization problem; the new hypothesis is characterized by the solution (μ_{t+1}, Σ_{t+1}) of the optimization problem. This constitutes the learning part of online learning. However, the optimization problems for the two algorithms are related but different.
CW Algorithm
We review the update on round t for the CW algorithm. Because the hypothesis in the CW algorithm is parameterized by a weight vector following a Gaussian distribution N(μ_t, Σ_t), the hypothesis is equivalently determined by the pair (μ_t, Σ_t). It is natural to require the updated hypothesis to be parameterized by a new weight vector following a new Gaussian distribution N(μ_{t+1}, Σ_{t+1}) (cf. Equation (5)), or equivalently, to be determined by (μ_{t+1}, Σ_{t+1}).
Thus, the update of the CW algorithm is determined by the solution of the following constrained optimization problem involving two Gaussian distributions,
(μ_{t+1}, Σ_{t+1}) = argmin_{(μ, Σ)} D_KL( N(μ, Σ) ‖ N(μ_t, Σ_t) )  s.t.  Pr_{w ∼ N(μ, Σ)}[ y_t (w · z_t) ≥ 0 ] ≥ η,  (6)
where D_KL(·‖·) is the Kullback–Leibler (KL) divergence between two distributions, and the constraint bounds the probability of a correct prediction on (x_t, y_t) under w ∼ N(μ, Σ). η ∈ (0.5, 1] is a predefined probability threshold. Since the KL divergence can be interpreted as a measure of the difference between two distributions, this optimization problem aims to find the new Gaussian distribution that changes the distribution as little as possible while ensuring a sufficiently high probability of a correct prediction on the current example (x_t, y_t).
Equation (6) can be rewritten as the following equivalent problem,
min_{(μ, Σ)} D_KL( N(μ, Σ) ‖ N(μ_t, Σ_t) )  s.t.  y_t (μ · z_t) ≥ φ √(z_t^T Σ z_t),  (7)
where φ = Φ^{-1}(η) and Φ is the cumulative distribution function of a standard normal distribution. We can observe from the constraint of Equation (7) that a correct prediction is not good enough for the CW algorithm; it demands a sufficiently large margin that is basically determined by the standard deviation of the margin. In other words, the more confident the prediction, the less strict the constraint, and vice versa. Although the objective function of Equation (7) is convex in (μ, Σ), Equation (7) is not a convex optimization problem because the standard deviation term of the constraint, √(z_t^T Σ z_t), is concave in Σ.[44] To conquer this difficulty, Crammer et al.[21] proposed two alternatives to transform Equation (7) into a convex problem. One is to perform a change of variables from Σ to another symmetric positive definite matrix whose square equals Σ and solve the corresponding convex optimization problem.[22] This gives the update rule of the CW algorithm. The other alternative is to linearize the constraint by dropping the square root of the standard deviation term and solve the corresponding convex optimization problem.[20] This leads to the update rule of the LCW algorithm, which will be discussed in more detail in the next section.
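The two routes lead to different margin constraints. The following sketch contrasts them numerically, using the standard library's inverse normal CDF for Φ^{-1}; here m and v denote the margin mean and variance, and the threshold η = 0.9 is an arbitrary illustrative choice:

```python
from statistics import NormalDist

def constraints(m, v, y, eta=0.9):
    """CW demands y*m >= phi*sqrt(v) (standard-deviation margin, nonconvex
    in Sigma); LCW linearizes by dropping the root: y*m >= phi*v."""
    phi = NormalDist().inv_cdf(eta)  # phi = Phi^{-1}(eta)
    return (y * m >= phi * v ** 0.5, y * m >= phi * v)
```

With v > 1 the linearized constraint is stricter than the standard-deviation one, and with v < 1 it is looser.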
LCW Algorithm
The LCW algorithm is the core algorithm we use to develop the budgeted algorithm. On round t, it modifies the constraint of Equation (7) to a linearized form by omitting the square root and obtains a new constrained optimization problem,
min_{(μ, Σ)} D_KL( N(μ, Σ) ‖ N(μ_t, Σ_t) )  s.t.  y_t (μ · z_t) ≥ φ (z_t^T Σ z_t),  (8)
which aims to find the new Gaussian distribution that changes the distribution as little as possible while preserving the property of a sufficiently large margin by using the variance of the margin, z_t^T Σ z_t, instead of the standard deviation. The constraint of Equation (8) can be rewritten as a probability constraint similar to that of Equation (6),
Pr_{w ∼ N(μ, Σ)}[ y_t (w · z_t) ≥ 0 ] ≥ Φ( φ √(z_t^T Σ z_t) ),  (9)
which is equivalent to introducing a standard deviation term into the confidence parameter η of the CW algorithm. Compared with η, the new confidence parameter associated with the LCW algorithm is amplified or shrunk according to whether the standard deviation is larger or smaller than 1. Finally, a closed-form solution exists for Equation (8) and determines the update rule for the LCW algorithm (Equation (10a,b)), in which the mean vector moves along Σ_t z_t and the covariance matrix receives a rank-one correction, with nonnegative coefficients given in closed form by Equation (11a,b).
Note that the coefficient of the update is zero if and only if (μ_t, Σ_t) satisfies the constraint of Equation (8). Even though the real situation is more complicated, to better understand this update rule, we assume that Σ_t is diagonal on this round. Equation (10a) suggests that a weight with less confidence receives a more aggressive update of its mean value, and Equation (10b) implies that a weight with more confidence receives a less aggressive update of its variance. In addition, because both the prediction and learning parts of the LCW algorithm require extracting the feature vectors of instances, the z_t's, the feature space has to be restricted to a finite-dimensional space.
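Since the closed-form expressions of Equation (10a,b) and (11a,b) are not reproduced here, the following diagonal-covariance sketch uses a hypothetical step size and only mirrors the qualitative behavior just described: low-confidence weights receive larger mean updates, and every touched variance shrinks. It is not the exact LCW rule:

```python
def diagonal_update_sketch(mu, var, x, y, phi=1.0):
    """Qualitative sketch only: diagonal covariance, hypothetical step size.
    Mirrors Equation (10a,b) in spirit: weights with larger variance get
    larger mean updates, and every touched variance shrinks."""
    v = sum(vi * xi * xi for vi, xi in zip(var, x))   # margin variance
    m = sum(mi * xi for mi, xi in zip(mu, x))         # margin mean
    loss = max(0.0, phi * v - y * m)                  # linearized constraint
    if loss == 0.0:
        return mu, var                                # constraint met: no update
    alpha = loss / (v + 1.0 / (2.0 * phi))            # hypothetical step size
    mu = [mi + alpha * y * vi * xi for mi, vi, xi in zip(mu, var, x)]
    var = [1.0 / (1.0 / vi + 2.0 * alpha * phi * xi * xi)
           for vi, xi in zip(var, x)]
    return mu, var
```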
Kernelization of LCW
The kernel trick is to replace an inner product on a feature space with a kernel function k that implicitly realizes the inner product: k(x, x′) = z(x) · z(x′), where z(·) denotes the feature extraction.
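For instance, the Gaussian kernel realizes such an inner product implicitly; the bandwidth gamma below is an arbitrary illustrative choice:

```python
from math import exp

def gaussian_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2): an inner product taken
    implicitly in an infinite-dimensional feature space."""
    return exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
```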
By the kernel trick, we can kernelize the LCW algorithm to employ a kernel-based hypothesis and thus provide another way to make predictions. To see this, we first give the following lemma that provides new expressions for the mean vector and covariance matrix.
Lemma 1. Let the LCW algorithm initialize with the zero mean vector and the identity covariance matrix. The mean vector μ_t computed by the LCW algorithm can be expressed as a linear combination of feature vectors of instances, and the corresponding covariance matrix Σ_t can be expressed as a linear combination of outer products of feature vectors of instances. In addition, the combination coefficients, the a_i's and b_ij's, depend only on inner products of feature vectors of instances.
The lemma can be proved by induction, similar to the proof of Theorem 2 in ref. [22]. Let μ_t(·) and Σ_t(·, ·) be the mean and covariance functions on round t, respectively. Lemma 1 suggests that we can reexpress the linear hypothesis on round t as a kernel-based hypothesis characterized by Equation (14a,b): the mean function μ_t is a linear combination of kernels k(x_i, ·) centered on the support elements, and the covariance function Σ_t is a linear combination of products of such kernels, where x_i is the support element (SE) with index i and S_t is the index set of corresponding SEs. a_i is the combination coefficient associated with the SE x_i for constructing the mean function μ_t, and b_ij is the combination coefficient associated with the SE pair (x_i, x_j) for constructing the covariance function Σ_t. In this article, an instance that is used to form a hypothesis for prediction is called an SE instead of a support vector because we assume that the instance space X may contain any unstructured elements, including nonvector elements. Together, Lemma 1 and Equation (10a,b) give the update formulas of the kernel-based hypothesis, Equation (15a,b), whose coefficients are computed by Equation (11a,b) with inner products replaced by kernel evaluations. Note that the hypothesis stays unchanged if and only if the margin constraint is already satisfied. We summarize the kernel-based LCW algorithm in Algorithm 1.
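The evaluation of such a kernel-based hypothesis from stored SEs and coefficients can be sketched as follows (a linear kernel is used only to keep the numbers easy to check; the coefficient names follow the a_i and b_ij of the text):

```python
def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def mean_function(ses, a, x, k):
    """mu_t(x) = sum_i a_i * k(x_i, x) over the stored support elements."""
    return sum(ai * k(xi, x) for ai, xi in zip(a, ses))

def covariance_function(ses, b, x, xp, k):
    """Sigma_t(x, x') = sum_{i,j} b_ij * k(x_i, x) * k(x_j, x')."""
    n = len(ses)
    return sum(b[i][j] * k(ses[i], x) * k(ses[j], xp)
               for i in range(n) for j in range(n))
```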
Algorithm 1 Kernel-Based LCW Algorithm.
1: Input: kernel k, confidence parameter
2: Initialize: , , ,
3: for t = 1, 2, … do
4: Receive an instance
5: Predict the class label
6: Receive the correct class label
7: if the margin constraint is violated then
8:
9:
10:
11: else
12:
13: Update the mean and covariance functions by Equation (15a,b)
14: end if
15: end for
Kernel-Based LCW on a Budget
Since the kernel-based LCW algorithm employs a kernel-based hypothesis for prediction, all SEs and combination coefficients must be stored to construct the mean and covariance functions. On round t, the runtime and memory consumption depend on the size of the support index set S_t. Whenever the hypothesis fails to satisfy the optimization constraint (cf. line 7 in Algorithm 1), the size of the support index set increases. Hence, the computational burden may eventually grow without limit. This is known as the curse of kernelization,[45] which calls for a resource-efficient approach to make the kernel-based LCW algorithm realizable in practice, especially in a resource-scarce environment. To break the curse, we present a budgeted algorithm, which puts a limit on the maximum number of SEs used to construct the kernel-based hypothesis. The concept of the budgeted algorithm follows the spirit of the LCW algorithm, and thus we call it the budgeted LCW (BLCW) algorithm. Figure 2 illustrates the concept of the proposed BLCW algorithm. In the rest of this section, we first introduce the resource perspective on the kernel-based LCW algorithm. Based on this perspective, we propose the BLCW algorithm and then study four SE removal strategies for maintaining the budget.
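As a minimal sketch of budget maintenance, the eviction step before admitting a new SE might look like the following; the "oldest" and "random" choices mirror the Forgetron-style and RBP-style strategies cited in the introduction, whereas BLCW additionally projects away the information of the evicted SE rather than merely dropping it:

```python
import random

def maintain_budget(support, budget, strategy="oldest"):
    """Evict one SE before admitting a new one once |S| reaches the budget.
    'oldest' mirrors a Forgetron-style choice, 'random' an RBP-style one;
    BLCW would additionally project the evicted SE's information back in."""
    if len(support) < budget:
        return support
    if strategy == "oldest":
        return support[1:]                    # drop the earliest-added SE
    idx = random.randrange(len(support))      # drop a uniformly random SE
    return support[:idx] + support[idx + 1:]
```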
[IMAGE OMITTED. SEE PDF]
Resource Perspective
Let us first delve into the hypothesis used by the kernel-based LCW algorithm. The kernelization of LCW in Section 2.3 implies that Equation (14a,b) gives the mean and covariance functions of the stochastic process induced by the weight vector, since the mean function is the expectation of the process's values and the covariance function is the covariance between them. In addition, this process can be shown to be a Gaussian process.[46] Thus, the hypothesis used by the kernel-based LCW algorithm on round t is actually a Gaussian process characterized by the kernel-based mean and covariance functions of Equation (14a,b).
Similarly, the updated hypothesis is also a Gaussian process, characterized by another pair of kernel-based mean and covariance functions, Equation (15a,b).
As a consequence of the constrained optimization problem (Equation (8)), the update adjusts the combination coefficients associated with SEs and SE pairs. This motivates us to investigate whether we can somehow control the combination coefficients of available SEs and SE pairs directly and obtain a new update rule for the kernel-based LCW algorithm.
Wu et al.[47] introduced the resource perspective which gives us a clue about how to control the combination coefficients directly. With the resource perspective, we treat each instance as a possible resource that can be used as an SE for constructing a kernel-based hypothesis. On round t, we consider SEs as available resources which are utilized to construct the hypothesis for prediction; we consider the combination coefficients associated with SEs and SE pairs as utilization degrees of the SEs and SE pairs. When it comes to the update part, we first have to decide which instances are the available resources that are responsible for prediction for the next round and determine the corresponding utilization degrees. Instances which are not considered available resources for the updated hypothesis are discarded from the storage and cannot be used in future rounds anymore. It leaves us with the question of how to obtain a new update rule by applying the resource perspective while keeping the spirit of the LCW algorithm, which seeks minimum change in successive distributions while ensuring a sufficiently large margin.
Bearing the resource perspective in mind, we start from the kernel-based hypothesis. Assume the hypothesis on round t is a Gaussian process characterized by some kernel-based mean function and some kernel-based covariance function (Equation (20a–c)), where the mean function is a linear combination of kernels centered on a set of SEs, the covariance function is a linear combination of products of kernels centered on the same set of SEs, and the corresponding index set collects those SEs. The tilde symbol is used to emphasize that Equation (20a–c) and Equation (18a–c) may differ in SEs and combination coefficients, even over the same sequence of rounds.
After receiving the instance x_t, we predict the class label by using the mean function.
Once the correct class label y_t is received, we update the Gaussian process for the next round based on the current example (x_t, y_t). The spirit of the LCW algorithm suggests checking whether the current Gaussian process has a sufficiently large margin for the current example, that is, whether it satisfies the inequality of Equation (22).
If Equation (22) is satisfied, we keep the same Gaussian process for the next round (Equation (23a–c)).
Otherwise, without putting any constraint on the number of resources, the resource perspective suggests selecting all available instances, including the current instance, as available resources for constructing the new Gaussian process. We seek a new Gaussian process characterized by some kernel-based mean function and some kernel-based covariance function (Equation (24a–c)), where the mean function is a linear combination of kernels centered on the new set of SEs, the covariance function is a linear combination of products of kernels centered on the same new set of SEs, and the new index set collects those SEs. What remains is to determine the utilization degrees, the a_i's and b_ij's.
Note that the assumption of Equation (20b,c) allows us to decompose the covariance function, which implies that Equation (20a) can be treated as a linear hypothesis determined by a mean vector and a covariance matrix. Similarly, Equation (24b,c) can be decomposed so that Equation (24a) is a linear hypothesis determined by its own mean vector and covariance matrix. Following the spirit of the LCW algorithm, we seek the utilization degrees for the next round that solve the constrained optimization problem of Equation (27), where the objective function measures the difference between the two Gaussian processes and the constraint ensures that the updated Gaussian process has a sufficiently large margin on the current example (x_t, y_t). It is worth emphasizing that Equation (27) shares the same spirit as Equation (8), but they are different optimization problems, since in Equation (27) we seek the utilization degrees of available resources, not the mean vector and covariance matrix as in Equation (8). Finally, substituting the solution of Equation (27) back into Equation (24a–c) leads to the new Gaussian process for the next round.
It turns out that combining the resource perspective with the spirit of the LCW algorithm gives an update rule identical to that of the kernel-based LCW algorithm. Moreover, with a proper initialization, the resource-perspective approach produces the same prediction results over the entire sequence of rounds. A formal statement of the results is given in the following proposition.
Proposition 2. Suppose all kernel matrices of encountered instances are strictly positive definite. Let the approach that combines the resource perspective with the spirit of the LCW algorithm use the following setting.
1. Initialize the Gaussian process as follows
2. Predict the class label using the mean function on round t.
3. Update the Gaussian process for the next round by solving Equation (27) when the margin inequality (Equation (22)) is violated; otherwise, keep it the same via Equation (23a–c).
This setting follows the same prediction path made by the kernel-based LCW algorithm; that is, for every round t, the resulting Gaussian process coincides with the Gaussian process employed by the kernel-based LCW algorithm on round t. By the equality signs of Equation (30b,c), we mean that the corresponding mean and covariance functions agree.
The proof is given in the Appendix. Instead of mean vectors and covariance matrices of Gaussian distributions, the resource perspective takes the viewpoint of the utilization degrees of available resources for Gaussian processes. Proposition 2 ensures that the resource perspective and the kernel-based LCW algorithm use the same sequence of Gaussian processes for prediction. Hence, the resource perspective gives an alternative and equivalent interpretation of the kernel-based LCW algorithm. It allows us to propose an approximate algorithm for the kernel-based LCW algorithm under the constraint of a limited number of available resources.
BLCW Algorithm
The kernel-based LCW algorithm is subject to the curse of kernelization because the size of the Gaussian process for prediction, in terms of the size of the support index set, increases whenever the margin inequality (cf. line 7 in Algorithm 1) is violated. The violations are even more frequent in the presence of noise. This may cause unbounded growth of the Gaussian process in runtime and memory consumption and thus break the kernel-based LCW algorithm in practical applications where only finite computational resources are affordable. If we can control the size of the Gaussian process, we will be able to break the curse. Hence, we propose the BLCW algorithm, which puts a limit on the maximum number of SEs through a predefined budget B. In a practical application, we can set the budget B by considering the available computational power. Some sacrifice in prediction performance is unavoidable in the proposed BLCW algorithm. Fortunately, Proposition 2 suggests that the proposed BLCW algorithm can serve as an approximate algorithm of the kernel-based LCW algorithm according to the reinterpretation from the resource perspective.
To clarify how the proposed BLCW algorithm approximates the kernel-based LCW algorithm, we first assume the hypothesis on round t is a Gaussian process characterized by some kernel-based mean function and some kernel-based covariance function (Equation (31a–c)), where the mean function is a linear combination of kernels centered on a set of SEs, the covariance function is a linear combination of products of kernels centered on the same set of SEs, and the corresponding index set collects those SEs. If the hypothesis satisfies the margin constraint enforced by the kernel-based LCW algorithm, or the number of current SEs is smaller than the budget B, the proposed BLCW algorithm updates the hypothesis in the same way as the kernel-based LCW algorithm.
We call this kind of update the LCW update. Otherwise, we need to handle the situation where the hypothesis violates the margin constraint and, at the same time, the number of current SEs reaches the budget B. The resource perspective suggests that the BLCW algorithm remove one of the current SEs and set the current instance and the remaining SEs as the available resources for constructing the new Gaussian process for prediction on the next round. Let x_r be the removed SE and r be the corresponding index. Therefore, the BLCW algorithm seeks a new Gaussian process characterized by some kernel-based mean function and some kernel-based covariance function (Equation (33a–c)), where the mean function is a linear combination of kernels centered on the new set of SEs, the covariance function is a linear combination of products of kernels centered on the same new set of SEs, and the new index set collects the B available resources.
Note that the assumptions of Equation (31b,c) and (33b,c) allow us to have the decomposition of Equation (34a–d), whose terms can be derived accordingly; we omit their detailed expressions for simplicity. Equation (34a–d) implies that Equation (31a) and (33a) can be treated as linear hypotheses with Gaussian distributions.
Motivated by the resource perspective, we propose that the BLCW algorithm determine the utilization degrees of the B available resources by solving a constrained optimization problem, where the objective function measures the difference between the two Gaussian processes and the constraint ensures that the updated Gaussian process has a sufficiently large margin for the current example (x_t, y_t). We call this type of update the BLCW update. The BLCW update is stated formally by the following proposition.
Proposition 3. Suppose the kernel matrix of the B new SEs is strictly positive definite and the feature vector of the removed SE can be spanned by those of the B new SEs. If on round t the Gaussian process violates the margin constraint and the budget is reached, the BLCW algorithm updates according to Equation (37a–c), whose coefficients are given in closed form.
We give the proof in the Appendix. At first glance, the BLCW update seems rather complicated. However, it turns out to have a simple interpretation. For ease of understanding, we explain the update by using the equivalent linear hypothesis of Equation (37a–c) in the organized form of Equation (40a–f), which involves two orthogonal projections: one in the Euclidean norm sense and one in the Frobenius norm sense. The former projects the feature vector of the removed SE onto the subspace spanned by the feature vectors of the B new SEs; the latter projects the corresponding outer product onto the subspace spanned by outer products of the feature vectors of the B new SEs, where r is the index corresponding to the removed SE.
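The Euclidean-norm projection can be made concrete: the coefficients d that best represent the removed SE in the span of the kept SEs solve the linear system K d = k_r built from kernel evaluations (a Projectron-style computation; the explicit 2x2 solve below is our own illustrative simplification for two kept SEs):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def projection_coeffs(kept, removed, k):
    """Solve K d = k_r for two kept SEs, so that in the feature space the
    removed SE is approximated by d[0]*z(x_0) + d[1]*z(x_1).
    K is the 2x2 kernel matrix of the kept SEs; the solve is explicit."""
    K = [[k(a, b) for b in kept] for a in kept]
    kr = [k(a, removed) for a in kept]
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    d0 = (K[1][1] * kr[0] - K[0][1] * kr[1]) / det
    d1 = (K[0][0] * kr[1] - K[1][0] * kr[0]) / det
    return [d0, d1]
```

When the removed SE truly lies in the span of the kept SEs (as in the linear-kernel checks below), the projection is exact and no information is lost.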
Note that we split the mean and covariance parts of the BLCW update into two parts, Equation (40a,b), and four parts, Equation (40c–f), respectively; each part is enclosed by a pair of curly brackets. Carefully comparing Equation (40a–f) with the update of the kernel-based LCW algorithm, Equation (15a,b), we observe that the first parts, Equation (40a,c), together can be interpreted as an LCW update, and the remaining parts, Equation (40b) and (40d–f), correspond to projections related to the removed SE. According to the derivation in this section, we first select the removed SE and then determine the utilization degrees of the B available resources. However, the organized form (Equation (40a–f)) suggests an alternative interpretation of the BLCW update: it first performs the same update as the LCW update and then removes the unaffordable resource by projections, which preserve its information.
In Proposition 3, we include an assumption on the removed SE . The assumption allows to be represented by the B new SEs. Thus, the projection parts, Equation (40b) and (40d–f), vanish when this assumption holds, and the BLCW update is then equivalent to the LCW update. In practice, this assumption may not hold, and the resulting sacrifice shows up as projection errors. Finally, Algorithm 2 summarizes the proposed BLCW algorithm.
Remark. Although we develop and explain the BLCW update by using a finite-dimensional feature space mapped by the feature extraction , the proposed BLCW algorithm is presented in a pure kernel-based style. This means the proposed BLCW algorithm and the associated interpretations are applicable to any valid kernel function that realizes an inner product, for example, the Gaussian kernel, which corresponds to an infinite-dimensional feature space.
In Section 3.2, we propose the BLCW algorithm, which approximates the kernel-based LCW algorithm. The BLCW algorithm limits the maximum number of SEs by the budget B and performs the LCW update as long as the size of the Gaussian process for prediction is less than the budget, i.e., . Once the budget is reached, the BLCW algorithm performs the BLCW update whenever the Gaussian process has no sufficiently large margin on the current example. Let be the index corresponding to the removed SE on round t. The BLCW algorithm selects the removed SE from the set of current SEs, i.e., , without any specific removal strategy. Different removal strategies may lead to totally different prediction paths for the same sequence of prediction tasks; thus, the determination of the removal index is important. In this section, we study four kinds of removal strategies for maintaining the budget.
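The budget-maintenance control flow just described (perform ordinary updates while under budget; once the budget is reached, remove an SE before updating) can be sketched with a toy budgeted kernel learner. The class below is only an illustration of the bookkeeping: a simple mistake-driven kernel update stands in for the actual LCW/BLCW updates, and all names are ours, not the paper's notation.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

class ToyBudgetedLearner:
    """Illustrates the budget control flow only: the real BLCW update
    (optimizing utilization degrees, then projecting) is replaced here
    by a simple mistake-driven kernel update."""

    def __init__(self, budget):
        self.budget = budget
        self.sv = []      # support examples (SEs)
        self.alpha = []   # their coefficients ("utilization degrees")

    def score(self, x):
        return sum(a * gaussian_kernel(s, x) for a, s in zip(self.alpha, self.sv))

    def round(self, x, y, removal=lambda self: 0):
        if y * self.score(x) > 0:
            return                      # prediction is fine: passive round
        if len(self.sv) >= self.budget:
            r = removal(self)           # budget reached: drop an SE first
            del self.sv[r]
            del self.alpha[r]
        self.sv.append(x)               # update on the current example
        self.alpha.append(float(y))
```

With `removal=lambda self: 0` this drops the SE with the smallest index, i.e., the oldest-removal behavior; other strategies plug in as different `removal` callables.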
To facilitate the study, we consider the situation where the BLCW algorithm needs to perform a BLCW update on round t. The interpretation of the BLCW update in Section 3.2 suggests that we can separate out an intermediate Gaussian process that is obtained by performing the LCW update without the subsequent projections,
BLCW with Oldest Removal (BLCW-O)[48]
Because the BLCW algorithm deals with a sequence of consecutive rounds, adjacent examples may be more related than those far apart. This means that the SE with the smaller index may be less helpful in making predictions for future rounds. Thus, it suggests removing the oldest SE,
We now analyze the computational burden of a BLCW update. The calculation of the matrix inverse dominates the time complexity of BLCW-O. It takes time to compute directly from . However, we can speed up the computation of the matrix inverse by using a recursive approach that exploits the incremental and decremental nature of the kernel matrix. It takes time to compute from the last matrix inverse by two recursive steps: the first shrinks to a smaller matrix inverse of size , and the second enlarges it back to . Both recursive steps take time. Note that although the recursive approach reduces the computation of to time, it takes at most time to compute the remaining parts. Hence, the time complexity of BLCW-O is in total. Moreover, the space complexity of BLCW-O is since it needs to store , , , etc., which dominate the main memory usage.
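The two recursive steps (shrink the inverse after removing an SE, then enlarge it after appending the new one) follow from standard block-matrix inverse identities. The sketch below, with illustrative names and a symmetric positive definite kernel matrix assumed, shows both quadratic-time steps.

```python
import numpy as np

def inv_remove(Kinv, r):
    """Inverse of K with row/column r removed, computed from Kinv.
    After permuting r to the last position, Kinv = [[E, f], [f^T, g]]
    implies inv(K_minor) = E - f f^T / g (quadratic time)."""
    idx = [i for i in range(Kinv.shape[0]) if i != r]
    E = Kinv[np.ix_(idx, idx)]
    f = Kinv[idx, r]
    g = Kinv[r, r]
    return E - np.outer(f, f) / g

def inv_append(Kinv, k, kappa):
    """Inverse after appending a point with kernel column k and
    self-kernel kappa, via the block inverse and Schur complement."""
    v = Kinv @ k
    s = kappa - k @ v                  # Schur complement (> 0 for SPD K)
    m = Kinv.shape[0]
    new = np.empty((m + 1, m + 1))
    new[:m, :m] = Kinv + np.outer(v, v) / s
    new[:m, m] = -v / s
    new[m, :m] = -v / s
    new[m, m] = 1.0 / s
    return new
```

Chaining `inv_remove` and `inv_append` replaces the cubic-cost direct inversion with two quadratic steps per round, matching the recursive approach described above.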
BLCW with Mean Removal (BLCW-M)
Information loss may occur whenever a BLCW update is performed; the loss comes from the SE removal through the projections. In other words, minimizing the loss between the two Gaussian processes, and , may lead to better prediction performance. Because a prediction is made by using the mean function of a Gaussian process, the combination coefficients associated with SEs in the mean function affect the prediction result. We therefore suggest removing the SE with the smallest magnitude of combination coefficient in the mean function of , where the removed SE is selected from instead of .
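As a minimal sketch, the mean removal choice reduces to one `argmin` over the magnitudes of the mean-function coefficients. The coefficient vector here is illustrative; in the algorithm it comes from the intermediate Gaussian process obtained after the LCW update.

```python
import numpy as np

def mean_removal_index(alpha):
    """Index of the SE whose mean-function combination coefficient
    has the smallest magnitude (the mean removal heuristic)."""
    return int(np.argmin(np.abs(np.asarray(alpha, dtype=float))))
```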
Compared with BLCW-O, BLCW-M needs additional computation for ; it is dominated by the computation of and , which takes time. After determining the removed SE, the time and space complexities of the BLCW update for BLCW-M are the same as those of BLCW-O, i.e., time and space. Hence, both the time and space complexities of BLCW-M are .
BLCW with Divergence Removal (BLCW-D)
Let us consider another way to minimize the information loss. A Gaussian process is characterized by its mean and covariance functions, so the information loss should involve both. One way to measure the loss is to compute the divergence between two Gaussian processes. We suggest removing the SE that yields the smallest KL divergence based on the current example , where .
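For finite-dimensional Gaussians, the KL divergence used to score each candidate removal has a well-known closed form. The helper below states that standard formula as a sketch; how the paper restricts the comparison to the current example is not reproduced here.

```python
import numpy as np

def gaussian_kl(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) for multivariate Gaussians:
    0.5 * [ tr(S1^-1 S0) + (m1-m0)^T S1^-1 (m1-m0) - d + ln det S1 - ln det S0 ]."""
    m0, m1 = np.asarray(m0, float), np.asarray(m1, float)
    d = m0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + logdet1 - logdet0)
```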
For each , we need to compute the corresponding Gaussian process to calculate its divergence from the intermediate Gaussian process . Each takes the same time and space complexities as BLCW-O, i.e., time and space. Because is a by-product of the computation of any , the total time complexity of BLCW-D is dominated by the inspection of all 's. Hence, the time complexity of BLCW-D is in total. In addition, the space complexity of BLCW-D is still because the computation of the 's is performed in sequential order, not in parallel.
BLCW with Accuracy Removal (BLCW-A)
Because the goal of online learning is to make as many correct predictions as possible over the sequence of rounds, we need the updated Gaussian process to predict well for future rounds. One way to measure the prediction power of is to compute its accuracy on the available examples. This suggests removing the SE that yields the maximum accuracy on the available examples, or equivalently the minimum number of prediction mistakes, where and is the set of available examples.
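A sketch of the accuracy-based selection: given, for each candidate removal, a predictor for the resulting model, count mistakes on the available examples and keep the candidate with the fewest. The callable interface below is purely illustrative.

```python
def accuracy_removal_index(candidate_predictors, examples):
    """Pick the removal index whose resulting model makes the fewest
    prediction mistakes on the available labeled examples.

    candidate_predictors[r] maps an input x to a real-valued score for
    the model obtained after removing SE r (illustrative interface)."""
    def mistakes(predict):
        return sum(1 for x, y in examples if y * predict(x) <= 0)
    return min(range(len(candidate_predictors)),
               key=lambda r: mistakes(candidate_predictors[r]))
```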
For each , we need to compute the corresponding Gaussian process to measure its prediction power. Similarly, the total time complexity of BLCW-A is dominated by the inspection of all 's; thus, BLCW-A requires time. Of course, the space complexity of BLCW-A is still because the computation of the 's is performed in sequential order, not in parallel.
Experiments
We conducted various numerical experiments on real datasets to show the effectiveness of the proposed BLCW algorithm. The experimental results were organized into three phases. In phase 1, we showed that BLCW approximates LCW, which is used in this section as the abbreviation for the kernel-based LCW algorithm. In phase 2, we compared the four removal strategies described in Section 3.3. In phase 3, we demonstrated the competitiveness of BLCW with state-of-the-art budgeted online algorithms.
We used 10 typical real datasets, which were collected from practical applications and can be downloaded from the LIBSVM website. Table 1 summarizes the detailed information about these datasets. For each dataset, we generated 20 distinct sequences of T examples permuted at random; T was equal to the size of the dataset, except for the cod-rna and mnist datasets. Because the computing power of our equipment was limited, we set T to one-tenth of the dataset size and sampled at random for the cod-rna dataset. The mnist dataset contains handwritten digit images from zero to nine (10 categories in total), so we used it to generate a one-versus-rest problem: we retained all images of digit one and sampled images from the rest at random such that (6742 1's and 13 258 R's). The prediction performance was measured by the cumulative mistake rate averaged over 20 sequences. All compared algorithms were kernel-based algorithms with the Gaussian kernel, where the hyperparameter γ was chosen from . Similar to the approach used in ref. [33], γ was determined to give the best prediction performance with the Perceptron algorithm and was applied to all algorithms. We set the confidence parameter η of the CW, LCW, and BLCW algorithms to 0.55. All experiments were conducted with MATLAB R2017a.
Table 1 Detailed information of datasets: Liver disorders, Heart, Australian, Diabetes, German, svmguide3, svmguide1, a4a, cod-rna, and mnist (1 vs. R); for each dataset, the number of examples (T), the size, and the number of features (n) are listed. [TABLE DATA OMITTED. SEE PDF]
Approximation of LCW
We verified that BLCW approximates LCW by using the oldest removal strategy; the observations also hold true for the other three strategies. The cumulative mistake rate is plotted as a function of the number of encountered examples for LCW and BLCW-O in Figure 3. The average number of total SEs used by LCW after all T examples were received is given. Small budget values were set for BLCW-O so that BLCW-O would perform BLCW updates more frequently. For completeness, BLCW-O with the budget set to the average number of SEs of LCW is also provided.
[IMAGE OMITTED. SEE PDF]
Comparing the curves, we observed that the curve of LCW was almost always the lowest, implying that the prediction performance of BLCW-O was worse than that of LCW. Among the several budgets tested, BLCW-O with a larger budget tended to have a lower curve that was much closer to the curve of LCW; this means that BLCW-O with a larger budget achieved better prediction performance. Eventually, with a large enough budget, BLCW-O and LCW could not be distinguished. We also include the cumulative mistake rate versus the budget for BLCW-O and LCW in Figure 4. The result shows that the mistake rate of BLCW-O got closer to that of LCW as the budget increased. We conclude that BLCW approximates LCW.
[IMAGE OMITTED. SEE PDF]
Removal Strategies
Because different removal strategies may yield very different prediction results, we studied the four strategies proposed in Section 3.3 and attempted to suggest a default strategy for BLCW. The cumulative mistake rate is plotted as a function of the budget B for LCW and BLCW in Figure 4. Since LCW is a non-budgeted online algorithm, we depict it as a horizontal line. Note that, although at different paces, the curves of all four strategies tend to approach that of LCW as the budget increases. This observation validates our claim that BLCW approximates LCW. The curves of BLCW-O and BLCW-D overlap such that it is impossible to distinguish them. The reason is that we set the oldest SE as the removal default in our BLCW-D program, and the KL divergence adopted by BLCW-D was almost zero for every candidate SE; hence, it could not tell the difference among the candidate SEs. We might improve BLCW-D by using the KL divergence based on the available examples instead of the current example . However, the corresponding time complexity would grow so greatly that it was not worth trying. Among the four strategies, BLCW-M almost always achieved the lowest mistake rate and thus showed a relatively stable prediction performance. This is because BLCW-M deals with the combination coefficients of the mean function, which affect the prediction result directly. Hence, we recommend the mean removal strategy as the default for BLCW. In contrast, the curves of BLCW-A did not show any obvious performance benefit, because the in-sample error computed from a small sample is more prone to bias.
Competitiveness
Since the mean removal strategy showed the most stable prediction performance in Section 4.2, we compared it, as the representative of BLCW, with the following state-of-the-art budgeted online algorithms. RBP: the RBP algorithm with the random removal strategy;[32] Forgetron: the self-tuned Forgetron algorithm with shrinking and the oldest removal strategy;[31] Projectron: the Projectron algorithm using the projection strategy;[33] Projectron++: the aggressive version of the Projectron algorithm that includes the notion of margin;[33] FOGD: the FOGD algorithm using random Fourier features;[34] and NOGD: the NOGD algorithm using Nyström-based features.[34]
These algorithms were proposed for solving the problem of online binary classification. RBP randomly removes an SE from the set of current SEs when the budget constraint is reached and includes the current instance as an SE by the Perceptron update.[32] Since the removal of RBP is randomized, the performance of RBP was run and averaged over 10 times for each sequence. Forgetron shrinks the intermediate hypothesis obtained through the Perceptron update and then removes the oldest SE.[31] Both Projectron and Projectron++ bound the number of SEs of the kernel-based Perceptron by projection.[33] Projectron projects the kernel function centered on the current instance onto the subspace spanned by the kernel functions centered on the SEs whenever the magnitude of the projection error is less than a sparseness parameter η, while Projectron++ introduces the notion of margin on top of Projectron. The number of SEs of Projectron (or Projectron++) is bounded but cannot be known in advance. To have a fair comparison, we ran Projectron with different values of η; for each value of η, the corresponding number of SEs averaged over 20 sequences was set as the budget to run all budgeted online algorithms, except for Projectron++, which was run with η. FOGD approximates the shift-invariant kernel by the use of random Fourier features and updates the approximated linear hypothesis by the OGD algorithm.[34] NOGD approximates the kernel matrix through the Nyström method and updates the approximated linear hypothesis by the OGD algorithm.[34] Following the parameter setting in ref. [34], the number of Fourier components for FOGD was set to and the rank approximation parameter for NOGD was set as . The step size η for both algorithms was selected from randomly. Both algorithms were run and averaged over 10 times for each sequence.
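FOGD's random Fourier features admit a compact sketch. For the Gaussian kernel k(x, y) = exp(-γ ||x - y||²), sampling frequencies from N(0, 2γ I) yields features whose inner products approximate the kernel; the function name and the feature count D below are our own illustrative choices.

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng):
    """Map X (m x n) to D random Fourier features so that
    z(x) @ z(y) approximates exp(-gamma * ||x - y||^2)."""
    n = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n, D))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

An online linear learner such as OGD can then be run on z(x) at constant cost per round, which is how FOGD sidesteps the growing support set.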
There are many other well-known classification methods that might be used for comparison. For example, logistic regression,[29] SVMs,[49] and Gaussian process classification[46] are machine-learning-based classification methods; AlexNet,[50] GoogLeNet,[51] and ResNet[52] are deep-learning-based classification methods. These methods aim to learn a fixed classification model for new testing data by using training data that must be collected before learning, and thus fall in the realm of offline learning. Here, we only considered methods designed for online learning. Specifically, we included Perceptron, PA, CW, and LCW for comparison. All four algorithms were run with the kernel trick, and they belong to the non-budgeted online methods.
The cumulative mistake rate is plotted as a function of the budget B for budgeted and non-budgeted online algorithms in Figure 5. Although we did not intend to stress this through the experiments here, LCW outperformed the other non-budgeted methods on most of the datasets. More importantly, Figure 5 shows that the curve of BLCW-M was usually the lowest among those of the budgeted algorithms. This means that BLCW-M outperformed the other budgeted algorithms and thus supports the claim that our proposed BLCW algorithm is competitive with state-of-the-art budgeted online algorithms.
[IMAGE OMITTED. SEE PDF]
We conclude this section with a brief comparison of the time and space complexities per round for budgeted and non-budgeted online algorithms, as summarized in Table 2. Since Perceptron, PA, CW, and LCW are non-budgeted algorithms, their time and space complexities grow with the number of SEs, . Both complexities are linear in for Perceptron and PA, while quadratic for CW and LCW because they involve the covariance function. In addition, both complexities of all budgeted algorithms studied in this section, including the proposed BLCW algorithm, are constant once the budget value is set. While we do not see benefits from the complexity comparison at this point, BLCW-M outperformed the other state-of-the-art budgeted algorithms in terms of prediction performance on datasets collected from practical applications. Further reduction in complexity would make the proposed BLCW algorithm more accessible.
Table 2 Computational complexity per round (time and space) for budgeted and non-budgeted online algorithms: Perceptron, PA, CW, LCW, RBP, Forgetron, Projectron, Projectron++, FOGD, NOGD, BLCW-O, BLCW-M, BLCW-D, and BLCW-A. [TABLE DATA OMITTED. SEE PDF]
Conclusions
In this paper, we bring the resource perspective into the kernel-based LCW algorithm. The kernel-based LCW algorithm is reinterpreted as an online algorithm that employs a Gaussian process for prediction. The Gaussian process is characterized by mean and covariance functions that consist of kernels of available resources (i.e., SEs) and their utilization degrees. From the resource perspective, we propose the BLCW algorithm to approximate the kernel-based LCW algorithm. Given a limited number of available resources, the proposed BLCW algorithm optimizes their utilization degrees through a constrained optimization problem. We further study four removal strategies for budget maintenance and suggest the mean removal as the default. Through extensive experiments, we have verified that BLCW algorithm is indeed an approximation of the kernel-based LCW algorithm and shown its competitiveness to state-of-the-art budgeted online algorithms.
Ideally, the kernel-based LCW algorithm may serve as an optimal point with an infinite budget (), which corresponds to the situation where unlimited computational resources are supplied. In practice, computational resources are always limited, so we propose the BLCW algorithm as an alternative solution. However, the budget value alone cannot fully reveal the actual problem size, which depends on the type of data we have to deal with. To build a useful and advanced intelligent system, we need to investigate the impact of the budget on various datasets and problem types in future work. Compared with other budgeted algorithms, evaluations of the computational complexity and efficiency may be refined to address this issue. Eventually, we may come up with a set of application guidelines suggesting optimal budget sizes for different scenarios. We will also study how to further reduce the computational complexity of the BLCW algorithm and explore adaptive budget maintenance strategies that adjust to the characteristics of the data and the dynamics of the environment. Moreover, we hope to extend the current analysis to tackle multiclass classification and regression problems.
Appendix
Matrix Derivative Identity
Proofs of Propositions 2 and 3 involve several matrix derivative identities, one of which requires additional derivations and is provided as the following lemma.
Lemma 4. Suppose , , and are matrices that are independent of . Then, the matrix derivative of is
Proof. Denote by the th entry of a matrix and let . We prove the result by using the chain rule,[53]
The partial derivative of the th entry of with respect to the th entry of is
By Equation (A3), the partial derivative of with respect to the th entry of is, where is the ith column of and is the jth row of . By Equation (49) from ref. [53] and Equation (A4), the chain rule gives, where the second equality uses the fact that holds for any two matrices and with proper dimensions, and the third equality holds because a scalar value is equal to its transpose. Hence, we have, which completes the proof.
Proof of Proposition 2
We prove the result by induction. The initialization implies that Equation (30a–c) is valid for the first round (i.e., ). Suppose Equation (30a–c) holds for round t; we proceed to prove the case for round . We only need to handle the situation where the Gaussian process violates the margin inequality (Equation (22)),
Note that and . Let with and , and . Denote by the set of symmetric matrices. We rearrange Equation (25a,b) and (26a,b) into the following matrix representation, where
By substituting Equation (A8a–d) into Equation (27), we obtain the following constrained optimization problem
Because both the objective and constraint functions of Equation (A9) are convex in , Equation (A9) is a convex optimization problem. We can thus turn to finding a solution satisfying the Karush–Kuhn–Tucker (KKT) conditions.[44] First, we define the Lagrangian of Equation (A9) to be, where τ is the Lagrange multiplier associated with the inequality constraint and consists of the remaining terms that do not depend on .
Setting the partial derivatives of L with respect to to zero gives, from which we obtain a solution as follows, or
Multiplying both sides of the aforementioned equation by from the left and substituting with Equation (A8b) lead to, where and . Note that although this matrix multiplication is not reversible, we can still find a solution satisfying Equation (A11) later. Let . Note that because is the last column of . Multiplying both sides of Equation (A14) by from the left gives, which indeed satisfies Equation (A11). This can be verified by directly substituting Equation (A15) and (A8b) back into Equation (A11).
Next, we deal with the partial derivatives for . Omitting the symmetry of for now, we calculate the partial derivatives of L with respect to and set them to zero, where the first equality uses Equation (70) and (101) from ref. [53] and Lemma 4. Equation (A16) gives a solution as follows
We calculate the inverse of both sides of the previous equation by applying the Sherman–Morrison formula and obtain, with, where . Multiplying both sides of Equation (A18) by from the left and from the right and substituting with Equation (A8b) lead to
Similarly, these two matrix multiplications are not reversible; however, we can find a solution satisfying Equation (A16) later. Multiplying both sides of the previous equation by from the left and right gives, which can be shown to satisfy Equation (A16) by using Equation (A18). Moreover, we observe that Equation (A21) ensures the symmetry of .
The KKT conditions for the convex optimization problem (Equation (A9)) imply that the constraint in Equation (A9) holds with equality, where . Substituting Equation (A15), (A19), and (A21) into the previous equation gives, where and we have used the fact that . Following a derivation similar to the subsequent derivation for Equation (17) from ref. [21], we obtain
We have shown that there exists a solution satisfying the KKT conditions for Equation (A9), i.e., Equation (A15), (A21), and (A24). Moreover, substituting Equation (A8a–d), (A15), (A19), (A21), and (A24) into Equation (24b–c) and comparing the result with Equation (15a,b), we can observe that 1) and 2) and have the same combination coefficients of SEs, and 3) and have the same combination coefficients of SE pairs. These observations imply that Equation (30a–c) holds for round and thus completes the proof.
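The Sherman–Morrison steps used in the proofs can be sanity-checked numerically. The generic helper below (our own names, not the paper's notation) recovers the inverse of a rank-one update from the original inverse in quadratic time:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """(A + u v^T)^{-1} computed from A^{-1} via the Sherman–Morrison
    formula: A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```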
Proof of Proposition 3
First, we write down the detailed expressions of , , , and associated with the decomposition (Equation (34a–d)),
By a change of variables with Equation (A25a–d), we turn to finding the increments in the utilization degrees of the B available resources; and are the increments associated with and , respectively.
Note that with and let . Denote by the set of symmetric matrices. For convenience, we rearrange Equation (A25b–d) into the following matrix representation, where
Note that for we write out those terms that depend on the removed SE . By substituting Equation (A26b,c) into Equation (36), we get the following constrained optimization problem
Because both the objective and constraint functions of Equation (A27) are convex, Equation (A27) is also a convex optimization problem. Hence, we find a solution satisfying the KKT conditions.[44] The Lagrangian of Equation (A27) is defined as, where τ is the Lagrange multiplier associated with the inequality constraint and consists of the remaining terms that do not depend on .
Setting the partial derivatives of L with respect to to zero gives, which provides a solution as follows, or
Multiplying both sides of the above equation by from the left and substituting with Equation (A26a) lead to, where we recall , , , and . This matrix multiplication is not reversible, but we can still find a solution satisfying Equation (A29) later. Note that . Multiplying both sides of Equation (A32) by from the left gives, which turns out to be a solution of Equation (A29). This can be verified by substituting Equation (A33) back into Equation (A31) and invoking the assumption on the feature vector of , .
Now, we turn to handling the partial derivatives for . Again, omitting the symmetry of for now, we compute the partial derivatives of L with respect to and set them to zero, where the first equality uses Equation (70) and (101) from ref. [53] and Lemma 4. Equation (A34) provides a solution as follows
We compute the inverse of both sides of the previous equation by using the Sherman–Morrison formula and get, with, where . Multiplying both sides of Equation (A36) by from the left and by from the right and substituting Equation (A26a) lead to, with
Similarly, these two matrix multiplications are not reversible, but we can still find a solution satisfying Equation (A34) later. Multiplying both sides of Equation (A38) by from the left and right gives, with
Note that Equation (A40) is a solution of Equation (A34). We can verify it by substituting Equation (A40) and (A41) back into Equation (A36) and adopting the assumption for the feature vector of , . Moreover, Equation (A40) is symmetric.
The KKT conditions for the convex optimization problem (Equation (A27)) imply that the constraint in Equation (A27) holds with equality,
By substituting Equation (A33), (A37), (A40), and (A41) into the previous equation and rearranging with some effort, we get, where and we have used the fact that . Again, by following a derivation similar to the subsequent derivation for Equation (17) from ref. [21], we get
Note that Equation (A33), (A40), and (A44) are a solution satisfying the KKT conditions for Equation (A27). Therefore, Equation (A33), (A37), (A40), (A41), and (A44) give the BLCW update rule. ▪
Acknowledgements
This work received funding from various sources, including the National Science and Technology Council (grants: 110-2811-M-A49-550-MY2, 112-2811-M-A49-557-, 112-2634-F-A49-003-, 113-2118-M-A49-007-MY2, 113-2923-M-A49-004-MY3), the Higher Education Sprout Project of the National Yang Ming Chiao Tung University from the Ministry of Education, Taiwan, and the Biomedical Artificial Intelligence Academy of the Kaohsiung Medical University, Taiwan. We also thank Wan-Yi Tai for her valuable assistance and acknowledge the National Center for High-performance Computing for providing computing resources.
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Y. Freund, R. E. Schapire, Mach. Learn. 1999, 37, 277.
C. Gentile, Mach. Learn. 2003, 53, 265.
J. Kivinen, A. J. Smola, R. C. Williamson, IEEE Trans. Signal Process. 2004, 52, 2165.
N. Cesa‐Bianchi, G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, Cambridge 2006.
S. Shalev‐Shwartz, Found. Trends Mach. Learn. 2012, 4, 107.
S. C. H. Hoi, J. Wang, P. Zhao, J. Mach. Learn. Res. 2014, 15, 495.
S. C. H. Hoi, D. Sahoo, J. Lu, P. Zhao, Neurocomputing 2021, 459, 249.
E. Hazan, Introduction to Online Convex Optimization, arXiv:1909.05207 2019.
K. Jun, F. Orabona, in Proc. COLT’19, Phoenix, AZ 2019, pp. 1802–1823.
E. Hazan, E. Minasyan, in Proc. COLT’20, Virtual Event, Graz, Austria 2020, pp. 1877–1893.
H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, J. Kubica, in Proc. KDD’13, Chicago, IL 2013, pp. 1222–1230.
J. Lu, S. C. H. Hoi, J. Wang, in Proc. ACML’13, Canberra, ACT, Australia 2013, pp. 325–340.
J. Ma, L. K. Saul, S. Savage, G. M. Voelker, in Proc. ICML’09, Montreal, Quebec, Canada 2009, pp. 681–688.
B. Li, P. Zhao, S. C. H. Hoi, V. Gopalkrishnan, Mach. Learn. 2012, 87, 221.
F. Rosenblatt, Psychol. Rev. 1958, 65, 386.
A. B. J. Novikoff, in Proc. Symp. on the Mathematical Theory of Automata, New York, NY 1962, Vol. 12, pp. 615–622.
C. Gentile, J. Mach. Learn. Res. 2001, 2, 213.
M. Zinkevich, in Proc. ICML’03, Washington, DC 2003, pp. 928–936.
K. Crammer, O. Dekel, J. Keshet, S. Shalev‐Shwartz, Y. Singer, J. Mach. Learn. Res. 2006, 7, 551.
M. Dredze, K. Crammer, F. Pereira, in Proc. ICML’08, Helsinki, Finland 2008, pp. 264–271.
K. Crammer, M. Dredze, F. Pereira, J. Mach. Learn. Res. 2012, 13, 1891.
K. Crammer, M. Dredze, F. Pereira, in Proc. NIPS’08, Vancouver, British Columbia, Canada 2008, pp. 345–352.
J. Wang, P. Zhao, S. C. H. Hoi, in Proc. ICML’12, Edinburgh, Scotland, UK 2012.
J. Wang, P. Zhao, S. C. H. Hoi, ACM Trans. Intell. Syst. Technol. 2016, 8, 15:1.
K. Crammer, Y. Singer, J. Mach. Learn. Res. 2003, 3, 951.
S. Matsushima, N. Shimizu, K. Yoshida, T. Ninomiya, H. Nakagawa, in Proc. SDM’10, Columbus, OH 2010, pp. 303–314.
K. Crammer, M. Dredze, A. Kulesza, in Proc. EMNLP’09, Singapore 2009, pp. 496–504.
B. Schölkopf, A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA 2001.
C. M. Bishop, Pattern Recognition and Machine Learning, Springer‐Verlag, Berlin 2006.
K. Crammer, J. Kandola, Y. Singer, in Proc. NIPS’03, Vancouver and Whistler, British Columbia, Canada 2003, pp. 225–232.
O. Dekel, S. Shalev‐Shwartz, Y. Singer, J. Comput. 2008, 37, 1342.
G. Cavallanti, N. Cesa‐Bianchi, C. Gentile, Mach. Learn. 2007, 69, 143.
F. Orabona, J. Keshet, B. Caputo, J. Mach. Learn. Res. 2009, 10, 2643.
J. Lu, S. C. H. Hoi, J. Wang, P. Zhao, Z.‐Y. Liu, J. Mach. Learn. Res. 2016, 17, 1.
J. Wang, S. C. H. Hoi, P. Zhao, J. Zhuang, Z. Liu, in Proc. IJCAI’13, Beijing, China 2013, pp. 1750–1756.
J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, A. Bouchachia, ACM Comput. Surv. 2014, 46, 44:1.
A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, J. Mach. Learn. Res. 2010, 11, 1601.
A. Bifet, G. Holmes, B. Pfahringer, in Proc. ECML PKDD’10, Barcelona, Spain 2010, pp. 135–150.
H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes, T. Abdessalem, Mach. Learn. 2017, 106, 1469.
A. Cano, B. Krawczyk, Mach. Learn. 2020, 109, 175.
B. Krawczyk, A. Cano, in Proc. IJCAI’19, Macao, China 2019, pp. 2763–2771.
L. Korycki, A. Cano, B. Krawczyk, in Proc. IEEE BigData’19, Los Angeles, CA 2019, pp. 2334–2343.
S. You, H. Lin, in Proc. PAKDD’16, Auckland, New Zealand 2016, pp. 115–126.
S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, NY 2004.
Z. Wang, K. Crammer, S. Vucetic, J. Mach. Learn. Res. 2012, 13, 3103.
C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA 2006.
Online learning aims to solve a sequence of consecutive prediction tasks by leveraging the knowledge gained from previous tasks. Linearized confidence-weighted (LCW) learning is the first online learning algorithm to introduce the concept of weight confidence into the prediction model through distributions over weights, giving individual weights the flexibility to update their values at different scales. The kernel trick can be applied to LCW for better prediction performance. However, the kernel-based LCW algorithm is subject to the curse of kernelization, which leaves it vulnerable to unlimited growth of the prediction model in runtime and memory consumption. In this study, we present the budgeted LCW (BLCW) algorithm, which limits this growth to a predefined budget via optimization: BLCW performs the LCW update and then reduces the resulting information loss by projection. Based on a resource perspective that reinterprets LCW in terms of resources and their utilization degrees, we show that BLCW approximates the kernel-based LCW algorithm. We evaluate four budget maintenance strategies and find that mean removal is the most stable. Through numerical experiments on various real datasets, we demonstrate that BLCW performs competitively and effectively compared with leading budgeted online algorithms.
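The two-phase idea in the abstract — perform the usual kernel-based update, then enforce the budget by removing an example and projecting its contribution onto the remaining support set — can be illustrated with a minimal sketch. This is not the BLCW algorithm itself: the class name, the Perceptron-style mistake-driven update standing in for the LCW update, and the oldest-first removal rule are all simplifying assumptions for illustration; only the budget-then-project structure mirrors the description above.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    # RBF kernel between two vectors
    d = x - y
    return np.exp(-gamma * np.dot(d, d))

class BudgetedKernelLearner:
    """Illustrative budgeted online kernel learner (hypothetical names).

    Sketches the abstract's two phases: a kernel-based update, then,
    when the support set exceeds the budget, removal of one example
    with its contribution projected onto the remaining ones."""

    def __init__(self, budget=50, gamma=0.5):
        self.budget = budget
        self.gamma = gamma
        self.sv = []      # support vectors
        self.alpha = []   # their coefficients

    def decision(self, x):
        return sum(a * rbf(s, x, self.gamma)
                   for s, a in zip(self.sv, self.alpha))

    def update(self, x, y):
        # mistake-driven update (a Perceptron-style stand-in
        # for the actual LCW update)
        if y * self.decision(x) <= 0:
            self.sv.append(np.asarray(x, dtype=float))
            self.alpha.append(float(y))
            if len(self.sv) > self.budget:
                self._reduce()

    def _reduce(self):
        # remove the oldest support vector (a simplifying choice) and
        # project its contribution onto the rest by kernel least squares,
        # reducing the information lost by the removal
        x_r, a_r = self.sv.pop(0), self.alpha.pop(0)
        K = np.array([[rbf(s, t, self.gamma) for t in self.sv]
                      for s in self.sv])
        k_r = np.array([rbf(s, x_r, self.gamma) for s in self.sv])
        # coefficients d minimizing the RKHS distance between
        # a_r * k(x_r, .) and sum_i d_i k(s_i, .)
        d = np.linalg.solve(K + 1e-8 * np.eye(len(self.sv)), a_r * k_r)
        self.alpha = [a + di for a, di in zip(self.alpha, d)]
```

The projection step solves a small regularized kernel least-squares system, so the model size stays bounded by the budget while the discarded example's influence is partially retained.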
Title: Online Linearized Confidence-Weighted Learning on a Budget

Author affiliations:
1 Biomedical Artificial Intelligence Academy, Kaohsiung Medical University, Kaohsiung, Taiwan; Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
2 Institute of Data Science and Engineering, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
3 Biomedical Artificial Intelligence Academy, Kaohsiung Medical University, Kaohsiung, Taiwan; Institute of Statistics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan; Institute of Data Science and Engineering, National Yang Ming Chiao Tung University, Hsinchu, Taiwan; School of Post-Baccalaureate Medicine and Chung-Ho Memorial Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan; Department of Statistics and Data Science, Cornell University, Ithaca, NY, USA
4 Institute of Electronics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan