Shao-Gao Lv 1 and Jin-De Zhu 2
Recommended by Gerd Teschke
1 Statistics School, Southwestern University of Finance and Economics, Chengdu 611130, China
2 The 2nd Geological Party of the Bureau of Geology and Mineral Resources of Henan, Jiaozuo 450000, China
Received 24 April 2012; Accepted 10 July 2012
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
1.1. Overview of Multiple Kernel Learning
Kernel methods such as support vector machines (SVMs) have been extensively applied to supervised learning tasks such as classification and regression. The performance of a kernel machine largely depends on the data representation induced by the choice of kernel function. Hence, one central issue in kernel methods is the problem of kernel selection; a great many approaches to selecting a suitable kernel have been studied in the literature; see [1-4] and the references therein.
We begin by reviewing the classical supervised learning setup. Let (X, d) be a compact metric space and Y ⊆ ℝ. Given a labeled sample z = {(x_i, y_i)}_{i=1}^m ⊆ Z := X × Y, drawn i.i.d. according to an unknown distribution ρ supported on Z, the goal is to estimate a real-valued function f_z, depending on the sample, that generalizes well on new and unseen data. A widely used approach to estimating a function from empirical data consists in minimizing a regularization functional over a Hilbert space ℋ of real-valued functions X → ℝ. Typically, a regularization scheme estimates f as a minimizer of the functional
f_z := arg min_{f ∈ ℋ} { ℰ_z(f) + λ Ω(f) },    (1.1)
where ℰ_z(f) = (1/m) ∑_{i=1}^m V(f(x_i), y_i) is the empirical risk of the hypothesis f, measured by a nonnegative loss function V : ℝ × Y → ℝ_+. In addition, Ω : ℋ → ℝ is a regularizer and λ > 0 is a trade-off regularization parameter.
In this paper, we assume that ℋ is a reproducing kernel Hilbert space (RKHS) ℋ_K with kernel K; see [5]. Every kernel K corresponds to a feature mapping Ψ_K : X → ℋ_K satisfying K(x, y) = ⟨Ψ_K(x), Ψ_K(y)⟩_K, and each element of ℋ_K has the form
f_w(x) = ⟨w, Ψ_K(x)⟩_K,  w ∈ ℋ_K.    (1.2)
When the regularizer takes the form Ω(f) = ∥f∥_K², the resulting scheme has been studied from many perspectives, including statistics, optimal recovery, and machine learning; see [6-9] and the references therein. Regularization in an RKHS has a number of attractive features, including the availability of effective error bounds and of stability analyses with respect to perturbations of the data (see Cucker and Smale [7], Wu et al. [10], and Bousquet and Elisseeff [6]). Moreover, by the representer theorem, the optimization problem (1.1) over an RKHS reduces to a search in a finite-dimensional space: every minimizer can be written as f_z(·) = ∑_{i=1}^m c_i K(x_i, ·) for some coefficients c_1, ..., c_m ∈ ℝ. Although this result is simple to prove, it shows that the variational problem (1.1) is computationally tractable.
Because of their simplicity and generality, kernels and their associated RKHSs play an increasingly important role in machine learning, pattern recognition, and artificial intelligence. When the kernel is fixed, an immediate concern is the choice of the regularization parameter λ, which is typically handled by cross-validation or generalized cross-validation [11]. However, the performance of kernel methods also relies critically on the choice of the kernel function itself. A natural question is how to choose the optimal kernel from a collection of candidate kernels.
Kernel learning can range from selecting the width parameter of a Gaussian kernel [9, 12] to obtaining an optimal linear combination of a finite set of candidate kernels. The latter is often referred to as multiple kernel learning (MKL) in machine learning and as the nonparametric group Lasso in statistics [13]. Lanckriet et al. [3] pioneered work on MKL and proposed a semidefinite programming approach to automatically learning a linear combination of candidate kernels in the SVM setting. To improve computational efficiency, the multikernel class was further restricted to convex combinations of kernels [2, 14, 15]. Most kernel learning algorithms consider linear kernel mixtures K_θ = ∑_k θ_k K_k (θ_k ≥ 0) built from prescribed kernels K_1, ..., K_M. For notational simplicity, we frequently write Ψ_k instead of the standard feature map Ψ_{K_k}. Compared with (1.1), the primal model for learning with multiple kernels is extended to [figure omitted; refer to PDF]
In this paper, we mainly focus on lp-norm MKL, which minimizes the regularized empirical risk with respect to the optimal kernel mixture ∑_{k=1}^M θ_k K_k, together with an lp-regularizer on θ to avoid overfitting. This leads to the following optimization problem: [figure omitted; refer to PDF] This scheme was introduced in [2], and the existence of its minimum was discussed in [4].
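To make these objects concrete, the following Python sketch evaluates a regularized empirical risk of the same flavor: least-square loss, plus λ times the RKHS norm of f under the mixture K_θ, plus a μ-weighted lp penalty on θ. The toy data, the Gaussian candidate kernels, and the expansion-coefficient parametrization of f are assumptions of the illustration only; this is not the authors' algorithm.

import numpy as np

def gaussian_gram(X, width):
    # Gram matrix of a Gaussian kernel with the given width (toy candidate kernel)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * width ** 2))

def mkl_objective(alpha, theta, grams, y, lam, mu, p):
    # f(x_i) = sum_j alpha_j K_theta(x_j, x_i) with kernel mixture K_theta = sum_k theta_k K_k
    K_theta = sum(t * K for t, K in zip(theta, grams))
    f_vals = K_theta @ alpha
    empirical_risk = np.mean((f_vals - y) ** 2)      # least-square empirical risk
    rkhs_norm_sq = alpha @ K_theta @ alpha           # ||f||_{K_theta}^2
    lp_penalty = np.sum(np.abs(theta) ** p) ** (1.0 / p)
    return empirical_risk + lam * rkhs_norm_sq + mu * lp_penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                             # m = 20 inputs in R^2
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
grams = [gaussian_gram(X, w) for w in (0.5, 1.0, 2.0)]   # M = 3 candidate kernels
theta = np.array([0.5, 0.3, 0.2])
alpha = rng.normal(size=20)
print(mkl_objective(alpha, theta, grams, y, lam=0.1, mu=0.1, p=2))

An actual solver would alternately optimize f for fixed θ and θ for fixed f, as in the interleaved strategies of [2]; the sketch only evaluates the objective at a given point.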
This optimization problem subsumes state-of-the-art approaches to multiple kernel learning, covering sparse and nonsparse MKL through arbitrary lp-norm regularization (1 ≤ p ≤ ∞) on the mixing coefficients, as well as the incorporation of prior knowledge through nonisotropic regularizers. Kloft et al. [2] developed two efficient interleaved optimization strategies for lp-norm multiple kernel learning; this interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on real-world problems from computational biology. An analysis of this model based on Rademacher complexities was first developed by Cortes et al. [1]. Later, improved rates of convergence were derived from the theory of local Rademacher complexities [15]. However, the estimate on local Rademacher complexities for 1 ≤ p < 2 strictly depends on a no-correlation assumption among the M different features, which is too strong a condition both in theory and in practice. In this paper, we employ the notion of empirical covering numbers to present a theoretical analysis of the generalization error. Besides not requiring the no-correlation condition, the empirical covering number is a tight upper bound for local Rademacher complexities [16] and is independent of the underlying distribution. We will see that satisfactory learning rates are established when the regularization parameter is chosen appropriately. The interaction between the sample error and the approximation error plays an important role in our analysis, and our methodology mainly depends on the complexity of the hypothesis class, measured by empirical covering numbers, and on the regularity of the target function.
It should be pointed out that the Tikhonov regularization in (1.4) has two regularization parameters (λ, μ), which may be hard to tune in practice. Fortunately, an alternative approach has been studied by Rakotomamonjy et al. [14] and Kloft et al. [2]. More precisely, this approach imposes ∥θ∥_p ≤ 1 as an additional constraint in the optimization problem rather than as a penalty. By substituting w_k for θ w_k, they arrive at the following problem: [figure omitted; refer to PDF]
1.2. Algorithm and Main Consequence
The following lemma (see [4]) indicates that the above multikernel class can equivalently be represented as a block-norm regularized linear class in the product Hilbert space ℋ = ℋ_{K_1} × ⋯ × ℋ_{K_M}.
Lemma 1.1.
If p > 0, q = 1 + (1/p), and {a_j, j ∈ ℕ_n} ⊆ ℝ, then [figure omitted; refer to PDF] and equality occurs for ∑_{j ∈ ℕ_n} |a_j| > 0 at [figure omitted; refer to PDF]
Hence, Lemma 1.1 can be applied to the feature mapping Ψ : x ∈ X → (Ψ_1(x), ..., Ψ_M(x)) ∈ ℋ_{K~} associated with a kernel K~; the class of functions defined above coincides with [figure omitted; refer to PDF] when p ∈ [1, ∞] and q ∈ [1, 2], which follows from q = 2p/(p+1). The l_{2,q}-norm is defined here as ∥w∥_{2,q} = (∑_{j=1}^M ∥w_j∥_{K_j}^q)^{1/q}. For simplicity, we write ∥f_w∥_{2,q} = ∥w∥_{2,q}. Clearly, the complexity of the class (1.8) is greater than that of a class based on a single kernel only; it provides greater learning ability, while the computational cost increases accordingly. The sample complexity of the above hypothesis space has been studied by Cortes et al. [1] and Kloft and Blanchard [15]. Thus, the primal MKL optimization problem (1.5) is equivalent to the following regularization scheme, which is the primary object of investigation in this paper:
f_z := arg min_{f_w, w ∈ ℋ} { ℰ_z(f_w) + λ ∥w∥_{2,q}² },  where f_w(x) = ⟨w, Ψ(x)⟩ = ∑_{k=1}^M ⟨w_k, Ψ_k(x)⟩_{K_k}.    (1.9)
Here we use the symbol "min" instead of "inf," since (1.4) is equivalent to (1.9) and the solution of (1.4) exists and is unique. Note that the above algorithm is a standard regularized empirical risk minimization; this implies that the lp-norm multiple kernel learning scheme can be free of overfitting, a phenomenon which occurs when the empirical error is zero but the expected error is far from zero.
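For illustration, a small Python sketch of the block l_{2,q}-norm appearing in (1.8)-(1.9), with plain Euclidean norms standing in for the RKHS norms ∥w_j∥_{K_j} (a simplifying assumption of the example):

import numpy as np

def block_norm_2q(blocks, q):
    # l_{2,q} block norm: (sum_j ||w_j||_2^q)^(1/q)
    return np.sum([np.linalg.norm(w) ** q for w in blocks]) ** (1.0 / q)

# q = 2p/(p+1): p = 1 gives q = 1 (the sparsity-inducing l_{2,1} norm),
# while p -> infinity gives q -> 2 (the uniform l_{2,2} norm).
w = [np.array([1.0, 2.0]), np.array([0.0, 0.0]), np.array([3.0])]
for p in (1.0, 2.0, 1e6):
    q = 2 * p / (p + 1)
    print(p, q, block_norm_2q(w, q))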
In the following, we assume that {K_j}_{j=1,...,M} is uniformly bounded, that is,
κ := max_{1 ≤ j ≤ M} sup_{x ∈ X} √(K_j(x, x)) < ∞.    (1.10)
Also suppose that each K_j is continuous. In other words, each K_j is a Mercer kernel with bound κ; we refer to [17] for more properties and discussions of Mercer kernels.
In this paper, we only focus on the least-square loss V(f(x), y) = (f(x) − y)². Accordingly, the target function is the regression function
f_ρ(x) = ∫_Y y dρ(y | x),    (1.11)
where ρ(· | x) denotes the conditional distribution of ρ at x. Throughout this paper we assume that ρ(· | x) is supported on [−T, T]; it follows that |f_ρ(x)| ≤ T for x ∈ X almost surely. Since the learner f_z may be much larger than f_ρ, it is natural to apply a projection operator to f_z, which was introduced into learning algorithms to improve learning rates.
Definition 1.2.
The projection operator π is defined on the space of measurable functions f : X → ℝ as
π(f)(x) = T if f(x) > T;  π(f)(x) = f(x) if −T ≤ f(x) ≤ T;  π(f)(x) = −T if f(x) < −T,    (1.12)
where T > 0 is called the projection level.
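In code, the projection operator is just a truncation of the predicted values to [−T, T]; a one-line Python sketch:

import numpy as np

def project(f_values, T):
    # pi(f): clip predictions to the interval [-T, T]
    return np.clip(f_values, -T, T)

print(project(np.array([-3.2, 0.4, 5.0]), T=1.0))    # [-1.   0.4  1. ]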
The target of the error analysis is to understand how π(f_z) approximates the regression function f_ρ. More precisely, we aim to estimate the excess generalization error
ℰ(π(f_z)) − ℰ(f_ρ)    (1.13)
for the lp-norm MKL algorithm (1.4), where ℰ(f) = E(f(x) − y)² denotes the expected error of f.
To convey some ideas of our error analysis, we first state learning rates for (1.4) in the special case when f_ρ ∈ ℋ_{K~} and K~ is C^∞ on X ⊂ ℝ^n.
Theorem 1.3.
Let f_z be defined by (1.9). Assume K~ is C^∞ on X ⊂ ℝ^n and f_ρ ∈ ℋ_{K~}. For any 0 < δ < 1 and ε > 0, with confidence 1 − δ, there holds [figure omitted; refer to PDF] where C~ is some constant independent of m or δ.
Theorem 1.3 can be viewed as a corollary of our main result presented in Section 4. The rate can be made arbitrarily close to 𝒪(m^{−1}) by choosing ε small enough, which is the best convergence rate in the learning theory literature.
2. Key Error Analysis
Our main result concerns learning rates for (1.4), stated under conditions on the approximation ability of ℋ_{K~} with respect to f_ρ and on the capacity of ℋ_{K~}.
The approximation ability of the hypothesis space ℋ_{K~} with respect to f_ρ in the space L²_{ρ_X} is measured by the regularization error.
Definition 2.1.
The regularization error of the triple (ℋ_{K~}, f_ρ, ρ_X) is defined as
𝒟_q(λ) := min_{f_w, w ∈ ℋ} { ∥f_w − f_ρ∥²_{L²_{ρ_X}} + λ ∥w∥_{2,q}² }.    (2.1)
We will assume that, for some 0 < β ≤ 1 and C_β > 0,
𝒟_q(λ) ≤ C_β λ^β  for all λ > 0.    (2.2)
Remark 2.2.
Our assumption implies that, when f_ρ is approximated by functions in ℋ_{K~}, the regularization error 𝒟_q(λ) tends to zero with a polynomial rate of decay as λ goes to zero. Note [7] that 𝒟_q(λ) = o(λ) would imply f_ρ = 0, so β = 1 in (2.2) is the best we can expect. This case is equivalent to f_ρ ∈ ℋ_{K~} when ℋ_{K~} is dense in L²_{ρ_X}; see [18]. Assumption (2.2) with 0 < β < 1 can be characterized in terms of interpolation spaces [7].
Suppose ρ_X is the Lebesgue measure on X and the target function f_ρ ∈ H^s, a Sobolev space of smoothness s. When a Gaussian kernel G_σ(x, y) = exp(−σ∥x − y∥²) with a fixed variance σ is used, a polynomial decay of 𝒟_q(λ) is impossible. However, Example 1 of [19] obtains a polynomial decay in the multikernel setting by allowing varying variances of the Gaussian kernels. This shows that multikernel learning can improve the approximation power and the learning ability. More interestingly, we give a concrete example of the impact of the multikernel class on the regularization error in Section 5 below. In particular, a proper multikernel class can be used to improve the regularization error when the regularity of f_ρ is rather high.
Next we define the truncated sample error as [figure omitted; refer to PDF] and the sample error as [figure omitted; refer to PDF]
The function f in the above equation can be chosen arbitrarily; however, only proper choices lead to good error estimates. A good choice is f = f_λ, where
f_λ := arg min_{f_w, w ∈ ℋ} { ∥f_w − f_ρ∥_ρ² + λ ∥w∥_{2,q}² }.    (2.5)
A useful tool for regularization schemes with sample-independent hypothesis spaces, such as an RKHS, is an error decomposition, which splits the total error ℰ(π(f_z)) − ℰ(f_ρ) into the sum of the truncated sample error and the regularization error, as follows.
Proposition 2.3.
Let f_λ be defined by (2.5); then there holds [figure omitted; refer to PDF]
Proof.
We can decompose ℰ(π(f_z)) − ℰ(f_ρ) into [figure omitted; refer to PDF] To bound the second term, note that, by the definition of f_z, ℰ_z(π(f_z)) + λ∥f_z∥_{2,q}² can be bounded by [figure omitted; refer to PDF] since |π(f)(x) − y| ≤ |f(x) − y| holds for any function f and any (x, y) ∈ Z. The conclusion follows by combining these two inequalities.
3. Estimation on Sample Error
We are now in a position to estimate the sample error 𝒮_z(λ, f_λ, T), which can be written as [figure omitted; refer to PDF]
𝒮_2 can be easily bounded by applying the following one-sided Bernstein-type probability inequality.
Lemma 3.1.
Let ξ be a random variable on a probability space Z with variance σ² satisfying |ξ − E(ξ)| ≤ M_ξ for some constant M_ξ. Then for any 0 < δ < 1, we have [figure omitted; refer to PDF]
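The explicit bound is omitted above; a commonly used one-sided Bernstein bound of this type states that, with probability at least 1 − δ, (1/m) ∑_{i=1}^m ξ(z_i) − E(ξ) ≤ 2M_ξ log(1/δ)/(3m) + √(2σ² log(1/δ)/m). The following Python sketch checks this standard form by Monte Carlo on a bounded toy variable; it illustrates the generic inequality, not necessarily the exact constants of Lemma 3.1.

import numpy as np

def bernstein_bound(M_xi, sigma2, m, delta):
    # one-sided Bernstein deviation bound:
    # (1/m) sum xi_i - E[xi] <= 2*M_xi*log(1/delta)/(3m) + sqrt(2*sigma2*log(1/delta)/m)
    log_term = np.log(1.0 / delta)
    return 2.0 * M_xi * log_term / (3.0 * m) + np.sqrt(2.0 * sigma2 * log_term / m)

rng = np.random.default_rng(1)
m, delta, trials = 200, 0.05, 5000
mean, var, M_xi = 0.5, 1.0 / 12.0, 0.5                # xi ~ Uniform[0, 1], |xi - E(xi)| <= 1/2
deviations = rng.uniform(size=(trials, m)).mean(axis=1) - mean
exceed_rate = np.mean(deviations > bernstein_bound(M_xi, var, m, delta))
print(exceed_rate, "<=", delta)                        # the empirical exceed rate stays below delta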
Proposition 3.2.
Define the random variable ξ_2(z) = (f_λ(x) − y)² − (f_ρ(x) − y)². For every 0 < δ < 1, with confidence at least 1 − δ/2, there holds [figure omitted; refer to PDF]
Proof.
From the definition of 𝒟_q(λ), we see that [figure omitted; refer to PDF] Note that f_λ(x) = ⟨w_λ, Ψ(x)⟩ for some w_λ = (w_λ^{(1)}, ..., w_λ^{(M)}) ∈ ℋ; by the Hölder inequality, we have, for any x ∈ X, [figure omitted; refer to PDF] where (1/q) + (1/q*) = 1. Using assumption (1.10), it follows that [figure omitted; refer to PDF]
Observe that [figure omitted; refer to PDF] Since |f_ρ(x)| ≤ T almost surely, we have [figure omitted; refer to PDF] Hence |ξ_2 − E(ξ_2)| ≤ M_{ξ_2} := 2c. Moreover, E(ξ_2²) equals [figure omitted; refer to PDF] which implies that σ²(ξ_2) ≤ E(ξ_2²) ≤ c 𝒟_q(λ). The desired result follows from Lemma 3.1.
Next we estimate the first term 𝒮_1. It is more difficult to handle because it involves the random function f_z, which varies with the sample z, and therefore requires controlling the complexity of the hypothesis class. For this purpose, we introduce the notion of empirical covering numbers, which often leads to sharp error estimates [16].
Definition 3.3.
Let (U, d) be a pseudometric space and S ⊂ U. For every ε > 0, the covering number 𝒩(S, ε, d) of S with respect to ε and d is defined as the minimal number of balls of radius ε whose union covers S, that is,
𝒩(S, ε, d) := min{ l ∈ ℕ : S ⊂ ⋃_{j=1}^l B(s_j, ε) for some {s_j}_{j=1}^l ⊂ U },    (3.10)
where B(s_j, ε) = {s ∈ U : d(s, s_j) ≤ ε} is a ball in U.
The l²-empirical covering number of a function set is defined by means of the normalized l²-metric d_2 on the Euclidean space ℝ^k, given by
d_2(a, b) = ( (1/k) ∑_{i=1}^k (a_i − b_i)² )^{1/2},  a = (a_i)_{i=1}^k, b = (b_i)_{i=1}^k ∈ ℝ^k.    (3.11)
Definition 3.4.
Let ℱ be a set of functions on X, x = (x_i)_{i=1}^k ⊂ X^k, and ℱ|_x = {(f(x_i))_{i=1}^k : f ∈ ℱ} ⊂ ℝ^k. Set 𝒩_{2,x}(ℱ, ε) = 𝒩(ℱ|_x, ε, d_2). The l²-empirical covering number of ℱ is defined by
𝒩_2(ℱ, ε) = sup_{k ∈ ℕ} sup_{x ∈ X^k} 𝒩_{2,x}(ℱ, ε),  ε > 0.    (3.12)
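As a toy illustration of Definitions 3.3 and 3.4, the following Python sketch evaluates a hypothetical one-parameter function class on k sample points and counts greedily chosen ε-balls in the normalized l²-metric d_2; since the greedy centers form an ε-cover, their number is an upper bound on the empirical covering number 𝒩_{2,x}(ℱ, ε).

import numpy as np

def d2(a, b):
    # normalized l2-metric on R^k
    return np.sqrt(np.mean((a - b) ** 2))

def greedy_cover_size(points, eps):
    # greedy centers form an eps-cover, so their count upper-bounds N(S, eps, d2)
    centers = []
    for p in points:
        if not any(d2(p, c) <= eps for c in centers):
            centers.append(p)
    return len(centers)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)                                # sample points x_1, ..., x_k
F_on_x = [np.sin(t * x) for t in np.linspace(0.0, 3.0, 200)]   # F = {sin(t*.)} restricted to x
for eps in (0.5, 0.2, 0.1):
    print(eps, greedy_cover_size(F_on_x, eps))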
Denote by B_R the ball of radius R > 0, B_R = {f ∈ ℋ_{K~} : ∥f∥_{K~} ≤ R}. We need the following capacity assumption on ℋ_{K~}.
Assumption 3.5.
There exist an exponent υ with 0 < υ < 2 and a constant C_{υ,K~} > 0 such that
log 𝒩_2(B_1, ε) ≤ C_{υ,K~} (1/ε)^υ  for all ε > 0,    (3.13)
where B_1 is the unit ball of ℋ_{K~} defined as above.
For any function f_w = ⟨w, Ψ(x)⟩ ∈ ℋ_{K~}, by the Hölder inequality, we have [figure omitted; refer to PDF] and it follows from (3.13) that [figure omitted; refer to PDF] where B_1^q is called the generalized unit ball of ℋ_{K~} associated with q, defined by [figure omitted; refer to PDF]
Note that for any function set ℱ ⊆ C(X), the empirical covering number 𝒩_{2,x}(ℱ, ε) is bounded by 𝒩(ℱ, ε), the (uniform) covering number of ℱ under the metric ∥·∥_∞, since d_2(f, g) ≤ ∥f − g∥_∞. It was shown in [20] that log 𝒩(B_1, ε) ≤ C_0 (1/ε)^s holds for some C_0 > 0 if K is C^{2n/s} on a subset of ℝ^n; hence log 𝒩_2(B_1, ε) ≤ C_0 (1/ε)^s also holds. In particular, s can be taken arbitrarily small for a C^∞ kernel (such as the Gaussian kernel). We now give a concrete example in ℝ^n to reveal the relationship between the regularity of a function class and its empirical covering number.
Example 3.6.
Let X be a bounded domain in ℝ^n and H^s the Sobolev space of index s. When s > n, the classical embedding theorem tells us that H^s(X) is an RKHS and its unit ball B_1 is embedded in a finite ball of the function space C^{s−(n/2)−ζ}(X) with bounded inclusion, where 0 < ζ < s − n. From the classical bounds for covering numbers of the unit ball of C^{s−(n/2)−ζ}(X), we see that [figure omitted; refer to PDF] Hence Assumption (3.13) holds with υ = n/(s − n/2 − ζ) < 2.
Our concentration estimate for the sample error term 𝒮_1 is based on the following concentration inequality, which can be found in [12].
Lemma 3.7.
Let ℱ be a class of measurable functions on Z. Assume that there are constants B, c > 0 and η ∈ [0, 1] such that ∥f∥_∞ ≤ B and Ef² ≤ c(Ef)^η for every f ∈ ℱ. If for some a > 0 and υ ∈ (0, 2), [figure omitted; refer to PDF] then there exists a constant c_υ such that for any t > 0, with confidence at least 1 − e^{−t}, there holds [figure omitted; refer to PDF] where [figure omitted; refer to PDF]
Denote by ℱ_R^q, with R > 0, the function set [figure omitted; refer to PDF]
Proposition 3.8.
If B_1^q satisfies the capacity condition (3.13) with some 0 < υ < 2, then for any δ ∈ (0, 1), with confidence 1 − δ/2, there holds [figure omitted; refer to PDF] with constant C_{υ,T} = 4 c_υ (C_{υ,K~} T²)^{υ/(2+υ)} M^{2υ/((2+υ)q*)}.
Proof.
Consider the set ℱ_R^q. Each function g ∈ ℱ_R^q can be expressed as g(z) = (y − π(f)(x))² − (y − f_ρ(x))² with some f ∈ B_R^q. Then Eg = ℰ(π(f)) − ℰ(f_ρ) = ∥π(f) − f_ρ∥_ρ² and (1/m) ∑_{i=1}^m g(z_i) = ℰ_z(π(f)) − ℰ_z(f_ρ). Note that [figure omitted; refer to PDF] Since |π(f)(x)| ≤ T and |f_ρ(x)| ≤ T for any x ∈ X, we see that, for any z ∈ Z, [figure omitted; refer to PDF] On the other hand, for any g_1, g_2 ∈ ℱ_R^q and any point z = (x, y), we have [figure omitted; refer to PDF] since the projection operator π is a contraction. It follows that [figure omitted; refer to PDF] It then follows from the capacity condition (3.15) that [figure omitted; refer to PDF]
Applying Lemma 3.7 with B = c = 4T², η = 1, and a = C_{υ,K~} (4TRM^{1/q*})^υ, we see that for any δ ∈ (0, 1), with confidence 1 − δ/2, there holds [figure omitted; refer to PDF] Besides, from the definition of f_z in (1.9), we have [figure omitted; refer to PDF] that is, ∥f_z∥_{2,q} ≤ T/√λ. Hence we can replace R with T/√λ, and the desired result follows.
4. Total Learning Rates
We are now in a position to derive the learning rates of the projected algorithm (1.9). The main result of this paper is presented in Theorem 4.1.
Following the error decomposition scheme of Proposition 2.3 and combining Propositions 3.2 and 3.8, we derive the following bound on the total error.
Theorem 4.1.
Suppose that B_1^q satisfies the capacity condition (3.13) with some 0 < υ < 2 and that 𝒟_q(λ) ≤ C_β λ^β. For any δ ∈ (0, 1), with confidence 1 − δ, there holds [figure omitted; refer to PDF] where C′ = 3κ²M^{2/q*} + 86T² + 1 + C_{υ,T}^{2υ/(2+υ)} + C_β and C_{υ,T} is defined as in Proposition 3.8.
Proof.
By Propositions 3.2 and 3.8, with confidence at least 1 − δ, (1/2)∥π(f_z) − f_ρ∥_ρ² can be bounded by [figure omitted; refer to PDF] First, we set [figure omitted; refer to PDF] which implies that λ = 1/m. On the other hand, from the assumption 𝒟_q(λ) ≤ C_β λ^β, we set [figure omitted; refer to PDF] Hence our assertion follows by taking λ = (1/m)^{min{2/(2β+(1+β)υ), 1}}.
Proof of Theorem 1.3.
When K~ is C^∞, condition (3.13) holds for arbitrarily small υ > 0. Moreover, f_ρ ∈ ℋ_{K~} implies that β = 1. The conclusion then follows easily from Theorem 4.1.
Our learning rates in Corollary 4.2 below are achieved under the regularity assumption that the regression function f_ρ lies in the range of L_{K~}^r for some r > 0. Given any kernel K, L_K is the integral operator on L²_{ρ_X} defined by
(L_K f)(x) = ∫_X K(x, t) f(t) dρ_X(t),  f ∈ L²_{ρ_X}.    (4.5)
The operator L_K is linear, compact, and positive, and can also be regarded as a self-adjoint operator on ℋ_K. Hence the fractional power operator L_K^r : L²_{ρ_X} → L²_{ρ_X} is well defined and given by
L_K^r f = ∑_k λ_k^r ⟨f, φ_k⟩_{L²_{ρ_X}} φ_k,    (4.6)
where {λ_k}_k are the eigenvalues of L_K arranged in decreasing order and {φ_k}_k are the corresponding eigenfunctions, which form an orthonormal basis of L²_{ρ_X}. In fact, the image of L_K^r is contained in ℋ_K if r ≥ 1/2. Thus the condition L_K^{-r} f_ρ ∈ L²_{ρ_X} means that f_ρ lies in the range of L_K^r, and it measures the regularity of the regression function.
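Numerically, the fractional power L_K^r can be realized through an eigendecomposition. The Python sketch below uses the empirical Gram matrix divided by m as a finite-sample stand-in for L_K (an assumption of this illustration, not part of the analysis above) and checks that the square of L_K^{1/2} recovers L_K.

import numpy as np

def fractional_operator_power(K_gram, r):
    # L_K^r for the empirical operator L_K ~ K_gram / m, via eigendecomposition
    m = K_gram.shape[0]
    evals, evecs = np.linalg.eigh(K_gram / m)
    evals = np.clip(evals, 0.0, None)                  # remove tiny negative round-off
    return evecs @ np.diag(evals ** r) @ evecs.T

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 1))
K = np.exp(-(X - X.T) ** 2)                            # Gaussian Gram matrix on the sample
L_half = fractional_operator_power(K, 0.5)
print(np.allclose(L_half @ L_half, K / K.shape[0], atol=1e-8))   # True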
Corollary 4.2.
Suppose that B_1^q satisfies the capacity condition (3.13) with some 0 < υ < 2 and that L_{K~}^{-r} f_ρ ∈ L²_{ρ_X} (0 < r ≤ 1). For any 0 < δ < 1, with confidence 1 − δ, there holds [figure omitted; refer to PDF] where C~ is some constant independent of m or δ.
Proof.
Recall a result from [18]: if L_{K~}^{-r} f_ρ ∈ L²_{ρ_X} (0 < r ≤ 1/2), there holds [figure omitted; refer to PDF] If r ≥ 1/2, then f_ρ ∈ ℋ_{K~} as mentioned above, and we may take f = f_ρ in the definition of the regularization error, which gives [figure omitted; refer to PDF] Hence for any r ∈ [0, 1], we have [figure omitted; refer to PDF]
On the other hand, by Lemma 5.1 below, we have [figure omitted; refer to PDF] where C* is some constant and q ∈ [1, 2]. The conclusion follows immediately from Theorem 4.1.
Let us compare our learning rates with the existing results.
In [10], a uniform covering number technique was used to derive the expected error of the learning scheme (1.1) with Ω(f_w) = ∥w∥_K². Suppose all the kernels {K_i} coincide with a single kernel K and L_K^{-r} f_ρ ∈ L²_{ρ_X} for some 0 < r ≤ 1/2. Then for any 0 < ζ < 1/(1+υ) and any 0 < δ < 1, with confidence 1 − δ, [figure omitted; refer to PDF] Clearly, the learning rates derived from Corollary 4.2 are better than those in [10], since 2rζ < 2r/(1+υ) ≤ 4r/(4r+(1+2r)υ).
In [21], an operator monotonicity technique was used to improve the kernel-independent error bounds of [17]. If L_K^{-r} f_ρ ∈ L²_{ρ_X} for some 0 < r ≤ 1, then for any 0 < δ < 1, with confidence 1 − δ, there holds [figure omitted; refer to PDF]
When r ≥ 1/2 and υ ≤ (2 − r)/(3r), the learning rate given by Corollary 4.2 is better than the above result.
As for empirical risk minimization (ERM), classical results give error bounds between the empirical target function and the regression function. In particular, learning rates of the type m^{−ε} with ε arbitrarily close to 1 can be achieved by ERM schemes (see [15]). However, the ERM setting differs from Tikhonov regularization: how to choose the regularization parameter λ = λ(m), depending on the sample size m, is the essential difficulty for the regularization scheme, even when f_ρ lies in ℋ_K. On the other hand, our result is more general than that of [15], since the case f_ρ ∉ ℋ_K (r < 1/2) is also covered.
5. Discussion on Regularization Error
By our assumptions on the M kernels {K_j}_{j=1}^M, the space ℋ_{K~} is an RKHS generated by the Mercer kernel K~. There are several standard results on the regularization error 𝒟_2(λ) in learning theory (see [17]). Next we establish a tight connection between 𝒟_2(λ) and 𝒟_q(λ) for 1 ≤ q ≤ 2.
Lemma 5.1.
Let ℋ_{K~} be a separable RKHS over X associated with a bounded measurable kernel, let ρ be a distribution on X × [−T, T], and let 1 ≤ q ≤ 2. If there exist constants C > 0 and β > 0 such that 𝒟_2(λ) ≤ Cλ^β, then for all λ > 0 we have [figure omitted; refer to PDF]
Proof.
Let f~_λ be a function satisfying [figure omitted; refer to PDF] Then λ∥f~_λ∥_{2,2}² ≤ Cλ^β. Write f~_λ(x) = ⟨w~, Ψ(x)⟩ with some w~ = (w~^{(1)}, ..., w~^{(M)}); by the Hölder inequality, we have [figure omitted; refer to PDF] Then we obtain [figure omitted; refer to PDF]
In other words, if 𝒟_2(λ) has a polynomial behavior in λ, then this behavior completely determines the behavior of all 𝒟_q(λ). Thus it suffices to assume that the standard (q = 2) approximation error function satisfies (2.2).
From the statistical effective dimension point of view, we now discuss the impact of the multikernel class ℋ_{K~} on the approximation error ∥f_λ − f_ρ∥_{L²_{ρ_X}}. To estimate this error, note that the regularizing function of 𝒟_2(λ) exists, is unique, and is given by [7]
f_λ = (λI + L_{K~})^{−1} L_{K~} f_ρ.    (5.5)
For simplicity, let M = 2 and take a Mercer kernel K_o as the original one; by the classical Mercer theorem, K_o can be expressed as K_o(x, y) = ∑_{k=1}^∞ λ_k φ_k(x) φ_k(y). The second kernel we take is K_o^N = ∑_{k=1}^∞ λ_k^N φ_k(x) φ_k(y) (2 ≤ N ∈ ℕ). In this case, K~ = K_o + K_o^N = ∑_{k=1}^∞ (λ_k + λ_k^N) φ_k(x) φ_k(y). By the fact that f_λ − f_ρ = λ(λI + L_{K~})^{−1} f_ρ and the assumption L_{K_o}^{−r} f_ρ ∈ L²_{ρ_X}, we have [figure omitted; refer to PDF]
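Since f_λ − f_ρ = λ(λI + L_{K~})^{−1} f_ρ acts diagonally in the Mercer eigenbasis, the squared approximation error equals ∑_k (λ/(λ + λ_k + λ_k^N))² c_k², where c_k are the expansion coefficients of f_ρ. The Python sketch below evaluates this quantity for assumed eigenvalue and coefficient decays (toy choices, not taken from the paper) and compares K_o alone with the mixture K~ = K_o + K_o^N.

import numpy as np

def spectral_approx_error(lam, eigs, coefs):
    # ||f_lambda - f_rho||^2 = sum_k (lam / (lam + eig_k))^2 * c_k^2 in the eigenbasis
    return np.sum((lam / (lam + eigs)) ** 2 * coefs ** 2)

k = np.arange(1, 5001, dtype=float)
eigs_o = k ** -2.0                          # assumed eigenvalues lambda_k of K_o
N = 3
eigs_mix = eigs_o + eigs_o ** N             # eigenvalues lambda_k + lambda_k^N of K_o + K_o^N
coefs = eigs_o ** 1.5                       # assumed coefficients c_k of a highly regular f_rho
for lam in (1e-1, 1e-2, 1e-3):
    print(lam, spectral_approx_error(lam, eigs_o, coefs),
          spectral_approx_error(lam, eigs_mix, coefs))

Because the mixture's eigenvalues dominate those of K_o, this particular quantity can only decrease when the second kernel is added.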
Let us compare the multikernel class regularization with Tikhonov regularization in ℋ_{K_o} when only the Mercer kernel K_o is employed. Define the saturation index as the maximal r such that the approximation error achieves its fastest decay rate under the condition L_{K_o}^{−r} f_ρ ∈ L²_{ρ_X}. Then (5.6) shows that the saturation index for the multikernel class regularization is N, while it is 1 for Tikhonov regularization in ℋ_{K_o}, as shown in [17].
In this case, our analysis suggests using an alternative kernel with faster eigenvalue decay when the spectral coefficients of the target function decay faster: for example, using K_o^N instead of K_o. This has a dimension reduction effect: we effectively project the data onto its principal components. The intuition is also quite clear: if the effective dimension of the target function is small (its spectral coefficients decay fast), then we should project the data onto those dimensions and suppress the remaining noisy dimensions, which corresponds to fast kernel eigenvalue decay (large N). A similar idea, in the framework of semisupervised learning, appears in spectral kernel design methods [22, 23].
In general, for the sample error there exist rates of convergence that hold independently of the underlying distribution ρ. This is important, as it tells us that we can give convergence guarantees no matter which kernel is used, even when we do not know the underlying distribution; indeed, this is common in the statistical analysis of machine learning algorithms (see [24]). This decay is usually fast enough for practical purposes when sufficiently many samples are available. For the approximation error, however, it is impossible to give rates of convergence that hold for all probability distributions ρ. Hence what ultimately determines the learning accuracy is the approximation error. In the kernel regression setup, this is determined by the choice of the kernel, which underscores the importance of learning kernels [4] and of constructing refined kernels [25].
6. Further Study
In this last section, we discuss sparsity in the case of the square-loss regularization functional (1.1) with the regularizer Ω(f) = ∥f∥_K² in an RKHS. An explicit expression for this functional can be derived from [4], which in turn provides an improvement and simplification of our algorithm (1.4).
Lemma 6.1.
For any kernel K and positive constant λ, we have
min_{f ∈ ℋ_K} { (1/m) ∑_{i=1}^m (y_i − f(x_i))² + λ ∥f∥_K² } = λ ⟨y, (K(x) + λmI)^{−1} y⟩,    (6.1)
where y = (y_1, ..., y_m)^T, K(x) denotes the m × m Gram matrix (K(x_i, x_j) : i, j = 1, ..., m), and ⟨·,·⟩ denotes the standard inner product in Euclidean space.
According to Lemma 6.1, the least-squares algorithm (1.4) can be rewritten as a one-layer minimization problem [figure omitted; refer to PDF]
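The identity in Lemma 6.1 is the classical closed form for the inner single-kernel least-square problem and is easy to verify numerically. The Python sketch below (hypothetical data and a single Gaussian kernel, chosen only for illustration) compares the minimized value computed from the representer-theorem solution with the closed form λ⟨y, (K(x) + λmI)^{−1} y⟩.

import numpy as np

rng = np.random.default_rng(4)
m, lam = 25, 0.1
X = rng.normal(size=(m, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m)
K = np.exp(-(X - X.T) ** 2)                            # Gram matrix K(x)

# minimizer f = sum_j c_j K(., x_j) with c = (K + lam*m*I)^{-1} y (representer theorem)
c = np.linalg.solve(K + lam * m * np.eye(m), y)
value_direct = np.mean((K @ c - y) ** 2) + lam * c @ K @ c

# closed form from Lemma 6.1
value_closed = lam * y @ np.linalg.solve(K + lam * m * np.eye(m), y)
print(np.isclose(value_direct, value_closed))          # True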
We assume that f_ρ ∈ ℋ_{θ_0} for some θ_0 ∈ ℝ^M. Define J_θ := supp(θ) = {k : θ_k ≠ 0}. We say that f_ρ is sparse relative to θ_0 if the cardinality of J_{θ_0} is far smaller than M.
For p ∈ [1, p_M] (with p_M close to 1), it would be interesting to show that the solution θ_z of (6.2) is approximately sparse along the pattern of θ_0, in the sense that ∑_{k ∉ J_{θ_0}} (θ_z)_k is very small with very high probability. A refined analysis of lp-regularized methods was carried out by Koltchinskii [26] for combinations of M basis functions, taking into account the soft sparsity pattern of the Bayes function and establishing several oracle inequalities in the statistical sense. Extending these ideas to the kernel learning setting would be of great significance, because it would provide theoretical support showing that lp-norm MKL can automatically select good kernels that coincide with the underlying right kernels.
Acknowledgments
The authors would like to thank the two anonymous referees for their valuable comments and suggestions, which have substantially improved this paper. The research of S.-G. Lv is supported in part by the 211 Project for the Southwestern University of Finance and Economics, China (Project no. 211QN2011028), as well as by the Annual Cultivation of the Fundamental Research Funds for the Central Universities (Project no. JBK120940).
[1] C. Cortes, M. Mohri, A. Rostamizadeh, "Generalization bounds for learning kernels," in Proceedings of the 27th International Conference on Machine Learning (ICML '10), pp. 247-254, June 2010.
[2] M. Kloft, U. Brefeld, S. Sonnenburg, A. Zien, " lp -norm multiple kernel learning," Journal of Machine Learning Research , vol. 12, pp. 953-997, 2011.
[3] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. I. Jordan, "Learning the kernel matrix with semidefinite programming," Journal of Machine Learning Research , vol. 5, pp. 27-72, 2004.
[4] C. A. Micchelli, M. Pontil, "Learning the kernel function via regularization," Journal of Machine Learning Research , vol. 6, pp. 1099-1125, 2005.
[5] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society , vol. 68, pp. 337-404, 1950.
[6] O. Bousquet, A. Elisseeff, "Stability and generalization," Journal of Machine Learning Research , vol. 2, no. 3, pp. 499-526, 2002.
[7] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1-49, 2002.
[8] Y. K. Zhu and H. W. Sun, "Consistency analysis of spectral regularization algorithms," Abstract and Applied Analysis, vol. 2012, 2012.
[9] Y. Ying, D.-X. Zhou, "Learnability of Gaussians with flexible variances," Journal of Machine Learning Research , vol. 8, pp. 249-276, 2007.
[10] Q. Wu, Y. Ying, D.-X. Zhou, "Learning rates of least-square regularized regression," Foundations of Computational Mathematics , vol. 6, no. 2, pp. 171-192, 2006.
[11] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, New York, NY, USA, 2001.
[12] Q. Wu, Y. Ying, D.-X. Zhou, "Multi-kernel regularized classifiers," Journal of Complexity , vol. 23, no. 1, pp. 108-134, 2007.
[13] M. Yuan, Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society B , vol. 68, no. 1, pp. 49-67, 2006.
[14] A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet, "More efficiency in multiple kernel learning," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 775-782, Corvallis, Ore, USA, June 2007.
[15] M. Kloft and G. Blanchard, "The local Rademacher complexity of lp-norm multiple kernel learning," in Advances in Neural Information Processing Systems (NIPS '11), pp. 2438-2446, MIT Press, 2011.
[16] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[17] S. Smale, D.-X. Zhou, "Learning theory estimates via integral operators and their approximations," Constructive Approximation , vol. 26, no. 2, pp. 153-172, 2007.
[18] S. Smale, D.-X. Zhou, "Shannon sampling. II. Connections to learning theory," Applied and Computational Harmonic Analysis , vol. 19, no. 3, pp. 285-302, 2005.
[19] C. A. Micchelli, M. Pontil, Q. Wu, and D.-X. Zhou, "Error bounds for learning the kernel," Research Note 05-09, University College London, London, UK, 2005.
[20] D.-X. Zhou, "Capacity of reproducing kernel spaces in learning theory," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1743-1752, 2003.
[21] H. Sun, Q. Wu, "A note on application of integral operator in learning theory," Applied and Computational Harmonic Analysis , vol. 26, no. 3, pp. 416-421, 2009.
[22] O. Chapelle, J. Weston, and B. Schölkopf, "Cluster kernels for semi-supervised learning," in Advances in Neural Information Processing Systems 15 (NIPS '03), pp. 585-592, MIT Press, 2003.
[23] R. Johnson and T. Zhang, "Graph-based semi-supervised learning and spectral kernel design," IEEE Transactions on Information Theory, vol. 54, no. 1, pp. 275-288, 2008.
[24] U. von Luxburg and B. Schölkopf, "Statistical learning theory: models, concepts, and results."
[25] Y. Xu, H. Zhang, "Refinement of reproducing kernels," Journal of Machine Learning Research , vol. 10, pp. 107-140, 2009.
[26] V. Koltchinskii, "Sparsity in penalized empirical risk minimization," Annales de l'Institut Henri Poincaré Probabilités et Statistiques , vol. 45, no. 1, pp. 7-57, 2009.
Copyright © 2012 Shao-Gao Lv and Jin-De Zhu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
The problem of learning the kernel function via linear combinations of multiple kernels has recently attracted considerable attention in machine learning. In particular, by imposing an lp-norm penalty on the kernel combination coefficients, multiple kernel learning (MKL) has proved useful and effective for theoretical analysis and practical applications (Kloft et al., 2009, 2011). In this paper, we present a theoretical analysis of the approximation error and learning ability of lp-norm MKL. Our analysis yields explicit learning rates for lp-norm MKL and demonstrates some notable advantages compared with traditional kernel-based learning algorithms where the kernel is fixed.





