Shao-Gao Lv 1 and Jin-De Zhu 2
Recommended by Gerd Teschke
1 Statistics School, Southwestern University of Finance and Economics, Chengdu 611130, China
2 The 2nd Geological Party of the Bureau of Geology and Mineral Resources of Henan, Jiaozuo 450000, China
Received 24 April 2012; Accepted 10 July 2012
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
1.1. Overview of Multiple Kernel Learning
Kernel methods such as support vector machines (SVMs) have been extensively applied to supervised learning tasks such as classification and regression. The performance of a kernel machine largely depends on the data representation induced by the choice of kernel function. Hence, one central issue in kernel methods is the problem of kernel selection; a great many approaches to selecting a suitable kernel have been studied in the literature; see [1-4] and the references therein.
We begin by reviewing the classical supervised learning setup. Let (X, d) be a compact metric space and Y ⊆ ℝ. Given a labeled sample z = {(x_i, y_i)}_{i=1}^m ⊆ Z := X × Y, drawn i.i.d. according to an unknown distribution ρ supported on Z, the goal is to estimate a real-valued function f_z, depending on the sample, that generalizes well on new and unseen data. A widely used approach to estimating a function from empirical data consists in minimizing a regularization functional over a Hilbert space ℋ of real-valued functions X → ℝ. Typically, a regularization scheme estimates f as a minimizer of the functional
f_z := arg min_{f ∈ ℋ} { ℰ_z(f) + λ Ω(f) },    (1.1)
where ℰ_z(f) = (1/m) ∑_{i=1}^m V(f(x_i), y_i) is the empirical risk of the hypothesis f, measured by a nonnegative loss function V : ℝ × Y → ℝ_+. In addition, Ω : ℋ → ℝ is a regularizer and λ > 0 is a trade-off regularization parameter.
In this paper, we assume that ℋ is a reproducing kernel Hilbert space (RKHS) ℋ_K with kernel K; see [5]. Every kernel K corresponds to a feature mapping Ψ_K : X → ℋ_K satisfying K(x, y) = ⟨Ψ_K(x), Ψ_K(y)⟩_K, and each element of ℋ_K has the form
f_w(x) = ⟨w, Ψ_K(x)⟩_K,  w ∈ ℋ_K.    (1.2)
When the regularizer takes the form Ω(f) = ∥f∥_K², the resulting scheme has been studied from many perspectives, including statistics, optimal recovery, and machine learning; see [6-9] and the references therein. Regularization in an RKHS has a number of attractive features, including the availability of effective error bounds and of stability analyses with respect to perturbations of the data (see Cucker and Smale [7], Wu et al. [10], and Bousquet and Elisseeff [6]). Moreover, by the representer theorem, the optimization problem (1.1) over an RKHS reduces to a search in a finite-dimensional space: every minimizer can be written as f_z(·) = ∑_{i=1}^m c_i K(x_i, ·) for some coefficients c_1, ..., c_m ∈ ℝ. Although this result is simple to prove, it shows that the variational problem (1.1) is computationally tractable.
Because of their simplicity and generality, kernels and their associated RKHSs play an increasingly important role in machine learning, pattern recognition, and artificial intelligence. When the kernel is fixed, an immediate concern is the choice of the regularization parameter λ, which is typically handled by cross-validation or generalized cross-validation [11]. However, the performance of kernel methods also relies critically on the choice of the kernel function itself. A natural question is how to choose the optimal kernel from a collection of candidate kernels.
Kernel learning can range from selecting the width parameter of a Gaussian kernel [9, 12] to obtaining an optimal linear combination of a finite set of candidate kernels. The latter is often referred to as multiple kernel learning (MKL) in machine learning and as the nonparametric group Lasso in statistics [13]. Lanckriet et al. [3] pioneered work on MKL and proposed a semidefinite programming approach to automatically learning a linear combination of candidate kernels in the SVM setting. To improve computational efficiency, the multikernel class was further restricted to convex combinations of kernels [2, 14, 15]. Most kernel learning algorithms consider linear kernel mixtures K_θ = ∑_k θ_k K_k (θ_k ≥ 0) built from prescribed kernels K_1, ..., K_M. For notational simplicity, we frequently write Ψ_k instead of the standard feature map Ψ_{K_k}. Compared with (1.1), the primal model for learning with multiple kernels is extended to [figure omitted; refer to PDF]
In this paper, we mainly focus on lp-norm MKL, which minimizes the regularized empirical risk with respect to the optimal kernel mixture ∑_{k=1}^M θ_k K_k, together with an lp-regularizer on θ to avoid overfitting. This leads to the following optimization problem: [figure omitted; refer to PDF] This scheme was introduced in [2], and the existence of its minimum was discussed in [4].
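To make these objects concrete, the following Python sketch evaluates a regularized empirical risk of the same flavor: least-square loss, plus λ times the RKHS norm of f under the mixture K_θ, plus a μ-weighted lp penalty on θ. The toy data, the Gaussian candidate kernels, and the expansion-coefficient parametrization of f are assumptions of the illustration only; this is not the authors' algorithm.

import numpy as np

def gaussian_gram(X, width):
    # Gram matrix of a Gaussian kernel with the given width (toy candidate kernel)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * width ** 2))

def mkl_objective(alpha, theta, grams, y, lam, mu, p):
    # f(x_i) = sum_j alpha_j K_theta(x_j, x_i) with kernel mixture K_theta = sum_k theta_k K_k
    K_theta = sum(t * K for t, K in zip(theta, grams))
    f_vals = K_theta @ alpha
    empirical_risk = np.mean((f_vals - y) ** 2)      # least-square empirical risk
    rkhs_norm_sq = alpha @ K_theta @ alpha           # ||f||_{K_theta}^2
    lp_penalty = np.sum(np.abs(theta) ** p) ** (1.0 / p)
    return empirical_risk + lam * rkhs_norm_sq + mu * lp_penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                             # m = 20 inputs in R^2
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
grams = [gaussian_gram(X, w) for w in (0.5, 1.0, 2.0)]   # M = 3 candidate kernels
theta = np.array([0.5, 0.3, 0.2])
alpha = rng.normal(size=20)
print(mkl_objective(alpha, theta, grams, y, lam=0.1, mu=0.1, p=2))

An actual solver would alternately optimize f for fixed θ and θ for fixed f, as in the interleaved strategies of [2]; the sketch only evaluates the objective at a given point.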
This optimization problem subsumes state-of-the-art approaches to multiple kernel learning, covering sparse and nonsparse MKL through arbitrary lp-norm regularization (1 ≤ p ≤ ∞) on the mixing coefficients, as well as the incorporation of prior knowledge through nonisotropic regularizers. Kloft et al. [2] developed two efficient interleaved optimization strategies for lp-norm multiple kernel learning; this interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on real-world problems from computational biology. An analysis of this model based on Rademacher complexities was first developed by Cortes et al. [1]. Later, improved rates of convergence were derived from the theory of local Rademacher complexities [15]. However, the estimate on local Rademacher complexities for 1 ≤ p < 2 strictly depends on a no-correlation assumption among the M different features, which is too strong a condition both in theory and in practice. In this paper, we employ the notion of empirical covering numbers to present a theoretical analysis of the generalization error. Besides not requiring the no-correlation condition, the empirical covering number is a tight upper bound for local Rademacher complexities [16] and is independent of the underlying distribution. We will see that satisfactory learning rates are established when the regularization parameter is chosen appropriately. The interaction between the sample error and the approximation error plays an important role in our analysis, and our methodology mainly depends on the complexity of the hypothesis class, measured by empirical covering numbers, and on the regularity of the target function.
It should be pointed out that the Tikhonov regularization in (1.4) has two regularization parameters (λ, μ), which may be hard to tune in practice. Fortunately, an alternative approach has been studied by Rakotomamonjy et al. [14] and Kloft et al. [2]. More precisely, this approach imposes ∥θ∥_p ≤ 1 as an additional constraint in the optimization problem rather than as a penalty. By substituting w_k for θ w_k, they arrive at the following problem: [figure omitted; refer to PDF]
1.2. Algorithm and Main Consequence
The following lemma (see [4]) indicates that the above multikernel class can equivalently be represented as a block-norm regularized linear class in the product Hilbert space ℋ = ℋ_{K_1} × ⋯ × ℋ_{K_M}.
Lemma 1.1.
If p > 0, q = 1 + (1/p), and {a_j, j ∈ ℕ_n} ⊆ ℝ, then [figure omitted; refer to PDF] and equality occurs for ∑_{j ∈ ℕ_n} |a_j| > 0 at [figure omitted; refer to PDF]
Hence, Lemma 1.1 can be applied to the feature mapping Ψ : x ∈ X → (Ψ_1(x), ..., Ψ_M(x)) ∈ ℋ_{K~} associated with a kernel K~; the class of functions defined above coincides with [figure omitted; refer to PDF] when p ∈ [1, ∞] and q ∈ [1, 2], which follows from q = 2p/(p+1). The l_{2,q}-norm is defined here as ∥w∥_{2,q} = (∑_{j=1}^M ∥w_j∥_{K_j}^q)^{1/q}. For simplicity, we write ∥f_w∥_{2,q} = ∥w∥_{2,q}. Clearly, the complexity of the class (1.8) is greater than that of a class based on a single kernel only; it provides greater learning ability, while the computational cost increases accordingly. The sample complexity of the above hypothesis space has been studied by Cortes et al. [1] and Kloft and Blanchard [15]. Thus, the primal MKL optimization problem (1.5) is equivalent to the following regularization scheme, which is the primary object of investigation in this paper:
f_z := arg min_{f_w, w ∈ ℋ} { ℰ_z(f_w) + λ ∥w∥_{2,q}² },  where f_w(x) = ⟨w, Ψ(x)⟩ = ∑_{k=1}^M ⟨w_k, Ψ_k(x)⟩_{K_k}.    (1.9)
Here we use the symbol "min" instead of "inf," since (1.4) is equivalent to (1.9) and the solution of (1.4) exists and is unique. Note that the above algorithm is a standard regularized empirical risk minimization; this implies that the lp-norm multiple kernel learning scheme can be free of overfitting, a phenomenon which occurs when the empirical error is zero but the expected error is far from zero.
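For illustration, a small Python sketch of the block l_{2,q}-norm appearing in (1.8)-(1.9), with plain Euclidean norms standing in for the RKHS norms ∥w_j∥_{K_j} (a simplifying assumption of the example):

import numpy as np

def block_norm_2q(blocks, q):
    # l_{2,q} block norm: (sum_j ||w_j||_2^q)^(1/q)
    return np.sum([np.linalg.norm(w) ** q for w in blocks]) ** (1.0 / q)

# q = 2p/(p+1): p = 1 gives q = 1 (the sparsity-inducing l_{2,1} norm),
# while p -> infinity gives q -> 2 (the uniform l_{2,2} norm).
w = [np.array([1.0, 2.0]), np.array([0.0, 0.0]), np.array([3.0])]
for p in (1.0, 2.0, 1e6):
    q = 2 * p / (p + 1)
    print(p, q, block_norm_2q(w, q))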
In the following, we assume that {K_j}_{j=1,...,M} is uniformly bounded, that is,
κ := max_{1 ≤ j ≤ M} sup_{x ∈ X} √(K_j(x, x)) < ∞.    (1.10)
Also suppose that each K_j is continuous. In other words, each K_j is a Mercer kernel with bound κ; we refer to [17] for more properties and discussions of Mercer kernels.
In this paper, we only focus on the least-square loss V(f(x), y) = (f(x) − y)². Accordingly, the target function is the regression function
f_ρ(x) = ∫_Y y dρ(y | x),    (1.11)
where ρ(· | x) denotes the conditional distribution of ρ at x. Throughout this paper we assume that ρ(· | x) is supported on [−T, T]; it follows that |f_ρ(x)| ≤ T for x ∈ X almost surely. Since the learner f_z may be much larger than f_ρ, it is natural to apply a projection operator to f_z, which was introduced into learning algorithms to improve learning rates.
Definition 1.2.
The projection operator π is defined on the space of measurable functions f : X → ℝ as
π(f)(x) = T if f(x) > T;  π(f)(x) = f(x) if −T ≤ f(x) ≤ T;  π(f)(x) = −T if f(x) < −T,    (1.12)
where T > 0 is called the projection level.
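In code, the projection operator is just a truncation of the predicted values to [−T, T]; a one-line Python sketch:

import numpy as np

def project(f_values, T):
    # pi(f): clip predictions to the interval [-T, T]
    return np.clip(f_values, -T, T)

print(project(np.array([-3.2, 0.4, 5.0]), T=1.0))    # [-1.   0.4  1. ]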
The target of the error analysis is to understand how π(f_z) approximates the regression function f_ρ. More precisely, we aim to estimate the excess generalization error
ℰ(π(f_z)) − ℰ(f_ρ)    (1.13)
for the lp-norm MKL algorithm (1.4), where ℰ(f) = E(f(x) − y)² denotes the expected error of f.
To convey some ideas of our error analysis, we first state learning rates for (1.4) in the special case when f_ρ ∈ ℋ_{K~} and K~ is C^∞ on X ⊂ ℝ^n.
Theorem 1.3.
Let f_z be defined by (1.9). Assume K~ is C^∞ on X ⊂ ℝ^n and f_ρ ∈ ℋ_{K~}. For any 0 < δ < 1 and ε > 0, with confidence 1 − δ, there holds [figure omitted; refer to PDF] where C~ is some constant independent of m or δ.
Theorem 1.3 can be viewed as a corollary of our main result presented in Section 4. The rate can be made arbitrarily close to 𝒪(m^{−1}) by choosing ε small enough, which is the best convergence rate in the learning theory literature.
2. Key Error Analysis
Our main result concerns learning rates for (1.4), stated under conditions on the approximation ability of ℋ_{K~} with respect to f_ρ and on the capacity of ℋ_{K~}.
The approximation ability of the hypothesis space ℋ_{K~} with respect to f_ρ in the space L²_{ρ_X} is measured by the regularization error.
Definition 2.1.
The regularization error of the triple (ℋ_{K~}, f_ρ, ρ_X) is defined as
𝒟_q(λ) := min_{f_w, w ∈ ℋ} { ∥f_w − f_ρ∥²_{L²_{ρ_X}} + λ ∥w∥_{2,q}² }.    (2.1)
We will assume that, for some 0 < β ≤ 1 and C_β > 0,
𝒟_q(λ) ≤ C_β λ^β  for all λ > 0.    (2.2)
Remark 2.2.
Our assumption implies that, when f_ρ is approximated by functions in ℋ_{K~}, the regularization error 𝒟_q(λ) tends to zero with a polynomial rate of decay as λ goes to zero. Note [7] that 𝒟_q(λ) = o(λ) would imply f_ρ = 0, so β = 1 in (2.2) is the best we can expect. This case is equivalent to f_ρ ∈ ℋ_{K~} when ℋ_{K~} is dense in L²_{ρ_X}; see [18]. Assumption (2.2) with 0 < β < 1 can be characterized in terms of interpolation spaces [7].
Suppose ρ_X is the Lebesgue measure on X and the target function f_ρ ∈ H^s, a Sobolev space of smoothness s. When a Gaussian kernel G_σ(x, y) = exp(−σ∥x − y∥²) with a fixed variance σ is used, a polynomial decay of 𝒟_q(λ) is impossible. However, Example 1 of [19] obtains a polynomial decay in the multikernel setting by allowing varying variances of the Gaussian kernels. This shows that multikernel learning can improve the approximation power and the learning ability. More interestingly, we give a concrete example of the impact of the multikernel class on the regularization error in Section 5 below. In particular, a proper multikernel class can be used to improve the regularization error when the regularity of f_ρ is rather high.
Next we define the truncated sample error as [figure omitted; refer to PDF] and the sample error as [figure omitted; refer to PDF]
The function f in the above equation can be chosen arbitrarily; however, only proper choices lead to good error estimates. A good choice is f = f_λ, where
f_λ := arg min_{f_w, w ∈ ℋ} { ∥f_w − f_ρ∥_ρ² + λ ∥w∥_{2,q}² }.    (2.5)
A useful tool for regularization schemes with sample-independent hypothesis spaces, such as an RKHS, is an error decomposition, which splits the total error ℰ(π(f_z)) − ℰ(f_ρ) into the sum of the truncated sample error and the regularization error, as follows.
Proposition 2.3.
Let f_λ be defined by (2.5); then there holds [figure omitted; refer to PDF]
Proof.
We can decompose ℰ(π(f_z)) − ℰ(f_ρ) into [figure omitted; refer to PDF] To bound the second term, note that, by the definition of f_z, ℰ_z(π(f_z)) + λ∥f_z∥_{2,q}² can be bounded by [figure omitted; refer to PDF] since |π(f)(x) − y| ≤ |f(x) − y| holds for any function f and any (x, y) ∈ Z. The conclusion follows by combining these two inequalities.
3. Estimation on Sample Error
We are now in a position to estimate the sample error 𝒮_z(λ, f_λ, T), which can be written as [figure omitted; refer to PDF]
𝒮_2 can be easily bounded by applying the following one-sided Bernstein-type probability inequality.
Lemma 3.1.
Let ξ be a random variable on a probability space Z with variance σ² satisfying |ξ − E(ξ)| ≤ M_ξ for some constant M_ξ. Then for any 0 < δ < 1, we have [figure omitted; refer to PDF]
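The explicit bound is omitted above; a commonly used one-sided Bernstein bound of this type states that, with probability at least 1 − δ, (1/m) ∑_{i=1}^m ξ(z_i) − E(ξ) ≤ 2M_ξ log(1/δ)/(3m) + √(2σ² log(1/δ)/m). The following Python sketch checks this standard form by Monte Carlo on a bounded toy variable; it illustrates the generic inequality, not necessarily the exact constants of Lemma 3.1.

import numpy as np

def bernstein_bound(M_xi, sigma2, m, delta):
    # one-sided Bernstein deviation bound:
    # (1/m) sum xi_i - E[xi] <= 2*M_xi*log(1/delta)/(3m) + sqrt(2*sigma2*log(1/delta)/m)
    log_term = np.log(1.0 / delta)
    return 2.0 * M_xi * log_term / (3.0 * m) + np.sqrt(2.0 * sigma2 * log_term / m)

rng = np.random.default_rng(1)
m, delta, trials = 200, 0.05, 5000
mean, var, M_xi = 0.5, 1.0 / 12.0, 0.5                # xi ~ Uniform[0, 1], |xi - E(xi)| <= 1/2
deviations = rng.uniform(size=(trials, m)).mean(axis=1) - mean
exceed_rate = np.mean(deviations > bernstein_bound(M_xi, var, m, delta))
print(exceed_rate, "<=", delta)                        # the empirical exceed rate stays below delta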
Proposition 3.2.
Define the random variable ξ_2(z) = (f_λ(x) − y)² − (f_ρ(x) − y)². For every 0 < δ < 1, with confidence at least 1 − δ/2, there holds [figure omitted; refer to PDF]
Proof.
From the definition of 𝒟_q(λ), we see that [figure omitted; refer to PDF] Note that f_λ(x) = ⟨w_λ, Ψ(x)⟩ for some w_λ = (w_λ^{(1)}, ..., w_λ^{(M)}) ∈ ℋ; by the Hölder inequality, we have, for any x ∈ X, [figure omitted; refer to PDF] where (1/q) + (1/q*) = 1. Using assumption (1.10), it follows that [figure omitted; refer to PDF]
Observe that [figure omitted; refer to PDF] Since |f_ρ(x)| ≤ T almost surely, we have [figure omitted; refer to PDF] Hence |ξ_2 − E(ξ_2)| ≤ M_{ξ_2} := 2c. Moreover, E(ξ_2²) equals [figure omitted; refer to PDF] which implies that σ²(ξ_2) ≤ E(ξ_2²) ≤ c 𝒟_q(λ). The desired result follows from Lemma 3.1.
Next we estimate the first term 𝒮_1. It is more difficult to handle because it involves the random function f_z, which varies with the sample z, and therefore requires controlling the complexity of the hypothesis class. For this purpose, we introduce the notion of empirical covering numbers, which often leads to sharp error estimates [16].
Definition 3.3.
Let (U, d) be a pseudometric space and S ⊂ U. For every ε > 0, the covering number 𝒩(S, ε, d) of S with respect to ε and d is defined as the minimal number of balls of radius ε whose union covers S, that is,
𝒩(S, ε, d) := min{ l ∈ ℕ : S ⊂ ⋃_{j=1}^l B(s_j, ε) for some {s_j}_{j=1}^l ⊂ U },    (3.10)
where B(s_j, ε) = {s ∈ U : d(s, s_j) ≤ ε} is a ball in U.
The l²-empirical covering number of a function set is defined by means of the normalized l²-metric d_2 on the Euclidean space ℝ^k, given by
d_2(a, b) = ( (1/k) ∑_{i=1}^k (a_i − b_i)² )^{1/2},  a = (a_i)_{i=1}^k, b = (b_i)_{i=1}^k ∈ ℝ^k.    (3.11)
Definition 3.4.
Let ℱ be a set of functions on X, x = (x_i)_{i=1}^k ⊂ X^k, and ℱ|_x = {(f(x_i))_{i=1}^k : f ∈ ℱ} ⊂ ℝ^k. Set 𝒩_{2,x}(ℱ, ε) = 𝒩(ℱ|_x, ε, d_2). The l²-empirical covering number of ℱ is defined by
𝒩_2(ℱ, ε) = sup_{k ∈ ℕ} sup_{x ∈ X^k} 𝒩_{2,x}(ℱ, ε),  ε > 0.    (3.12)
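As a toy illustration of Definitions 3.3 and 3.4, the following Python sketch evaluates a hypothetical one-parameter function class on k sample points and counts greedily chosen ε-balls in the normalized l²-metric d_2; since the greedy centers form an ε-cover, their number is an upper bound on the empirical covering number 𝒩_{2,x}(ℱ, ε).

import numpy as np

def d2(a, b):
    # normalized l2-metric on R^k
    return np.sqrt(np.mean((a - b) ** 2))

def greedy_cover_size(points, eps):
    # greedy centers form an eps-cover, so their count upper-bounds N(S, eps, d2)
    centers = []
    for p in points:
        if not any(d2(p, c) <= eps for c in centers):
            centers.append(p)
    return len(centers)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)                                # sample points x_1, ..., x_k
F_on_x = [np.sin(t * x) for t in np.linspace(0.0, 3.0, 200)]   # F = {sin(t*.)} restricted to x
for eps in (0.5, 0.2, 0.1):
    print(eps, greedy_cover_size(F_on_x, eps))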
Denote by B_R the ball of radius R > 0, B_R = {f ∈ ℋ_{K~} : ∥f∥_{K~} ≤ R}. We need the following capacity assumption on ℋ_{K~}.
Assumption 3.5.
There exist an exponent υ with 0 < υ < 2 and a constant C_{υ,K~} > 0 such that
log 𝒩_2(B_1, ε) ≤ C_{υ,K~} (1/ε)^υ  for all ε > 0,    (3.13)
where B_1 is the unit ball of ℋ_{K~} defined as above.
For any function f_w = ⟨w, Ψ(x)⟩ ∈ ℋ_{K~}, by the Hölder inequality, we have [figure omitted; refer to PDF] and it follows from (3.13) that [figure omitted; refer to PDF] where B_1^q is called the generalized unit ball of ℋ_{K~} associated with q, defined by [figure omitted; refer to PDF]
Note that for any function set ℱ ⊆ C(X), the empirical covering number 𝒩_{2,x}(ℱ, ε) is bounded by 𝒩(ℱ, ε), the (uniform) covering number of ℱ under the metric ∥·∥_∞, since d_2(f, g) ≤ ∥f − g∥_∞. It was shown in [20] that log 𝒩(B_1, ε) ≤ C_0 (1/ε)^s holds for some C_0 > 0 if K is C^{2n/s} on a subset of ℝ^n; hence log 𝒩_2(B_1, ε) ≤ C_0 (1/ε)^s also holds. In particular, s can be taken arbitrarily small for a C^∞ kernel (such as the Gaussian kernel). We now give a concrete example in ℝ^n to reveal the relationship between the regularity of a function class and its empirical covering number.
Example 3.6.
Let X be a bounded domain in ℝ^n and H^s the Sobolev space of index s. When s > n, the classical embedding theorem tells us that H^s(X) is an RKHS and its unit ball B_1 is embedded in a finite ball of the function space C^{s−(n/2)−ζ}(X) with bounded inclusion, where 0 < ζ < s − n. From the classical bounds for covering numbers of the unit ball of C^{s−(n/2)−ζ}(X), we see that [figure omitted; refer to PDF] Hence Assumption (3.13) holds with υ = n/(s − n/2 − ζ) < 2.
Our concentration estimate for the sample error term 𝒮_1 is based on the following concentration inequality, which can be found in [12].
Lemma 3.7.
Let ℱ be a class of measurable functions on Z. Assume that there are constants B, c > 0 and η ∈ [0, 1] such that ∥f∥_∞ ≤ B and Ef² ≤ c(Ef)^η for every f ∈ ℱ. If for some a > 0 and υ ∈ (0, 2), [figure omitted; refer to PDF] then there exists a constant c_υ such that for any t > 0, with confidence at least 1 − e^{−t}, there holds [figure omitted; refer to PDF] where [figure omitted; refer to PDF]
Denote by ℱ_R^q, with R > 0, the function set [figure omitted; refer to PDF]
Proposition 3.8.
If B_1^q satisfies the capacity condition (3.13) with some 0 < υ < 2, then for any δ ∈ (0, 1), with confidence 1 − δ/2, there holds [figure omitted; refer to PDF] with constant C_{υ,T} = 4 c_υ (C_{υ,K~} T²)^{υ/(2+υ)} M^{2υ/((2+υ)q*)}.
Proof.
Consider the set ℱ_R^q. Each function g ∈ ℱ_R^q can be expressed as g(z) = (y − π(f)(x))² − (y − f_ρ(x))² with some f ∈ B_R^q. Then Eg = ℰ(π(f)) − ℰ(f_ρ) = ∥π(f) − f_ρ∥_ρ² and (1/m) ∑_{i=1}^m g(z_i) = ℰ_z(π(f)) − ℰ_z(f_ρ). Note that [figure omitted; refer to PDF] Since |π(f)(x)| ≤ T and |f_ρ(x)| ≤ T for any x ∈ X, we see that, for any z ∈ Z, [figure omitted; refer to PDF] On the other hand, for any g_1, g_2 ∈ ℱ_R^q and any point z = (x, y), we have [figure omitted; refer to PDF] since the projection operator π is a contraction. It follows that [figure omitted; refer to PDF] It then follows from the capacity condition (3.15) that [figure omitted; refer to PDF]
Applying Lemma 3.7 with B = c = 4T², η = 1, and a = C_{υ,K~} (4TRM^{1/q*})^υ, we see that for any δ ∈ (0, 1), with confidence 1 − δ/2, there holds [figure omitted; refer to PDF] Besides, from the definition of f_z in (1.9), we have [figure omitted; refer to PDF] that is, ∥f_z∥_{2,q} ≤ T/√λ. Hence we can replace R with T/√λ, and the desired result follows.
4. Total Learning Rates
We are now in a position to derive the learning rates of the projected algorithm (1.9). The main result of this paper is presented in Theorem 4.1.
Following the error decomposition scheme of Proposition 2.3 and combining Propositions 3.2 and 3.8, we derive the following bound on the total error.
Theorem 4.1.
Suppose that B_1^q satisfies the capacity condition (3.13) with some 0 < υ < 2 and that 𝒟_q(λ) ≤ C_β λ^β. For any δ ∈ (0, 1), with confidence 1 − δ, there holds [figure omitted; refer to PDF] where C′ = 3κ²M^{2/q*} + 86T² + 1 + C_{υ,T}^{2υ/(2+υ)} + C_β and C_{υ,T} is defined as in Proposition 3.8.
Proof.
By Propositions 3.2 and 3.8, with confidence at least 1 − δ, (1/2)∥π(f_z) − f_ρ∥_ρ² can be bounded by [figure omitted; refer to PDF] First, we set [figure omitted; refer to PDF] which implies that λ = 1/m. On the other hand, from the assumption 𝒟_q(λ) ≤ C_β λ^β, we set [figure omitted; refer to PDF] Hence our assertion follows by taking λ = (1/m)^{min{2/(2β+(1+β)υ), 1}}.
Proof of Theorem 1.3.
When K~ is C^∞, condition (3.13) holds for arbitrarily small υ > 0. Moreover, f_ρ ∈ ℋ_{K~} implies that β = 1. The conclusion then follows easily from Theorem 4.1.
Our learning rates in Corollary 4.2 below are achieved under the regularity assumption that the regression function f_ρ lies in the range of L_{K~}^r for some r > 0. Given any kernel K, L_K is the integral operator on L²_{ρ_X} defined by
(L_K f)(x) = ∫_X K(x, t) f(t) dρ_X(t),  f ∈ L²_{ρ_X}.    (4.5)
The operator L_K is linear, compact, and positive, and can also be regarded as a self-adjoint operator on ℋ_K. Hence the fractional power operator L_K^r : L²_{ρ_X} → L²_{ρ_X} is well defined and given by
L_K^r f = ∑_k λ_k^r ⟨f, φ_k⟩_{L²_{ρ_X}} φ_k,    (4.6)
where {λ_k}_k are the eigenvalues of L_K arranged in decreasing order and {φ_k}_k are the corresponding eigenfunctions, which form an orthonormal basis of L²_{ρ_X}. In fact, the image of L_K^r is contained in ℋ_K if r ≥ 1/2. Thus the condition L_K^{-r} f_ρ ∈ L²_{ρ_X} means that f_ρ lies in the range of L_K^r, and it measures the regularity of the regression function.
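Numerically, the fractional power L_K^r can be realized through an eigendecomposition. The Python sketch below uses the empirical Gram matrix divided by m as a finite-sample stand-in for L_K (an assumption of this illustration, not part of the analysis above) and checks that the square of L_K^{1/2} recovers L_K.

import numpy as np

def fractional_operator_power(K_gram, r):
    # L_K^r for the empirical operator L_K ~ K_gram / m, via eigendecomposition
    m = K_gram.shape[0]
    evals, evecs = np.linalg.eigh(K_gram / m)
    evals = np.clip(evals, 0.0, None)                  # remove tiny negative round-off
    return evecs @ np.diag(evals ** r) @ evecs.T

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 1))
K = np.exp(-(X - X.T) ** 2)                            # Gaussian Gram matrix on the sample
L_half = fractional_operator_power(K, 0.5)
print(np.allclose(L_half @ L_half, K / K.shape[0], atol=1e-8))   # True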
Corollary 4.2.
Suppose that B_1^q satisfies the capacity condition (3.13) with some 0 < υ < 2 and that L_{K~}^{-r} f_ρ ∈ L²_{ρ_X} (0 < r ≤ 1). For any 0 < δ < 1, with confidence 1 − δ, there holds [figure omitted; refer to PDF] where C~ is some constant independent of m or δ.
Proof.
Recall a result from [18]: if L_{K~}^{-r} f_ρ ∈ L²_{ρ_X} (0 < r ≤ 1/2), there holds [figure omitted; refer to PDF] If r ≥ 1/2, then f_ρ ∈ ℋ_{K~} as mentioned above, and we may take f = f_ρ in the definition of the regularization error, which gives [figure omitted; refer to PDF] Hence for any r ∈ [0, 1], we have [figure omitted; refer to PDF]
On the other hand, by Lemma 5.1 below, we have [figure omitted; refer to PDF] where C* is some constant and q ∈ [1, 2]. The conclusion follows immediately from Theorem 4.1.
Let us compare our learning rates with the existing results.
In [10], a uniform covering number technique was used to derive the expected error of the learning scheme (1.1) with Ω(f_w) = ∥w∥_K². Suppose all the kernels {K_i} coincide with a single kernel K and L_K^{-r} f_ρ ∈ L²_{ρ_X} for some 0 < r ≤ 1/2. Then for any 0 < ζ < 1/(1+υ) and any 0 < δ < 1, with confidence 1 − δ, [figure omitted; refer to PDF] Clearly, the learning rates derived from Corollary 4.2 are better than those in [10], since 2rζ < 2r/(1+υ) ≤ 4r/(4r+(1+2r)υ).
In [21], an operator monotonicity technique was used to improve the kernel-independent error bounds of [17]. If L_K^{-r} f_ρ ∈ L²_{ρ_X} for some 0 < r ≤ 1, then for any 0 < δ < 1, with confidence 1 − δ, there holds [figure omitted; refer to PDF]
When r ≥ 1/2 and υ ≤ (2 − r)/(3r), the learning rate given by Corollary 4.2 is better than the above result.
As for empirical risk minimization (ERM), classical results give error bounds between the empirical target function and the regression function. In particular, learning rates of the type m^{−ε} with ε arbitrarily close to 1 can be achieved by ERM schemes (see [15]). However, the ERM setting differs from Tikhonov regularization: how to choose the regularization parameter λ = λ(m), depending on the sample size m, is the essential difficulty for the regularization scheme, even when f_ρ lies in ℋ_K. On the other hand, our result is more general than that of [15], since the case f_ρ ∉ ℋ_K (r < 1/2) is also covered.
5. Discussion on Regularization Error
By our assumptions on the M kernels {K_j}_{j=1}^M, the space ℋ_{K~} is an RKHS generated by the Mercer kernel K~. There are several standard results on the regularization error 𝒟_2(λ) in learning theory (see [17]). Next we establish a tight connection between 𝒟_2(λ) and 𝒟_q(λ) for 1 ≤ q ≤ 2.
Lemma 5.1.
Let ℋ_{K~} be a separable RKHS over X associated with a bounded measurable kernel, let ρ be a distribution on X × [−T, T], and let 1 ≤ q ≤ 2. If there exist constants C > 0 and β > 0 such that 𝒟_2(λ) ≤ Cλ^β, then for all λ > 0 we have [figure omitted; refer to PDF]
Proof.
Let f~_λ be a function satisfying [figure omitted; refer to PDF] Then λ∥f~_λ∥_{2,2}² ≤ Cλ^β. Write f~_λ(x) = ⟨w~, Ψ(x)⟩ with some w~ = (w~^{(1)}, ..., w~^{(M)}); by the Hölder inequality, we have [figure omitted; refer to PDF] Then we obtain [figure omitted; refer to PDF]
In other words, if 𝒟_2(λ) has a polynomial behavior in λ, then this behavior completely determines the behavior of all 𝒟_q(λ). Thus it suffices to assume that the standard (q = 2) approximation error function satisfies (2.2).
From the statistical effective dimension point of view, we now discuss the impact of the multikernel class ℋ_{K~} on the approximation error ∥f_λ − f_ρ∥_{L²_{ρ_X}}. To estimate this error, note that the regularizing function of 𝒟_2(λ) exists, is unique, and is given by [7]
f_λ = (λI + L_{K~})^{−1} L_{K~} f_ρ.    (5.5)
For simplicity, let M = 2 and take a Mercer kernel K_o as the original one; by the classical Mercer theorem, K_o can be expressed as K_o(x, y) = ∑_{k=1}^∞ λ_k φ_k(x) φ_k(y). The second kernel we take is K_o^N = ∑_{k=1}^∞ λ_k^N φ_k(x) φ_k(y) (2 ≤ N ∈ ℕ). In this case, K~ = K_o + K_o^N = ∑_{k=1}^∞ (λ_k + λ_k^N) φ_k(x) φ_k(y). By the fact that f_λ − f_ρ = λ(λI + L_{K~})^{−1} f_ρ and the assumption L_{K_o}^{−r} f_ρ ∈ L²_{ρ_X}, we have [figure omitted; refer to PDF]
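Since f_λ − f_ρ = λ(λI + L_{K~})^{−1} f_ρ acts diagonally in the Mercer eigenbasis, the squared approximation error equals ∑_k (λ/(λ + λ_k + λ_k^N))² c_k², where c_k are the expansion coefficients of f_ρ. The Python sketch below evaluates this quantity for assumed eigenvalue and coefficient decays (toy choices, not taken from the paper) and compares K_o alone with the mixture K~ = K_o + K_o^N.

import numpy as np

def spectral_approx_error(lam, eigs, coefs):
    # ||f_lambda - f_rho||^2 = sum_k (lam / (lam + eig_k))^2 * c_k^2 in the eigenbasis
    return np.sum((lam / (lam + eigs)) ** 2 * coefs ** 2)

k = np.arange(1, 5001, dtype=float)
eigs_o = k ** -2.0                          # assumed eigenvalues lambda_k of K_o
N = 3
eigs_mix = eigs_o + eigs_o ** N             # eigenvalues lambda_k + lambda_k^N of K_o + K_o^N
coefs = eigs_o ** 1.5                       # assumed coefficients c_k of a highly regular f_rho
for lam in (1e-1, 1e-2, 1e-3):
    print(lam, spectral_approx_error(lam, eigs_o, coefs),
          spectral_approx_error(lam, eigs_mix, coefs))

Because the mixture's eigenvalues dominate those of K_o, this particular quantity can only decrease when the second kernel is added.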
Let us compare the multikernel class regularization with Tikhonov regularization in ℋ_{K_o} when only the Mercer kernel K_o is employed. Define the saturation index as the maximal r such that the approximation error achieves its fastest decay rate under the condition L_{K_o}^{−r} f_ρ ∈ L²_{ρ_X}. Then (5.6) shows that the saturation index for the multikernel class regularization is N, while it is 1 for Tikhonov regularization in ℋ_{K_o}, as shown in [17].
In this case, our analysis suggests using an alternative kernel with faster eigenvalue decay when the spectral coefficients of the target function decay faster: for example, using K_o^N instead of K_o. This has a dimension reduction effect: we effectively project the data onto its principal components. The intuition is also quite clear: if the effective dimension of the target function is small (its spectral coefficients decay fast), then we should project the data onto those dimensions and suppress the remaining noisy dimensions, which corresponds to fast kernel eigenvalue decay (large N). A similar idea, in the framework of semisupervised learning, appears in spectral kernel design methods [22, 23].
In general, for the sample error there exist rates of convergence that hold independently of the underlying distribution ρ. This is important, as it tells us that we can give convergence guarantees no matter which kernel is used, even when we do not know the underlying distribution; indeed, this is common in the statistical analysis of machine learning algorithms (see [24]). This decay is usually fast enough for practical purposes when sufficiently many samples are available. For the approximation error, however, it is impossible to give rates of convergence that hold for all probability distributions ρ. Hence what ultimately determines the learning accuracy is the approximation error. In the kernel regression setup, this is determined by the choice of the kernel, which underscores the importance of learning kernels [4] and of constructing refined kernels [25].
6. Further Study
In this last section, we discuss sparsity in the case of the square-loss regularization functional (1.1) with the regularizer Ω(f) = ∥f∥_K² in an RKHS. An explicit expression for this functional can be derived from [4], which in turn provides an improvement and simplification of our algorithm (1.4).
Lemma 6.1.
For any kernel K and positive constant λ, we have
min_{f ∈ ℋ_K} { (1/m) ∑_{i=1}^m (y_i − f(x_i))² + λ ∥f∥_K² } = λ ⟨y, (K(x) + λmI)^{−1} y⟩,    (6.1)
where y = (y_1, ..., y_m)^T, K(x) denotes the m × m Gram matrix (K(x_i, x_j) : i, j = 1, ..., m), and ⟨·,·⟩ denotes the standard inner product in Euclidean space.
According to Lemma 6.1, the least-squares algorithm (1.4) can be rewritten as a one-layer minimization problem [figure omitted; refer to PDF]
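The identity in Lemma 6.1 is the classical closed form for the inner single-kernel least-square problem and is easy to verify numerically. The Python sketch below (hypothetical data and a single Gaussian kernel, chosen only for illustration) compares the minimized value computed from the representer-theorem solution with the closed form λ⟨y, (K(x) + λmI)^{−1} y⟩.

import numpy as np

rng = np.random.default_rng(4)
m, lam = 25, 0.1
X = rng.normal(size=(m, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m)
K = np.exp(-(X - X.T) ** 2)                            # Gram matrix K(x)

# minimizer f = sum_j c_j K(., x_j) with c = (K + lam*m*I)^{-1} y (representer theorem)
c = np.linalg.solve(K + lam * m * np.eye(m), y)
value_direct = np.mean((K @ c - y) ** 2) + lam * c @ K @ c

# closed form from Lemma 6.1
value_closed = lam * y @ np.linalg.solve(K + lam * m * np.eye(m), y)
print(np.isclose(value_direct, value_closed))          # True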
We assume that f_ρ ∈ ℋ_{θ_0} for some θ_0 ∈ ℝ^M. Define J_θ := supp(θ) = {k : θ_k ≠ 0}. We say that f_ρ is sparse relative to θ_0 if the cardinality of J_{θ_0} is far smaller than M.
For p ∈ [1, p_M] (with p_M close to 1), it would be interesting to show that the solution θ_z of (6.2) is approximately sparse along the pattern of θ_0, in the sense that ∑_{k ∉ J_{θ_0}} (θ_z)_k is very small with very high probability. A refined analysis of lp-regularized methods was carried out by Koltchinskii [26] for combinations of M basis functions, taking into account the soft sparsity pattern of the Bayes function and establishing several oracle inequalities in the statistical sense. Extending these ideas to the kernel learning setting would be of great significance, because it would provide theoretical support showing that lp-norm MKL can automatically select good kernels that coincide with the underlying right kernels.
Acknowledgments
The authors would like to thank the two anonymous referees for their valuable comments and suggestions, which have substantially improved this paper. The research of S.-G. Lv is supported in part by the 211 Project for the Southwestern University of Finance and Economics, China (Project no. 211QN2011028), as well as by the Annual Cultivation of the Fundamental Research Funds for the Central Universities (Project no. JBK120940).
[1] C. Cortes, M. Mohri, A. Rostamizadeh, "Generalization bounds for learning kernels," in Proceedings of the 27th International Conference on Machine Learning (ICML '10), pp. 247-254, June 2010.
[2] M. Kloft, U. Brefeld, S. Sonnenburg, A. Zien, " lp -norm multiple kernel learning," Journal of Machine Learning Research , vol. 12, pp. 953-997, 2011.
[3] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. I. Jordan, "Learning the kernel matrix with semidefinite programming," Journal of Machine Learning Research , vol. 5, pp. 27-72, 2004.
[4] C. A. Micchelli, M. Pontil, "Learning the kernel function via regularization," Journal of Machine Learning Research , vol. 6, pp. 1099-1125, 2005.
[5] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society , vol. 68, pp. 337-404, 1950.
[6] O. Bousquet, A. Elisseeff, "Stability and generalization," Journal of Machine Learning Research , vol. 2, no. 3, pp. 499-526, 2002.
[7] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1-49, 2002.
[8] Y. K. Zhu and H. W. Sun, "Consistency analysis of spectral regularization algorithms," Abstract and Applied Analysis, vol. 2012, 2012.
[9] Y. Ying, D.-X. Zhou, "Learnability of Gaussians with flexible variances," Journal of Machine Learning Research , vol. 8, pp. 249-276, 2007.
[10] Q. Wu, Y. Ying, D.-X. Zhou, "Learning rates of least-square regularized regression," Foundations of Computational Mathematics , vol. 6, no. 2, pp. 171-192, 2006.
[11] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, New York, NY, USA, 2001.
[12] Q. Wu, Y. Ying, D.-X. Zhou, "Multi-kernel regularized classifiers," Journal of Complexity , vol. 23, no. 1, pp. 108-134, 2007.
[13] M. Yuan, Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society B , vol. 68, no. 1, pp. 49-67, 2006.
[14] A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet, "More efficiency in multiple kernel learning," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 775-782, Corvallis, Ore, USA, June 2007.
[15] M. Kloft and G. Blanchard, "The local Rademacher complexity of lp-norm multiple kernel learning," in Advances in Neural Information Processing Systems (NIPS '11), pp. 2438-2446, MIT Press, 2011.
[16] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[17] S. Smale, D.-X. Zhou, "Learning theory estimates via integral operators and their approximations," Constructive Approximation , vol. 26, no. 2, pp. 153-172, 2007.
[18] S. Smale, D.-X. Zhou, "Shannon sampling. II. Connections to learning theory," Applied and Computational Harmonic Analysis , vol. 19, no. 3, pp. 285-302, 2005.
[19] C. A. Micchelli, M. Pontil, Q. Wu, and D.-X. Zhou, "Error bounds for learning the kernel," Research Note 05-09, University College London, London, UK, 2005.
[20] D.-X. Zhou, "Capacity of reproducing kernel spaces in learning theory," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1743-1752, 2003.
[21] H. Sun, Q. Wu, "A note on application of integral operator in learning theory," Applied and Computational Harmonic Analysis , vol. 26, no. 3, pp. 416-421, 2009.
[22] O. Chapelle, J. Weston, and B. Schölkopf, "Cluster kernels for semi-supervised learning," in Advances in Neural Information Processing Systems 15 (NIPS '03), pp. 585-592, MIT Press, 2003.
[23] R. Johnson and T. Zhang, "Graph-based semi-supervised learning and spectral kernel design," IEEE Transactions on Information Theory, vol. 54, no. 1, pp. 275-288, 2008.
[24] U. von Luxburg and B. Schölkopf, "Statistical learning theory: models, concepts, and results."
[25] Y. Xu, H. Zhang, "Refinement of reproducing kernels," Journal of Machine Learning Research , vol. 10, pp. 107-140, 2009.
[26] V. Koltchinskii, "Sparsity in penalized empirical risk minimization," Annales de l'Institut Henri Poincaré Probabilités et Statistiques , vol. 45, no. 1, pp. 7-57, 2009.
Copyright © 2012 Shao-Gao Lv and Jin-De Zhu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
The problem of learning the kernel function via linear combinations of multiple kernels has recently attracted considerable attention in machine learning. In particular, by imposing an lp-norm penalty on the kernel combination coefficients, multiple kernel learning (MKL) has proved useful and effective for theoretical analysis and practical applications (Kloft et al., 2009, 2011). In this paper, we present a theoretical analysis of the approximation error and learning ability of lp-norm MKL. Our analysis yields explicit learning rates for lp-norm MKL and demonstrates some notable advantages compared with traditional kernel-based learning algorithms where the kernel is fixed.





