LEARNING RATES OF KERNEL-BASED ROBUST CLASSIFICATION*


Shuhua WANG (王淑华)

School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China

E-mail: w614sh@126.com

Baohuai SHENG (盛宝怀)†

Department of Finance, Zhejiang Yuexiu University, Shaoxing 312030, China; Department of Applied Statistics, Shaoxing University, Shaoxing 312000, China

E-mail: shengbaohuai@163.com

Abstract This paper considers a robust kernel regularized classification algorithm with a non-convex loss function, which is proposed to alleviate the performance deterioration caused by outliers. A comparison relationship between the excess misclassification error and the excess generalization error is provided; from this, along with convex analysis theory, a kind of learning rate is derived. The results show that the performance of the classifier is affected by the outliers, and that the extent of this impact can be controlled by choosing the homotopy parameters properly.

Key words Support vector machine; robust classification; quasiconvex loss function; learning rate; right-sided directional derivative

1 Introduction

The problem of classification is one of the most studied in learning theory. The goal of classification is to construct a classifier with a small misclassification error. Of the many classical learning methods that have been proposed, the support vector machine (SVM) has been one of the most successful learning algorithms (see, for example, [1–6]).

Let X be a given compact set in the d-dimensional Euclidean space R^d (the input space), and let Y = {−1, 1} (representing the two classes). Let ρ be a fixed but unknown probability distribution on Z = X × Y which yields its marginal distribution ρ_X on X and its conditional distribution ρ(·|x) = P(·|x) at x ∈ X. We denote by z = {(x_i, y_i)}_{i=1}^m the samples drawn i.i.d. (independently and identically) according to the distribution ρ.

As we know, the goal of classification is to produce a binary classifier C: X → Y. The misclassification error of C, which is used to measure the prediction power of C, is defined by

R(C) = Prob{C(x) ≠ y} = ∫_X P(y ≠ C(x) | x) dρ_X.

The best classifier, called the Bayes rule, is given by

f_c(x) = 1 if P(y = 1 | x) ≥ P(y = −1 | x), and f_c(x) = −1 otherwise.

The classifiers considered in this paper are induced by real-valued functions f: X → R as C_f = sgn(f), which is defined by sgn(f)(x) = 1 if f(x) ≥ 0 and sgn(f)(x) = −1 otherwise. Recall that the regression function of ρ is

f_ρ(x) = ∫_Y y dρ(y | x) = P(y = 1 | x) − P(y = −1 | x),  x ∈ X.

It is easy to see that f_c(x) = sgn(f_ρ)(x).
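For illustration only (not part of the original argument), the following minimal Python sketch evaluates a hypothetical conditional probability η(x) = P(y = 1|x), forms the regression function f_ρ(x) = P(y = 1|x) − P(y = −1|x), and applies the sign convention sgn(f)(x) = 1 whenever f(x) ≥ 0, so that the Bayes rule f_c = sgn(f_ρ) can be read off directly; the logistic form of η is an arbitrary assumption.

```python
import numpy as np

# Hypothetical conditional probability eta(x) = P(y = 1 | x); any model would do here.
def eta(x):
    return 1.0 / (1.0 + np.exp(-3.0 * x))

x = np.linspace(-2.0, 2.0, 9)
f_rho = 2.0 * eta(x) - 1.0               # regression function: P(y=1|x) - P(y=-1|x)
f_c = np.where(f_rho >= 0.0, 1, -1)      # Bayes rule: f_c = sgn(f_rho), with sgn(0) := 1

print(np.round(f_rho, 3))
print(f_c)
```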

The Tikhonov regularization technique is an effective method for solving ill-posed problems ([7–10]). In recent years, many classification algorithms generated from Tikhonov regularization schemes have been developed and their convergence rates have been investigated (see, for example, [11–17] and the references therein). We now briefly introduce the general framework of the kernel regularization classification algorithm.

Given a classifying loss function V(r): R → R_+, the generalization error of f is defined as

E^V(f) = ∫_Z V(yf(x)) dρ,  f ∈ F,

where F is the set of all measurable functions on X.

We call H_K a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel K(x, y) (see, for example, [17–23]) if, for all f ∈ H_K and x ∈ X, it holds that

f(x) = ⟨f, K_x⟩_K,  K_x(·) := K(x, ·).

The general kernel regularized classification associated with a given reproducing kernel Hilbert space H_K takes the form

min_{f∈H_K} { (1/m) Σ_{i=1}^m V(y_i f(x_i)) + λ‖f‖_K² },    (1.1)

where λ > 0 is the regularization parameter.
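As a minimal computational sketch of scheme (1.1) (not the authors' implementation), one may parametrize f by the representer theorem, f = Σ_j α_j K(x_j, ·), so that ‖f‖_K² = α⊤Kα, and minimize the regularized empirical risk numerically. The Gaussian kernel, the hinge loss standing in for a generic V, the toy data, and the use of scipy.optimize.minimize are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X1, X2, sigma=1.0):
    # Mercer kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumption.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_kernel_classifier(X, y, lam=0.1, loss=lambda r: np.maximum(0.0, 1.0 - r)):
    # Representer parametrization: f(x) = sum_j alpha_j K(x_j, x), ||f||_K^2 = alpha^T K alpha.
    K = gaussian_kernel(X, X)

    def objective(alpha):
        margins = y * (K @ alpha)
        return loss(margins).mean() + lam * alpha @ K @ alpha   # empirical risk + lambda ||f||_K^2

    res = minimize(objective, np.zeros(len(y)), method="L-BFGS-B")
    return res.x

# Toy data: two Gaussian blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (20, 2)), rng.normal(1.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
alpha = fit_kernel_classifier(X, y)
pred = np.sign(gaussian_kernel(X, X) @ alpha)
print("training accuracy:", (pred == y).mean())
```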

Based on model (1.1), a large number of scholars have studied kernel regularization classification algorithms with convex loss functions (see, for example, [11–14, 16, 17, 19]). However, most of these research results do not consider the impact of outliers on the performance of the algorithms. It is known that, in many practical applications, outliers often occur in real data sets. Outliers can be described as data points that are far away from the pattern set by the majority of the data (see [24–26]). Outliers usually have a large impact on non-robust methods and can even make the whole statistical analysis useless if a non-robust loss is used (see, for example, [27, 28]).

In the literature of learning theory, a great deal of effort has been made to improve the robustness of the SVM and other similar learning algorithms. Robust loss functions for SV classification have been introduced and intensively studied (see, for example, [27–30]). Unfortunately, robustness comes at a cost: any convex loss function necessarily has unbounded sensitivity to even a single outlier, so when a convex loss is minimized, a single data point can dominate the result. The classical literature achieves bounded outlier sensitivity by considering redescending loss functions or bounded loss functions [27], both of which are inherently non-convex. As observed, non-convex loss functions are necessary to alleviate the effect of outliers. Along these lines, S. Suzumura, K. Ogawa, et al. [31] introduced a novel homotopy approach for non-convex robust SVM learning. The basic idea is to introduce parameters which bridge the standard SVM and the fully robust SVM and express the influence of outliers. The loss function V_θ,s(r): R → [0,∞) that they defined is

V_θ,s(r) = [1 − r]_+  for r > s,   and   V_θ,s(r) = θ[1 − r]_+ + (1 − θ)(1 − s)  for r ≤ s,    (1.2)

where [1 − r]_+ := max{0, 1 − r}, θ ∈ [0,1] and s ≤ 0.

It can be seen that V_θ,s(r) is characterized by a pair of parameters. We refer to θ and s as the tuning parameters which govern the influence of outliers. Figure 1 shows the loss function for several values of θ and s.
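A short numerical sketch of the loss (1.2) may help to see the role of the two parameters: the loss agrees with the hinge loss for margins r above s and is flattened (slope θ, plus the constant (1 − θ)(1 − s)) below s. The particular margin values are arbitrary, and the final line checks that θ = 1 recovers the plain hinge loss.

```python
import numpy as np

def hinge(r):
    """Standard hinge loss [1 - r]_+."""
    return np.maximum(0.0, 1.0 - r)

def robust_loss(r, theta=0.25, s=-1.0):
    """Homotopy robust loss V_{theta,s}(r): hinge for r > s, flattened for r <= s."""
    r = np.asarray(r, dtype=float)
    return np.where(r > s, hinge(r), theta * hinge(r) + (1.0 - theta) * (1.0 - s))

margins = np.array([-3.0, -1.5, -1.0, 0.0, 0.5, 2.0])
print(robust_loss(margins, theta=0.25, s=-1.0))
# With theta = 1 the loss reduces to the plain hinge loss.
print(np.allclose(robust_loss(margins, theta=1.0, s=-1.0), hinge(margins)))
```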

Figure 1 The robust classification loss function with different values of θ and s

The robust SV classification algorithm defined in [31] is

min_w  (1/2)‖w‖² + C Σ_{i=1}^m V_θ,s(y_i w⊤φ(x_i)),

where f(x) = w⊤φ(x) (with φ(x) being the feature map implicitly defined by a kernel K), w is a vector in the feature space, ⊤ denotes the transpose of vectors and matrices, and C > 0 is the penalty parameter.

Algorithms and experiments in [31] illustrate the significance of this homotopy approach. The purpose of this paper is to analyze the convergence rate of the excess misclassification error associated with the non-convex loss function V_θ,s(r) defined as in (1.2), and to understand how the choice of the homotopy parameters affects the error. The main difference between our work and reference [31] is that we estimate the error bound of the algorithm theoretically, while reference [31] mainly studies the solution method of the model from the perspective of numerical computation.

The convex analysis method has become an effective error analysis tool in the theoretical research of machine learning (see, for example, [32–39]). Since the loss function V_θ,s is non-convex, the usual convex method cannot be used directly. In [39], the convergence rate of kernel regularized robust SV regression is provided by using an improved convex analysis method. Unlike the regression problem, the performance of a classification algorithm is usually reflected by the convergence rate of the excess misclassification error rather than by the excess generalization error considered in [39]. Thus, the convergence analysis of classification algorithms is more difficult than that of regression algorithms. In order to overcome this difficulty, we improve the method of reference [39] by the technique of a Fenchel–Legendre conjugate, and an important comparison inequality is derived.

The rest of this paper is organized as follows: in Section 2, the kernel regularized robust SV classification model considered in this paper is given; Section 3 presents the main results of the paper, where a comparison inequality and the learning rate are provided; some proofs of the main results are given in Section 4.

2 The Kernel Regularized Robust SV Classification

In this section, a regularized robust SV classification model is provided.

With the loss function V_θ,s(r) given by (1.2), we define the generalization error of f: X → R as

E^{θ,s}(f) = ∫_Z V_θ,s(yf(x)) dρ.

Also, the general kernel regularized classification associated with the reproducing kernel Hilbert space H_K takes the form

min_{f∈H_K} { (1/m) Σ_{i=1}^m V_θ,s(y_i f(x_i)) + λ‖f‖_K² }.    (2.1)

In this paper, we consider a partition of the samples z = {(x_i, y_i)}_{i=1}^m into two disjoint sets I and O. The samples in I and O are defined as Inliers and Outliers, respectively. We impose the restriction that the margin y_i f(x_i) of an Inlier should be larger than or equal to s, while that of an Outlier should be smaller than s. In other words,

I = {i ∈ M : y_i f(x_i) ≥ s},  O = {i ∈ M : y_i f(x_i) < s},

where M := {1, ..., m}. Let P = {I, O} ∈ 2^M denote the partition, where 2^M is the power set of M.

With these notions in hand, the kernel regularized homotopy algorithm is defined as (see [31])

where |I|_s and |O|_s are the cardinalities of I and O, respectively, and |I|_s + |O|_s = m.
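For concreteness, the partition into Inliers and Outliers by the margin threshold s can be computed as in the sketch below; the predictions f(x_i) shown are hypothetical, and by construction |I|_s + |O|_s = m.

```python
import numpy as np

def partition_by_margin(f_vals, y, s=-1.0):
    """Split sample indices into Inliers (margin >= s) and Outliers (margin < s)."""
    margins = y * f_vals
    inliers = np.where(margins >= s)[0]
    outliers = np.where(margins < s)[0]
    return inliers, outliers

y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
f_vals = np.array([0.8, 0.3, -2.5, 1.2, -0.7])   # hypothetical predictions f(x_i)
I, O = partition_by_margin(f_vals, y, s=-1.0)
print("inliers:", I, "outliers:", O)
print("|I|_s + |O|_s == m:", len(I) + len(O) == len(y))
```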

Take

We know, from the analysis process of [31] and [31, Proposition 2.1], that Ω_z(f) is a convex function on H_K. Then, (2.2) can be rewritten as

Note that when θ = 1 or s = −∞, algorithm (2.3) is equivalent to the standard formulation of SV classification with the hinge loss V(r) = [1 − r]_+ (see [11]).

Our goal is to investigate the excess misclassification error for algorithm (2.3). In the present work, the sample error for algorithm (2.3) is estimated with an improved convex approach, and a capacity-independent error bound is provided. The excess misclassification error will be bounded in terms of a K-functional associated with model (2.1).

3 Error Analysis

3.1 The comparison inequality

We need to bound the excess misclassification error R(sgn(f)) − R(f_c) to measure the performance of algorithm (2.3). The algorithm is designed by minimizing a penalized empirical error associated with the loss function V_θ,s, so relations between the excess misclassification error and the excess generalization error become crucial. A systematic study of this problem was carried out in [37].

In [11], the following important results are derived:

Lemma 3.1 ([11]) Let V be a classifying loss satisfying V''(0) > 0. Then there exists a constant c_V > 0 such that, for any measurable function f: X → R, it holds that

R(sgn(f)) − R(f_c) ≤ c_V ( E^V(f) − inf_{g∈F} E^V(g) )^{1/2}.

Let V(r) = [1 − r]_+. Then, for any measurable function f: X → R, it holds that

R(sgn(f)) − R(f_c) ≤ E^V(f) − E^V(f_c).    (3.1)

In this subsection we shall give a comparison inequality for V_θ,s(r).

Denote η = η(x) = P(Y = 1 | X = x) and Φ_η(r) = ηV_θ,s(r) + (1 − η)V_θ,s(−r). Then it holds that

E^{θ,s}(f) = ∫_X Φ_{η(x)}(f(x)) dρ_X.

For η ∈ [0,1], we define the optimal conditional V_θ,s-error as

H(η) := inf_{r∈R} Φ_η(r),

and the optimal conditional V_θ,s-error restricted to the wrong sign as

H^−(η) := inf_{r: r(2η−1) ≤ 0} Φ_η(r).    (3.2)

Then the optimal V_θ,s-error satisfies

inf_{f∈F} E^{θ,s}(f) = ∫_X H(η(x)) dρ_X.

We define r*: [0,1] → R as

r*(η) := arg min_{r∈R} Φ_η(r).

It is easy to see that, for any nonnegative loss function V, H^−(η) ≥ H(η) holds.

Definition 3.2 ([37]) We say that the loss function V is classification-calibrated if, for any η ∈ [0,1] with η ≠ 1/2, it holds that H^−(η) > H(η).

Definition 3.3 ([37]) Given a loss function V: R → [0,∞), we define the Ψ-transform function Ψ: [−1,1] → [0,∞) by Ψ(τ) := Ψ̃**(τ), where

Ψ̃(τ) := H^−((1 + τ)/2) − H((1 + τ)/2),

and g**: [−1,1] → R is the Fenchel–Legendre biconjugate of g: [−1,1] → R, which is characterized by epi g** = (epi g)‾. Here S‾ denotes the closure of the convex hull of the set S, and epi g is the epigraph of the function g, that is, the set {(x, t): x ∈ [−1,1], g(x) ≤ t}.
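The quantities in Definitions 3.2 and 3.3 can be inspected numerically. The sketch below approximates H(η) and H⁻(η) by brute-force minimization of Φ_η over a grid of r values and prints Ψ̃(τ) = H⁻((1+τ)/2) − H((1+τ)/2); the biconjugation step is omitted, and the choices θ = 0.25, s = −0.5, and the grid are arbitrary assumptions. Strictly positive values for τ > 0 are a numerical indication of classification-calibration (compare Lemma 3.4).

```python
import numpy as np

def robust_loss(r, theta, s):
    hinge = np.maximum(0.0, 1.0 - r)
    return np.where(r > s, hinge, theta * hinge + (1.0 - theta) * (1.0 - s))

def H_and_H_minus(eta, theta=0.25, s=-0.5, r_grid=np.linspace(-5.0, 5.0, 20001)):
    """Grid approximations of H(eta) = inf_r Phi_eta(r) and of
    H^-(eta) = inf over those r with r (2 eta - 1) <= 0."""
    phi = eta * robust_loss(r_grid, theta, s) + (1.0 - eta) * robust_loss(-r_grid, theta, s)
    H = phi.min()
    H_minus = phi[r_grid * (2.0 * eta - 1.0) <= 0.0].min()
    return H, H_minus

# Psi-tilde(tau) = H^-((1 + tau)/2) - H((1 + tau)/2); Psi is its convex biconjugate.
for tau in [0.0, 0.25, 0.5, 1.0]:
    H, Hm = H_and_H_minus((1.0 + tau) / 2.0)
    print(f"tau = {tau:4.2f},  Psi-tilde(tau) ~ {Hm - H:.4f}")
```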

Lemma 3.4 ([37]) For any nonnegative loss function V, classification-calibration implies that Ψ is invertible on [0,1], and that the following statements are equivalent:

(a) V is classification-calibrated;

(b) Ψ(τ) > 0 for all τ ∈ (0,1].

For the loss function V_θ,s(r) considered in this paper, we have the following result:

Lemma 3.5 Let V_θ,s(r) be the loss function defined as in (1.2), and let τ ∈ [−1,1]. For s ≤ −1, we have

and for −1 < s ≤ 0, we have

where d_θ,s = 2θ + (1 − θ)(1 − s).

Proof (1) For s ≤ −1, we discuss H(η) and H^−(η). We consider the form of Φ_η(r), which is a convex combination of V_θ,s(r) and V_θ,s(−r).

For η = 0, Φ_η(r) = V_θ,s(−r), and each r ≤ −1 makes Φ_η(r) = 0. Similarly, for η = 1, Φ_η(r) = V_θ,s(r), and each r ≥ 1 makes Φ_η(r) = 0. For 0 < η < 1/2, Φ_η(r) decreases strictly on (−∞, −1] and increases on [−1, +∞). For 1/2 < η < 1, Φ_η(r) decreases strictly on (−∞, 1] and increases on [1, +∞). For η = 1/2, Φ_η(r) decreases strictly on (−∞, −1], increases on [1, +∞), and Φ_η(r) ≡ 1 on [−1, 1].

Therefore, by the definition of r*(η), we have

This implies that r*(η) = sgn(η − 1/2) for all η ∈ [0,1], and

By the definition of H(η), we have

On the other hand, we need to derive the concrete expression of H^−(η). According to (3.2), we combine the monotonicity of Φ_η discussed above with the constraint condition r(2η − 1) ≤ 0. The detailed derivation is given below.

For η = 0, Φ_η(r) = V_θ,s(−r). Since 2η − 1 = −1 < 0, the constraint condition implies that r ≥ 0. Φ_η(r) attains its minimum value at r = 0, and the minimum value is Φ_η(0) = V_θ,s(0) = 1. Similarly, for η = 1, Φ_η(r) = V_θ,s(r). Since 2η − 1 = 1 > 0, the constraint condition implies that r ≤ 0. Φ_η(r) attains its minimum value at r = 0, and the minimum value is Φ_η(0) = V_θ,s(0) = 1. For 0 < η ≤ 1/2, 2η − 1 ≤ 0, so the constraint condition implies that r ≥ 0. Since Φ_η(r) increases on [0, +∞), Φ_η(r) attains its minimum value at r = 0, and the minimum value is Φ_η(0) = ηV_θ,s(0) + (1 − η)V_θ,s(0) = 1. For 1/2 < η < 1, 2η − 1 > 0, so the constraint condition implies that r ≤ 0. Since Φ_η(r) decreases strictly on (−∞, 0], Φ_η(r) attains its minimum value at r = 0, and the minimum value is Φ_η(0) = ηV_θ,s(0) + (1 − η)V_θ,s(0) = 1.

Thus, for any 0 ≤ η ≤ 1, we have

(2) For −1 < s ≤ 0, we discuss H(η) and H^−(η).

For η = 0, Φ_η(r) = V_θ,s(−r), and it is easy to see that Φ_η(r) = 0 for each r ≤ −1. Similarly, for η = 1, Φ_η(r) = V_θ,s(r), and each r ≥ 1 makes Φ_η(r) = 0. For 0 < η < 1/2, Φ_η(r) decreases on (−∞, −1] ∪ [−s, 1] and increases on [−1, −s] ∪ [1, +∞).

By a simple calculation, we get that

Φ_η(−1) = η V_θ,s(−1) = η (2θ + (1 − θ)(1 − s))  and  Φ_η(1) = (1 − η) V_θ,s(−1) = (1 − η)(2θ + (1 − θ)(1 − s)).

Then it follows immediately from 0 < η < 1/2 that Φ_η(−1) < Φ_η(1). Therefore, the minimum value must be attained at −1.

According to the above analysis, we have that

Hence, for any 0 ≤ η ≤ 1,

Let d_θ,s = 2θ + (1 − θ)(1 − s). Then

Now we are in a position to discuss H^−(η).

For η = 0 or η = 1, in a fashion similar to the case of s ≤ −1, we have that Φ_η(r) attains the minimum value Φ_η(0) = V_θ,s(0) = 1. For 0 < η ≤ 1/2, 2η − 1 ≤ 0, so the constraint condition implies that r ≥ 0. Φ_η(r) increases on [0, −s] ∪ [1, +∞) and decreases on [−s, 1], so Φ_η(r) has two local minimum values: Φ_η(0) = 1 and Φ_η(1) = (1 − η)V_θ,s(−1) = d_θ,s(1 − η). If η ≤ (d_θ,s − 1)/d_θ,s, then d_θ,s(1 − η) ≥ 1; otherwise, d_θ,s(1 − η) < 1. It can be concluded that

Combining (3.10) and (3.11), for any η ∈ [0,1], we have that

By (3.9) and (3.12), we get that

and, by using elementary identities that hold for any a, b ∈ R, it follows that

This completes the proof. □

For the loss function V_θ,s(r) defined as in (1.2), Figures 2 and 3 show the graphs of Φ_η(r) for some values of the homotopy parameters, while the graphs of Ψ(τ) and H(η) are shown in Figure 4.

Figure 2 Φ_η(r) with θ = 0.25; s = −1.5 in the left subfigure, s = −1 in the middle subfigure, and s = −0.5 in the right subfigure

Figure 3 Φ_η(r) with θ = 1; s = −1.5 in the left subfigure, s = −1 in the middle subfigure, and s = −0.5 in the right subfigure

Figure 4 Ψ(τ) and H(η) for the loss function V_θ,s with θ = 0.25 and s ≤ −1 or −1 < s ≤ 0; Ψ(τ) with θ = 0.25 and s ≤ −1 or s = −0.5 in the left subfigure, H(η) with θ = 0.25 and s ≤ −1 or s = −0.5 in the right subfigure

Letting Ψ be the function defined in Definition 3.3, according to the analysis of [37], the following lemma holds:

Lemma 3.6 ([37]) (1) For any measurable function f: X → R, any nonnegative loss function V, and any probability distribution on X × {−1, +1}, it holds that

Ψ( R(sgn(f)) − R(f_c) ) ≤ E^V(f) − inf_{g∈F} E^V(g).    (3.13)

Proposition 3.7 Let V_θ,s(r) be the loss function defined as in (1.2). Then, for any measurable function f: X → R, it holds that

Proof For any measurable function f: X → R, it is immediately apparent from (3.13) that

According to the results given in Lemma 3.5, for all τ ∈ (0,1] we have that Ψ(τ) > 0. By Lemma 3.4, it follows that the loss function V_θ,s(r) is classification-calibrated, and that Ψ is invertible on [0,1]. We now analyze the comparison inequality between R(sgn(f)) − R(f_c) and the excess generalization error in two cases, according to the value of the parameters.

Case 1: s ≤ −1. Since Ψ̃(τ) = |τ| is a convex function, by Lemma 3.6, we have that

Taking τ = R(sgn(f)) − R(f_c), it follows that, for s ≤ −1, we have

It is easy to see that, for θ = 1 or s → −∞, the comparison relation given by Proposition 3.7 is in accordance with the comparison relation (3.1) associated with the hinge loss.

3.2 The learning rates

In this subsection, we give results about the excess misclassification error of algorithm (2.3). Let us begin with some concepts regarding convex analysis (see [40]).

Let (H, ‖·‖_H) be a Hilbert space, and let F(f) be a function defined on a convex set S ⊂ H. We say that F is a quasiconvex function on S if

F(αf_1 + (1 − α)f_2) ≤ max{ F(f_1), F(f_2) }

holds for all f_1, f_2 ∈ S and any α ∈ [0,1].

If, for all f_1, f_2 ∈ S and any α ∈ [0,1], it holds that

F(αf_1 + (1 − α)f_2) ≤ αF(f_1) + (1 − α)F(f_2),

then we call F a convex function on S.

Letting B(f, ε) = {g ∈ H : ‖g − f‖_H < ε} be the ε-ball of the Hilbert space H, we call V: H → R locally Lipschitz near f ∈ H if, for some ε > 0, it holds that

|V(g_1) − V(g_2)| ≤ L ‖g_1 − g_2‖_H,  g_1, g_2 ∈ B(f, ε),

where L is called the Lipschitz constant, which depends upon f and ε.

A nonnegative continuous function V: R → [0,∞) is called a locally Lipschitz loss if, for all a ≥ 0, there exists a constant c_a > 0 such that

|V(r_1) − V(r_2)| ≤ c_a |r_1 − r_2|,  r_1, r_2 ∈ [−a, a].

Moreover, for a ≥ 0, the smallest such c_a is denoted by L_a. Finally, if we have sup_{a>0} L_a < +∞, then we say that V is Lipschitz continuous. If a = +∞, then we say that V(t) is a Lipschitz loss on R.

For the loss function V_θ,s(r) defined as in (1.2), it is easy to show that V_θ,s(r) is a quasiconvex function on R and that V_θ,s(r) is Lipschitz continuous with the Lipschitz constant L_θ,s = 1 for r > s and L_θ,s = θ for r ≤ s, i.e.,

|V_θ,s(r_1) − V_θ,s(r_2)| ≤ |r_1 − r_2| for r_1, r_2 > s,  and  |V_θ,s(r_1) − V_θ,s(r_2)| ≤ θ|r_1 − r_2| for r_1, r_2 ≤ s.
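Both properties can be checked numerically on a grid, as in the sketch below (θ = 0.25 and s = −0.5 are arbitrary choices): every sublevel set of V_θ,s should be an interval, and the slopes should respect the stated Lipschitz constants away from the kink at r = s.

```python
import numpy as np

theta, s = 0.25, -0.5
r = np.linspace(-4.0, 4.0, 8001)
hinge = np.maximum(0.0, 1.0 - r)
V = np.where(r > s, hinge, theta * hinge + (1.0 - theta) * (1.0 - s))

# Quasiconvexity: every sublevel set {r : V(r) <= c} should be a single interval.
for c in np.linspace(V.min(), V.max(), 7):
    level = V <= c
    rising_edges = int(level[0]) + np.maximum(np.diff(level.astype(int)), 0).sum()
    assert rising_edges <= 1          # at most one contiguous block of True values

# Lipschitz constants: slope magnitude at most 1 for r > s and at most theta for r < s.
slopes = np.diff(V) / np.diff(r)
assert np.all(np.abs(slopes[r[:-1] > s]) <= 1.0 + 1e-9)
assert np.all(np.abs(slopes[r[1:] < s]) <= theta + 1e-9)
print("quasiconvexity and Lipschitz bounds verified on the grid")
```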

Definition 3.8 Let (H, ‖·‖_H) be a Hilbert space, and let F(f) be a quasiconvex function from H to R ∪ {±∞}. We call D⁺F(f, h) the right-sided Gâteaux directional derivative of F at f ∈ H with respect to h if there exists D⁺F(f) ∈ H such that

D⁺F(f, h) := lim_{t→0⁺} ( F(f + th) − F(f) ) / t,  h ∈ H.

Moreover, D⁺F(f, h) can be written as

D⁺F(f, h) = ⟨D⁺F(f), h⟩_H.

It is easy to see that if F is Lipschitz continuous on H, then, for all h ∈ H, we have that

|D⁺F(f, h)| ≤ L ‖h‖_H,

where L is the Lipschitz constant, i.e.,

‖D⁺F(f)‖_H ≤ L.

By the definition of D⁺F(f), we know that if F(f) attains its minimal value at f_0 ∈ H, then

D⁺F(f_0, h) ≥ 0 for all h ∈ H, that is, D⁺F(f_0) = 0.
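Definition 3.8 can also be explored numerically. The sketch below approximates the right-sided directional derivative D⁺F(f, h) by a one-sided difference quotient, applied to a regularized empirical objective written in the representer parametrization; the data, kernel, loss parameters, and step size are all illustrative assumptions. At a point where F happens to be differentiable in the direction h, the two printed values are (approximately) negatives of each other.

```python
import numpy as np

def robust_loss(r, theta=0.25, s=-1.0):
    hinge = np.maximum(0.0, 1.0 - r)
    return np.where(r > s, hinge, theta * hinge + (1.0 - theta) * (1.0 - s))

def right_directional_derivative(F, f, h, t=1e-6):
    """One-sided quotient approximating D^+F(f, h) = lim_{t -> 0+} (F(f + t h) - F(f)) / t."""
    return (F(f + t * h) - F(f)) / t

# Illustrative regularized empirical objective, with f = sum_j alpha_j K(x_j, .) so that
# ||f||_K^2 = alpha^T K alpha (all data below are synthetic).
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
y = np.where(rng.random(10) < 0.5, -1.0, 1.0)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
lam = 0.1

def objective(alpha):
    return robust_loss(y * (K @ alpha)).mean() + lam * alpha @ K @ alpha

alpha = rng.normal(size=10)
h = rng.normal(size=10)
print(right_directional_derivative(objective, alpha, h))
print(right_directional_derivative(objective, alpha, -h))
```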

We now state the main result of this paper.

Theorem 3.9 Let (H_K, ‖·‖_K) be the reproducing kernel Hilbert space associated with the kernel K_x(·) = K(x,·), let V_θ,s(r) be the loss function defined as in (1.2), and let the generalization error E^{θ,s}(f) and the misclassification error R(f) be defined as above. Then, for any δ ∈ (0,1), with confidence 1 − δ, it holds that

Remark 3.10 The K-functional above denotes the approximation error. By the Lipschitz continuity of V_θ,s, we know that there exists a constant C > 0 such that it satisfies

The decay rate of the right-hand side of (3.23) may be described by a modulus of smoothness ([33]), and its convergence is determined by the capacity of H_K (see [41, 42]).

Remark 3.11 If the parameters θ and λ are chosen suitably as m → +∞, then, for a given δ ∈ (0,1), with confidence 1 − δ, we have that

and the convergence rate can be controlled by choosing the homotopy parameters properly.

Remark 3.12 When s → −∞, we have that |I|_s → m and |O|_s → 0. Therefore, the learning rates given above have the homotopy continuous property.

4 Proofs

Now we are in a position to prove Theorem 3.9. We need to prove the following lemmas first:

Lemma 4.1 Let F(f): H_K → R ∪ {±∞} be a convex function. Then, for any f, g ∈ H_K, it holds that F(f) − F(g) ≥ D⁺F(g, f − g) = ⟨D⁺F(g), f − g⟩_{H_K}.

Proof Since F(f) is a convex function, for any f, g ∈ H_K and α ∈ [0,1], it holds that F(αf + (1 − α)g) ≤ αF(f) + (1 − α)F(g). This implies that F(g + α(f − g)) ≤ F(g) + α(F(f) − F(g)), so we have that

( F(g + α(f − g)) − F(g) ) / α ≤ F(f) − F(g).    (4.1)

Letting α → 0⁺ on the left-hand side of inequality (4.1), our conclusion follows. □

Lemma 4.2 Let (H_K, ‖·‖_K) be the reproducing kernel Hilbert space associated with the kernel K_x(·) = K(x,·), and let V_θ,s(r) be the loss function defined as in (1.2). Then, for any h ∈ H_K, it holds that

Proof By the definition of the right-sided directional derivative, we have that

The reproducing property of H_K yields that

Substituting the above formula into (4.4), we have that

This proves (4.2). Using the same argument as above, we obtain equation (4.3). The proof is complete. □

By the same token, we have that

Since Ω(f) and Ω_z(f) are also strictly convex functions, we have that

Lemma 4.3 Let the two quantities be defined as in (4.7) and (4.8), respectively. Then,

Proof Combining the definitions in (4.7) and (4.8) with Lemma 4.1, we have that

Combining (4.11) with (4.2) and (4.3), we have that

We have thus proven (4.9). □

Lemma 4.4 ([43]) Let ξ be a random variable taking values in a real separable Hilbert space H on a probability space (Ω, F, P). Assume that there is a positive constant L such that ‖ξ‖_H ≤ L almost surely.

Then, for all n ≥ 1 and 0 < δ < 1, it holds that

The next lemma gives a bound for the sample error.

Lemma 4.5 Let the two quantities be defined as in (4.7) and (4.8), respectively. Then, for any δ ∈ (0,1), it holds, with confidence 1 − δ, that

Proof According to Lemma 4.3, it follows that

Then, (4.12) implies that

where we have used the fact that |D⁺[1 − ·]_+| ≤ 1.

According to Lemma 4.4, for any δ ∈ (0,1), with confidence 1 − δ/2, it holds that

By the same token, for any δ ∈ (0,1), with confidence 1 − δ/2, we have that

Combining (4.13) and (4.15) with (4.16) yields our conclusion. □

A bound for the generalization error is provided in the following lemma:

Lemma 4.6 Let V_θ,s(r) be the loss function defined as in (1.2), and let the relevant quantities be defined as above. Then, for any δ ∈ (0,1), with confidence 1 − δ, it holds that

Proof Applying the Lipschitz continuity of the loss function V_θ,s, with the Lipschitz constant satisfying L_θ,s ≤ 1, and according to Lemma 4.5, we obtain that

It is now obvious that the lemma holds. □

For any δ ∈ (0,1), by Lemma 4.6 and Proposition 3.7, with confidence 1 − δ, it holds that

Thus, we have completed the proof of Theorem 3.9. □