靳燕 彭新光
摘要:
為进一步弱化数据不均衡对分类算法的束缚,从数据集区域分布特性着手,提出了不均衡数据集上基于子域学习的复合分类模型。子域划分阶段,扩展支持向量数据描述(SVDD)算法给出类的最小界定域,划分出域内密集区与域外稀疏区。借鉴不同类存在相似样本的类重叠概念,对边界样本进行搜索,组合构成重叠域。子域清理阶段,基于邻近算法(KNN)的邻近性假设,结合不同域的密疏程度,设置出样本有效性参数,对域内样本逐个检测以清理噪声。各子域隔离参与分类建模,按序组合产生出用于不均衡数据集的复合分类器CCRD。实验设计三类比较方案(相似算法比较、代价敏感MetaCost比较和SMOTE抽样比较),前两类比较中,CCRD对正类的正确分类改善明显,且未加重负类误判;后一类比较中,改善了负类的误判情形,且未影响正类的正确分类。在五类数据集的逐个比较中,CCRD分类性能均有提升,在Haberman_sur的正类分类性能提升上尤为明显。结果表明,基于子域学习的复合分类模型的分类性能较好,是一种研究不均衡数据集的较有效的方法。
为进一步弱化数据不均衡对分类算法的束缚,从数据集区域分布特性着手,提出了不均衡数据集上基于子域学习的复合分类模型。子域划分阶段,扩展支持向量数据描述(SVDD)算法给出类的最小界定域,划分出域内密集区与域外稀疏区。借鉴不同类存在相似样本的类重叠概念,对边界样本进行搜索,组合构成重叠域。子域清理阶段,基于邻近算法(KNN)的邻近性假设,结合不同域的密疏程度,设置样本有效性参数,对域内样本逐个检测以清理噪声。各子域隔离参与分类建模,按序组合产生出用于不均衡数据集的复合分类器CCRD。在相似算法对比以及代价敏感MetaCost对比中,CCRD对正类的正确分类改善明显,且未加重负类误判;在SMOTE抽样比较中,CCRD改善了负类的误判情形,且未影响正类的正确分类;在五类数据集的逐个比较中,CCRD分类性能均有提升,在Haberman_sur的正类分类性能提升上尤为明显。结果表明,基于子域学习的复合分类模型的分类性能较好,是一种研究不均衡数据集的较有效的方法。
关键词:
不均衡数据集区域分布;支持向量数据描述;稀疏域与重叠域;子域隔离学习;复合分类器
中图分类号:
TP391
文献标志码:A
Abstract:
Started with the regional distribution characteristics, a composite classification model learned on multiple isolated subdomains was proposed to further study the class imbalance problem. In the subdomains division stage, each class was described as ultrasmall spheres by improved Support Vector Data Description (SVDD) algorithm, then class domain was divided into intensive and sparse domains. Some instances were founded out from the boundaries of classes and composed of class overlapping domains. In the subdomains cleanup stage, according to sample availability parameters related to domain tightness, noise data was cleaned up by improved KNearest Neighbor (KNN). After combining classifiers sequentially which were learned on isolated subdomains, the Composite Classification model (CCRD) was generated. The classification performance of CCRD is compared with those algorithms' which are divided into three groups of similar ones, MetaCost and SMOTE. CCRD could classify more positive instances than the top two groups and more negative instances than the last group, and the improvement doesn't affect the classification of the other class. The results indicate that the composite classification model learned on multiple isolated subdomains could classify instances more accurately and could be an effective method for imbalance class.
In the comparison with similar algorithms including SVM (Support Vector Machine), KNN, C4.5 and MetaCost, CCRD can obviously improve the accuracy of positive instances without increasing mistake of negative instances; in the comparison with SMOTE (Synthetic Minority Oversampling TEchnique) sampling, CCRD can improve the misjudgement of negative instances without affecting the classification of the positive instances; in the experiments on five datasets, the classification performance of CCRD is also improved, especially in Haberman_sur. Experimental results indicate that the composite classification model learned on multiple isolated subdomains has excellent classification capability, and it is an effective method for inbalanced dataset.
英文关键词Key words:
regional distribution of imbalanced class; Support Vector Data Description (SVDD); sparse and overlapping domains; leaning classifiers on multiple isolated subdomains; Composite Classification model (CCRD)
0引言
不均衡数据集中,按类样本出现概率的大与小定义多数类与少数类。少数类虽出现概率极小但往往又极其重要。如:通信电话中的骚扰电话、被诊患者中的癌症患者、网络用户行为中的攻击行为、卫星图片中的油井图片等。抽样是使不均衡数据集均衡化的常用方法,增加少数类样本以减少类间偏斜的方法称为over_sampling(向上采样) [1]。文献[2]提出SMOTE(Synthetic Minority Oversampling TEchnique)算法,在“近距离少数类样本间仍为同类样本”的假设基础上生成人工少数类样本;文献[3]选用遗传算法优化SMOTE参数,以得到适合应用数据集的取样倍数;文献[4]提出BorderlineSMOTE方法,将人工合成样本限于少数类样本的边界处。减少多数类样本以降低类间偏斜的方法称为under_sampling(向下采样)[5]。最早由Tomek提出的Tomek links方法按距离寻找分属两类的Tomek links对,并删除其中的多数类样本;
H. Alhammady和K. Ramamohanarao提出的EPRC(Emerging Patterns for Rareclass Classification)算法提供三个改进步骤来克服EP算法在少数类分类上的不足[6]。目前,不少研究将两种抽样相结合[7-8],以优势互补。
分析上述两类抽样方法,均以改变样本分布为出发点。文献[9]选用了10种分类方法对13个数据集进行分类性能分析,认为数据集的不均衡不仅体现在样本数量的偏差上,类间重叠亦对分类性能产生影响。文献[10]使用人工数据较系统地分析了类间重叠与类不均衡的关联程度,并有望在实际数据集中进一步作关联分析。文献[11]尝试通过K近邻(KNearest Neighbor, KNN)方法来寻找类间重叠区域,距离查找法易受到“维灾”效应影响,且当重叠子域较多时,该方法在有效发现各类边界特征时性能较差。
本文在相关文献的研究[9-12]基础上,从数据集区域密疏分布特性着手,引入最小超球体进行类描述,界定出密集与稀疏域,按概率定义找出类边界的重叠样本,将样本空间划分为密集、重叠与稀疏三个子域。不同于传统分类方式,本文将各子域隔离参与分类建模,分类模型按序组合构成复合分类器。实验结果比较表明,本文设计的复合分类器的分类性能提升较明显,总体较稳定。
1样本空间密疏域隔离方法设计
类边界区域中,存在与其他类极为相似的样本,该类样本往往会被误分类,在覆盖稀疏小样本的分类学习中尤其明显,类边界会因此类样本而发生偏移,误分更为严重。
当不同类的样本极其相似时,类重叠就会产生。选用概率法定义类重叠:样本x均以大于0的概率落在至少两个类的界定内,则样本x位于重叠域O。即:x,x∈O,p(Cm|x)>0且p(Cn|x)>0。
图1的(a)和(b)分别表示没有重叠和有重叠的情形。重叠造成两类边界模糊,为分类带来困难。若将类的边界控制在样本密集区,并与稀疏区域相隔离,类重叠可大大降低。
1)与单个分类算法的性能比较。
本文设计的分类算法是三类算法的扩展与组合,在同一数据集上,依次单个应用三类算法(支持向量机(Support Vector Machine, SVM)、KNN(K=5)、C4.5)和本文算法CCRD。实验分类学习产生的统计结果见表3。
同一数据集的比较中,CCRD算法的TP值为最大,对正类样本的正确分类数最多;TP与TN存在互约关系,CCRD的TN值虽非最大,但与最大值较为接近。CCRD算法修正了SVM在Haberman_sur上的“正类全部错判”,同时对负类仍有很好的分类效果。从实验结果得出:CCRD算法作为三类算法的扩展与组合,总体分类效果得到了较大提升,分类性能要优于单个算法的性能。
2)与单个分类算法在SMOTE集下的分类性能比较。
SMOTE算法通过生成少数类样本来降低类间不均衡性,是从改变样本分布角度来研究不均衡问题的。本文所提方法则充分分析了数据集分布子域特点,并依据该特点选用合理算法进行分类学习。与SMOTE相比,是同一问题的不同角度的设计。因此,设计实验比较SMOTE集上的相近算法与本文算法的性能。
在同一数据集上,应用SMOTE(倍数为2)进行预处理,并依次单个应用三类算法(SVM、KNN(K=5)、C4.5)进行分类学习,统计结果见表4。
5结语
数据集的不均衡对分类算法性能影响较大,其不均衡性不仅体现在样本数量悬殊上,也体现在样本分布的密疏不均与类重叠上。本文从数据样本分布的密疏程度出发,进行样本区域分隔,根据不同域的分布特点,针对性选用不同分类方法进行学习,多个分类模型按序组合成复合分类器。从多组数据集的不同算法性能对比分析中得出,本文所用分类器在给定的评价性能指标中,总体提升明显,且受数据集特性影响较小,总体性能较稳定。
本文所提方法中,擴展SVDD进行区域划分时,实验选择了一种核函数,样本去噪时的近邻集计算是按固定K值进行的,并据此给出定值βK。方法实现中涉及的相关参数等在今后研究中,尝试以优值选取方式进一步分析给定,以将本文方法推广至多分类问题和大数据集的有效应用中。
參考文献:
[1]
ABDI L, HASHEMI S. To combat multiclass imbalanced problems by means of oversampling and boosting techniques [J]. Knowledge and Data Engineering, IEEE Transactions on, 2015, 19(12): 3369-3385.
ABDI L, HASHEMI S. To combat multiclass imbalanced problems by means of oversampling and boosting techniques [J]. Soft Computing, 2015, 19(12): 3369-3385.
[2]
VERBIEST N, RAMENTOL E, CORNELIS C, et al. Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection [J]. Applied Soft Computing, 2014, 22(5): 511-517.
[3]
霍玉丹,谷琼,蔡之华,等.基于遗传算法改进的少数类样本合成过采样技术的非平衡数据集分类算法[J].计算机应用,2015,35(1):121-124.(HUO Y D, GU Q, CAI Z H, et al. Classification method for imbalance based on genetic algorithm improved synthetic minority oversampling technique [J]. Journal of Computer Applications, 2015,35(1):121-124.)
[4]
WANG K J, ADRIAN A M, CHEN K H, et al. A hybrid classifier combining borderlineSMOTE with AIRS algorithm for estimating brain metastasis from lung cancer: a case study in Taiwan [J]. Computer Methods and Programs in Biomedicine, 2015, 119(2): 63-76.
[5]
YU H, NI J, ZHAO J. ACOSampling: an ant colony optimizationbased undersampling method for classifying imbalanced DNA microarray data [J]. Neurocomputing, 2013, 101(3): 309-318.
[6]
GARCABORROTO M, MARTNEZTRINIDAD J F, CARRASCOOCHOA J A. A survey of emerging patterns for supervised classification [J]. Artificial Intelligence Review, 2014, 42(4): 705-721.
[7]
陈睿,张亮,杨静,等. 基于BSMOTE和逆转欠抽样的不均衡数据分类算法[J]. 计算机应用研究,2014,31(11):3299-3303.(CHEN R, ZHANG L, YANG J, et al. Classification algorithm for imbalanced data sets based on combination of BSMOTE and inverse under sampling [J]. Application Research of Computers, 2014,31(11):3299-3303.)
[8]
GALAR M, FERNNDEZ A, BARRENECHEA E, et al. A review on ensembles for the class imbalance problem: bagging, boosting, and hybridbased approaches [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(4): 463-484.
[9]
GARCA V, SNCHEZ J S, MOLLINEDA R A. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance [J]. KnowledgeBased Systems, 2012, 25(1): 13-21.
[10]
ALEJO R, VALDOVINOS R M, GARCA V, et al. A hybrid method to face class overlap and class imbalance on neural networks and multiclass scenarios [J]. Pattern Recognition Letters, 2013, 34(4): 380-388.
[11]
BECKMANN M, EBECKEN N F F, DE LIMA B S L P. A KNN undersampling approach for data balancing [J]. Journal of Intelligent Learning Systems and Applications, 2015, 7(4): 104-116.
[12]
熊海涛,吴俊杰,刘洪甫,等.分类中的类重叠问题及其处理方法研究[J].管理科学学报,2013,16(4):8-21.(XIONG H T, WU J J, LIU H P, et al. Towards classification with class overlapping [J]. Journal of Management Sciences in China, 2013,16(4):8-21.)
[13]
KHAZAI S, SAFARI A, MOJARADI B, et al. Improving the SVDD approach to hyperspectral image classification [J]. IEEE Geoscience and Remote Sensing Letters, 2012, 9(4): 594-598.
[14]
蔣盛益,苗邦,余雯.基于一趟聚类的不平衡数据下抽样算法[J].小型微型计算机系统,2012,33(2):232-236.(JIANG S Y, MIAO B, YU W. Undersampling method based on onepass clustering for imbalanced data distribution [J]. Journal of Chinese Computer Systems, 2012, 33(2): 232-236.)
[15]
李雄飞,李军,董元方,等.一种新的不平衡数据学习算法PCBoost [J]. 计算机学报,2012,35(2):202-209.(LI X F, LI J, DONG Y F, et al. A new learning algorithm for imbalanced dataPCBoost [J]. Chinese Journal of Computers, 2012, 35(2): 202-209.)
[16]
曹鹏,李博,栗伟,等.基于粒子群优化的不均衡数据学习[J].计算机应用,2013,33(3):789-792.(CAO P, LI B, LI W, et al. Imbalanced data learning based on particle swarm optimization [J]. Journal of Computer Applications, 2013, 33(3): 789-792.)