靳燕 彭新光
Starting from the regional distribution characteristics of the classes, a composite classification model learned on multiple isolated subdomains was proposed to further study the class imbalance problem. In the subdomain division stage, each class was described by minimal hyperspheres using an improved Support Vector Data Description (SVDD) algorithm, and the class domain was then divided into dense and sparse domains. Instances found on the class boundaries made up the class-overlapping domains. In the subdomain cleanup stage, noisy data were removed by an improved K-Nearest Neighbor (KNN) method according to sample availability parameters related to domain tightness. The Composite Classification model (CCRD) was generated by sequentially combining the classifiers learned on the isolated subdomains. The classification performance of CCRD was compared with three groups of algorithms: similar algorithms, MetaCost and SMOTE. CCRD classified more positive instances correctly than the first two groups and more negative instances correctly than the last group, and in each case the improvement did not degrade the classification of the other class. The results indicate that the composite classification model learned on multiple isolated subdomains classifies instances more accurately and can be an effective method for class-imbalanced data.
In the comparison with similar algorithms, including SVM (Support Vector Machine), KNN, C4.5 and MetaCost, CCRD clearly improves the accuracy on positive instances without increasing the misclassification of negative instances; in the comparison with SMOTE (Synthetic Minority Oversampling TEchnique) sampling, CCRD reduces the misclassification of negative instances without affecting the classification of positive instances; in the experiments on five datasets, the classification performance of CCRD is also improved, especially on Haberman_sur. The experimental results indicate that the composite classification model learned on multiple isolated subdomains has strong classification capability and is an effective method for imbalanced datasets.
Key words:
regional distribution of imbalanced classes; Support Vector Data Description (SVDD); sparse and overlapping domains; learning classifiers on multiple isolated subdomains; Composite Classification model (CCRD)
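In an imbalanced dataset, the majority class and the minority class are defined by whether the samples of a class occur with high or low probability. The minority class occurs with very low probability but is often extremely important, for example nuisance calls among telephone calls, cancer patients among diagnosed patients, attacks among network user behaviors, and oil-well images among satellite images. Sampling is a common way to balance an imbalanced dataset. Methods that add minority-class samples to reduce the skew between classes are called over_sampling [1]. Reference [2] proposed the SMOTE (Synthetic Minority Oversampling TEchnique) algorithm, which generates synthetic minority-class samples under the assumption that "points between nearby minority-class samples still belong to the same class"; reference [3] used a genetic algorithm to optimize the SMOTE parameters so as to obtain a sampling rate suited to the dataset at hand; reference [4] proposed the BorderlineSMOTE method, which restricts the synthetic samples to the boundary of the minority class. Methods that remove majority-class samples to reduce the skew between classes are called under_sampling [5]. The Tomek links method, first proposed by Tomek, searches by distance for Tomek link pairs whose two members belong to different classes and deletes the majority-class sample of each pair; the EPRC (Emerging Patterns for Rareclass Classification) algorithm proposed by H. Alhammady and K. Ramamohanarao provides three improvement steps to overcome the weakness of the EP algorithm in minority-class classification [6]. At present, many studies combine the two kinds of sampling [7-8] so that their strengths complement each other.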
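To make the two sampling ideas concrete, the following is a minimal sketch in plain Python (3.8+). It is not the implementation used in references [2] or [5]: the function names `smote_like` and `tomek_links`, the parameter `k`, and the toy data are illustrative assumptions. The first function interpolates between a minority sample and one of its nearest minority neighbors; the second finds pairs of mutual nearest neighbors that carry different labels.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points (hypothetical helper).

    Each point lies on the segment between a randomly chosen minority
    sample and one of its k nearest minority neighbors.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and belong to different classes (hypothetical helper)."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Toy data, invented for illustration only.
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.1, 2.0)]
minority = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0)]
points = majority + minority
labels = ["neg"] * len(majority) + ["pos"] * len(minority)

new_minority = smote_like(minority, n_new=4)
links = tomek_links(points, labels)
```

On this toy data the only Tomek link is the pair of indices (4, 5), that is, the majority point (2.1, 2.0) and the minority point (2.0, 2.0); an under-sampling step in the spirit of Tomek links would delete the majority member of that pair.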
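Both kinds of sampling methods discussed above take changing the sample distribution as their starting point. Reference [9] analyzed the classification performance of 10 classification methods on 13 datasets and concluded that the imbalance of a dataset is reflected not only in the skew of the sample counts: overlap between classes also affects classification performance. Reference [10] used synthetic data to analyze fairly systematically the degree of association between class overlap and class imbalance, and the association analysis is expected to be carried further on real datasets. Reference [11] attempted to find the class-overlapping regions with the K-Nearest Neighbor (KNN) method; however, distance-based search is susceptible to the "curse of dimensionality", and when there are many overlapping subdomains the method performs poorly at discovering the boundary features of each class.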
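One simple way to look for class-overlapping regions with nearest neighbors can be sketched as follows. This is an illustrative assumption about how such a search might work, not necessarily the procedure of reference [11]: the function name `find_overlap_instances`, the rule "at least one of the k nearest neighbors has a different label", and the toy data are all hypothetical. Plain Python (3.8+) is used.

```python
import math


def find_overlap_instances(points, labels, k=3):
    """Return indices of instances whose k nearest neighbors include
    at least one instance of a different class (hypothetical rule)."""
    overlap = []
    for i, p in enumerate(points):
        # Indices of all other instances, ordered by Euclidean distance to p.
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        if any(labels[j] != labels[i] for j in order[:k]):
            overlap.append(i)
    return overlap


# Toy data, invented for illustration: two classes along a line that
# touch around x = 3.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (3.5, 0.0), (5.0, 0.0), (6.0, 0.0), (7.0, 0.0)]
labels = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"]

overlap = find_overlap_instances(points, labels, k=2)
```

With `k=2` the flagged indices on this toy data are 3 and 4, the two instances that face each other across the class boundary. Because the rule relies entirely on pairwise distances, it inherits the sensitivity to high dimensionality noted above.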
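The classification algorithm designed in this paper is an extension and combination of three kinds of algorithms. On the same dataset, the three algorithms (Support Vector Machine (SVM), KNN (K=5) and C4.5) and the proposed algorithm CCRD were each applied in turn. The statistics obtained from the classification experiments are shown in Table 3.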
