吴疆 董婷 蒋平
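Abstract:
The semisupervised learning method Laplace support vector machine (LapSVM) is applied to predict protein structural classes. First, seven amino acid physicochemical property parameters are used as a substitution model to convert protein sequences into numeric sequences; the auto covariance (AC) transform describes the interactions between amino acid residues separated by a given distance and converts the numeric sequences into vectors of uniform length, which form the feature space of the samples. Then 20, 50, 80, 110, 140, and 170 samples are randomly selected from the dataset as unlabelled samples to construct the training sets, and the one-against-all decomposition strategy and the leave-one-out method are used to evaluate the predictive ability of the LapSVM model. The classifier predicts the structural class of the protein samples with an accuracy of 94.12%, which is clearly competitive with the 90.69% accuracy of the standard support vector machine (SVM). The experimental results confirm that the distribution information of unlabelled samples, used as weak rules, can effectively improve the predictive performance of the classifier. They also suggest a novel idea: applying a semisupervised method to a fully supervised learning problem, which gives a smaller optimization scale and better predictive ability.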
Key words:
semisupervised learning; protein structural class; Laplace support vector machine; auto covariance transform
CLC number: TP 391
Document code: A
Protein Structural Class Prediction Using Laplace Support
Vector Machine Based on a Semisupervised Method
WU Jiang1, DONG Ting1, JIANG Ping1,2
(1. Department of Information Engineering, Yulin University, Yulin, Shaanxi 719000, China;
2. School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China)
Abstract:
The purpose of this study is to predict protein structural classes using the Laplace support vector machine (LapSVM), a novel semisupervised learning method. First, seven amino acid physicochemical properties taken from the literature were applied to transform the protein sequences into numeric sequences, and auto covariance (AC) was used to transform these numeric sequences into feature vectors of the same size, which are suitable for training models. AC captures the neighboring effects and the interactions between residues a certain distance apart in a protein sequence. Second, 20, 50, 80, 110, 140, and 170 samples were randomly selected as unlabelled samples to construct the training datasets, and the "one-against-all" strategy and the leave-one-out method were employed to estimate performance. A prediction accuracy of 94.12% was obtained, which is very promising compared with the 90.69% accuracy of the standard support vector machine (SVM). The experimental results show that the distribution information of unlabelled samples, used as weak rules, can effectively improve prediction performance. They also suggest a novel idea: using a semisupervised method to solve a supervised learning problem, which leads to a smaller optimization scale and higher prediction accuracy.
Key words:
semisupervised learning; protein structural class; Laplace support vector machine; auto covariance
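
The auto covariance step described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition AC(lag, j) = 1/(n - lag) * sum over i of (x[i][j] - mean_j) * (x[i+lag][j] - mean_j), which the abstract does not state explicitly. The names `auto_covariance`, `encode`, and `TOY_SCALE`, the two-property scale, and its values are hypothetical placeholders, not the seven properties used in the paper.

```python
def auto_covariance(seq_props, max_lag):
    """Map an n x p list of per-residue property values to a
    fixed-length vector of max_lag * p auto covariance terms.

    Assumed formula (not given in the abstract):
    AC(lag, j) = sum_i (x[i][j] - mean_j) * (x[i+lag][j] - mean_j) / (n - lag)
    """
    n = len(seq_props)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    p = len(seq_props[0])
    # Mean of each property over the whole sequence.
    means = [sum(row[j] for row in seq_props) / n for j in range(p)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(p):
            total = sum(
                (seq_props[i][j] - means[j]) * (seq_props[i + lag][j] - means[j])
                for i in range(n - lag)
            )
            features.append(total / (n - lag))
    return features


# Hypothetical two-property scale with placeholder values (illustration only).
TOY_SCALE = {
    "A": (0.6, 0.1),
    "G": (0.5, 0.0),
    "L": (1.1, 0.2),
    "K": (-1.5, 0.2),
}


def encode(sequence, scale):
    """Replace each residue letter with its tuple of property values."""
    return [scale[aa] for aa in sequence]


short_vec = auto_covariance(encode("AGLKA", TOY_SCALE), max_lag=3)
long_vec = auto_covariance(encode("AGLKAGLKKLGA", TOY_SCALE), max_lag=3)
```

Both `short_vec` and `long_vec` have `max_lag * p` entries regardless of sequence length, which is what allows proteins of different lengths to share one feature space before a classifier is trained.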