Fourier Spectrum Analysis of Thresholds in Gene Prediction Algorithms

2014-07-02 19:06 LIU Ping et al.
Hubei Agricultural Sciences, 2014, Issue 6

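Abstract: In protein-coding region prediction, the choice of threshold has a non-negligible effect on the prediction results. This study proposes the normalized power spectral density as the threshold for discriminating between the coding and non-coding regions of a DNA sequence. With an FIR (finite impulse response) narrow pass-band filter (NPBF) as the core of the coding-region prediction algorithm, the DNA sequence sets HMR195 and ALLSEQ as test sets, and the nucleotide-level approximate correlation (AC) as the measure of prediction accuracy, the predictions of the proposed method were compared with those of existing methods. The results show that the new threshold gives the highest prediction accuracy and that the resulting algorithm is simple and intuitive.

Keywords: protein-coding region prediction; narrow pass-band filter; normalized power spectral density; signal-to-noise ratio; approximate correlation

CLC number: TP391.9; TN713  Document code: A  Article ID: 0439-8114(2014)06-1432-04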

Analysis on Threshold Used in Gene Prediction Algorithm Based on Fourier Spectrum

LIU Ping1,MA Yu-tao1,SUN Xue-hong1,ZHANG Cheng1,DU Yong2

(1.School of Physics & Electrical Information Engineering/Ningxia Key Laboratory of Intelligent Sensing for Desert Information, Ningxia University,Yinchuan 750021,China;2.Department of Pediatric Surgery,General Hospital of Ningxia Medical University,Yinchuan 750004,China)

Abstract: Threshold selection in protein coding region prediction algorithms has an important influence on prediction accuracy. In this paper, a new threshold, the normalized value of the power spectral density, was proposed to differentiate protein coding regions from non-coding regions. Using the FIR (finite impulse response) NPBF (narrow pass-band filter) as the kernel of the prediction algorithm and taking the DNA sequence data sets HMR195 and ALLSEQ as the test sets, the prediction results of the NPBF algorithm with the new threshold were compared with those of the same algorithm using two other thresholds. The results were discussed with the AC (approximate correlation) used as the base-level measure of prediction accuracy. The proposed threshold was the best choice, giving a higher AC with a smaller amount of computation.

Key words: protein coding regions prediction; narrow pass-band filter; normalized value of power spectrum density; ratio of signal to noise; approximate correlation

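Protein-coding region prediction provides important guidance for the annotation and labeling of DNA sequences[1-3]. Among existing prediction algorithms, the SDFT (sliding discrete Fourier transform) algorithm proposed by Tiwari et al.[4] uses the signal-to-noise ratio RSN (ratio of signal to noise) as the threshold separating coding from non-coding regions; Mena-Chalco et al.[5] use the predicted non-coding ratio (PNCR) as the threshold; and Ambikairajah et al.[6] and Akhtar et al.[7] normalize the amplitude of the curve when plotting the PSD (power spectral density) of a DNA sequence. Given these two different threshold choices, it remains to be determined which one gives the better results in gene prediction and whether a still better threshold exists.

This study proposes the power spectral density normalized by its maximum value (PSDN) as the threshold separating coding from non-coding regions. The FIR (finite impulse response) NPBF (narrow pass-band filter) protein-coding region prediction algorithm serves as the platform[8,9], the DNA sequence sets HMR195[10] and ALLSEQ[11] serve as test sets, and Sn (sensitivity), Sp (specificity), FPR (false positive rate), AC (approximate correlation) and CC (correlation coefficient) serve as measures of the prediction results[11]. The prediction results obtained with RSN, PNCR and PSDN as thresholds are compared, providing a reference for threshold selection in stand-alone gene prediction algorithms.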

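1 Materials and Methods

1.1 Materials

The gene sequence AB003730 (one of the DNA sequences in the HMR195 set) is used as the reference sequence for comparing the coding-region predictions obtained with the three thresholds above. The ALLSEQ and HMR195 DNA sequence sets are used to verify that the effect of threshold choice on the prediction results holds generally.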

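1.2 NPBF gene prediction algorithm

The coding-region prediction algorithm based on an FIR narrow pass-band filter consists of the following main steps: ① map the DNA sequence to numerical sequences (signals) with the Voss method; ② filter the numerical signals from the previous step with the FIR narrow pass-band filter, removing components whose period is not 3; ③ compute the power spectral density (PSD) of the signal; ④ apply moving-average filtering and amplitude normalization to the PSD curve; ⑤ classify the DNA sequence using the non-coding ratio as the threshold, identifying its coding and non-coding regions, and report the prediction results with one or more accuracy measures.

The Voss method maps a DNA sequence composed of the bases adenine (A), thymine (T), cytosine (C) and guanine (G) to the numerical sequences x_l[n], l ∈ {A,T,C,G}[1-9]. Passing these through the FIR narrow pass-band filter yields the period-3 signals y_l[n], l ∈ {A,T,C,G}. The power spectral density of the DNA coding signal is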

$$\mathrm{PSD}[n]=\frac{1}{N}\sum_{l}\left|y_l[n]\right|^2,\quad l=A,T,C,G;\ n=1,\dots,L \tag{1}$$

where N is the length of the FIR filter and L is the length of the DNA sequence.
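The mapping, filtering and PSD steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the filter is a Hamming-windowed cosine centred on frequency 1/3, chosen here only because it is easy to write down (it is not the APNPBF design used in the paper), the 1/N scale in the PSD is an assumption that cancels once the curve is normalized, and the names `voss_map`, `narrow_band_fir`, `fir_same` and `psd_track` are hypothetical.

```python
import math

BASES = "ATCG"

def voss_map(seq):
    """Voss mapping: one 0/1 indicator sequence per base."""
    seq = seq.upper()
    return {b: [1.0 if s == b else 0.0 for s in seq] for b in BASES}

def narrow_band_fir(num_taps):
    """Hamming-windowed cosine centred on frequency 1/3 (period 3).

    Illustrative design only (assumption); num_taps should be odd and >= 3.
    """
    centre = (num_taps - 1) / 2.0
    window = [0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
              for k in range(num_taps)]
    scale = 2.0 / sum(window)  # roughly unit gain at frequency 1/3
    return [scale * window[k] * math.cos(2 * math.pi * (k - centre) / 3.0)
            for k in range(num_taps)]

def fir_same(x, h):
    """Zero-padded FIR filtering with the output aligned to the input."""
    half = (len(h) - 1) // 2
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            j = n + half - k
            if 0 <= j < len(x):
                acc += hk * x[j]
        y.append(acc)
    return y

def psd_track(seq, num_taps=21):
    """PSD[n] = (1/N) * sum over the four bases of y_l[n]^2 (scale assumed)."""
    h = narrow_band_fir(num_taps)
    filtered = [fir_same(x, h) for x in voss_map(seq).values()]
    return [sum(y[n] ** 2 for y in filtered) / num_taps
            for n in range(len(seq))]
```

On a strictly period-3 toy sequence such as `"ATG" * 40`, the value of `psd_track` in the middle of the sequence is far larger than on a homopolymer of the same length, which is the contrast the threshold is meant to exploit.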

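In simulations of the coding-region prediction algorithm, the filter output is not smooth enough. Therefore, before the prediction results are tallied, a 110th-order moving-average filter is applied to smooth the prediction. Equation (2) is the difference equation of a moving-average filter of order Nma.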

$$\mathrm{PSD}_{ma}[n]=\frac{1}{N_{ma}}\sum_{i=0}^{N_{ma}-1}\mathrm{PSD}[n-i] \tag{2}$$
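A moving-average filter of this kind can be sketched in a few lines. The sketch assumes the usual causal form, in which each output is the mean of the current sample and the preceding order−1 samples, and treats samples before the start of the sequence as zero; both choices, and the name `moving_average`, are assumptions made for illustration.

```python
def moving_average(psd, order):
    """Causal moving average: out[n] = (1/order) * sum(psd[n-i], i = 0..order-1).

    Samples before the start of the sequence count as zero (assumption).
    """
    out = []
    running = 0.0
    for n, value in enumerate(psd):
        running += value                 # add the newest sample
        if n >= order:
            running -= psd[n - order]    # drop the sample leaving the window
        out.append(running / order)
    return out
```

For the 110th-order filter mentioned in the text this would be called as `moving_average(psd, 110)`; the running sum keeps the cost linear in the sequence length.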

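After the moving-averaged power spectrum of a sequence has been computed, it is normalized by its maximum value so that the results of different algorithms can be compared. The predicted non-coding ratio is then used as the threshold, varied over the range 1 to 99, and the filter length is also varied, in order to find the threshold that gives the algorithm its best prediction accuracy. Sensitivity (Sn), specificity (Sp), approximate correlation (AC) and correlation coefficient (CC) are used to evaluate how well the algorithm identifies coding regions[11]. AC serves as the measure of overall prediction accuracy, which facilitates comparison with results in the literature; Sn and Sp serve as reference measures for the study of the reference sequence.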
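The normalization and the nucleotide-level measures can be sketched as below. The formulas follow the definitions commonly attributed to Burset and Guigó[11] (Sn = TP/(TP+FN), Sp = TP/(TP+FP), AC = 2(ACP − 0.5) with ACP the mean of four conditional probabilities). Setting an undefined ratio to 0 is a simplification made here, and all function names are hypothetical.

```python
import math

def normalize_by_max(psd):
    """Scale the curve so that its maximum is 1 (the PSDN curve)."""
    peak = max(psd)
    return [v / peak for v in psd] if peak > 0 else [0.0] * len(psd)

def classify(psdn, cutoff):
    """Mark a position as coding when its PSDN reaches the cutoff."""
    return [v >= cutoff for v in psdn]

def confusion(predicted, actual):
    """Nucleotide-level counts (TP, FP, TN, FN)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, fp, tn, fn

def accuracy_measures(tp, fp, tn, fn):
    """Return Sn, Sp, AC and CC; undefined ratios are set to 0 (simplification)."""
    def ratio(num, den):
        return num / den if den else 0.0
    sn = ratio(tp, tp + fn)
    sp = ratio(tp, tp + fp)
    acp = 0.25 * (sn + sp + ratio(tn, tn + fp) + ratio(tn, tn + fn))
    ac = (acp - 0.5) * 2.0
    cc = ratio(tp * tn - fn * fp,
               math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)))
    return {"Sn": sn, "Sp": sp, "AC": ac, "CC": cc}
```

A typical use would be `accuracy_measures(*confusion(classify(normalize_by_max(psd), 0.5), annotation))`, where `annotation` is a list of booleans marking the annotated coding positions.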

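1.3 Comparison of the computational cost of the three thresholds

Prediction with RSN as the threshold requires computing the mean PSD of each sequence and then deriving from the RSN a corresponding PSDN to use as the threshold. Prediction with PNCR as the threshold in effect requires sorting the PSD of the DNA sequence and then deriving from the specified PNCR a corresponding PSDN to use as the threshold. With PSDN as the threshold, one only needs to choose a PSDN value. The prediction algorithm with PSDN as the threshold therefore has the lowest computational cost.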
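The difference in cost is easiest to see when each threshold is written as a function that returns the PSDN cutoff it implies. The sketch below is one reading of the description above, not the authors' code: the RSN route is taken to mean "RSN times the mean of the curve", the PNCR route to mean "the value below which the given percentage of positions fall", and the function names are hypothetical.

```python
def cutoff_from_rsn(psdn, rsn):
    """RSN route: one pass over the curve for its mean, then scale by RSN."""
    return rsn * sum(psdn) / len(psdn)

def cutoff_from_pncr(psdn, pncr_percent):
    """PNCR route: sort the curve, then read off the value that leaves
    pncr_percent of the positions below it (predicted non-coding)."""
    ordered = sorted(psdn)
    index = min(len(ordered) - 1, int(len(ordered) * pncr_percent / 100.0))
    return ordered[index]

def cutoff_from_psdn(psdn, level):
    """PSDN route: the chosen level is already the cutoff; no pass is needed."""
    return level
```

Under this reading the RSN route costs a linear pass per sequence, the PNCR route costs a sort, and the PSDN route costs nothing beyond the comparison itself.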

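2 Results and Analysis

2.1 Implementation of the narrow pass-band filter

Two APNPBF (all-phase NPBF) narrow pass-band filters, with window lengths of 119 and 599, were used in the coding-region prediction experiments; Figure 1 shows the frequency response of the APNPBF with window length 599. For DNA sequences in the test sets shorter than 600 bp, the filter with window length 119 was used, to reduce the distortion of the prediction results caused by zero-padding or other extension of the input sequence.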

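2.2 Coding-region prediction results

The three thresholds were applied to sequence AB003730. The prediction curves obtained with RSN, PNCR and PSDN as thresholds are shown in Figures 2a, 2b and 2c, respectively; the ROC curves of the corresponding predictions are shown in Figure 2d, and the upper-left corner of the ROC plot is enlarged in Figure 2e. For RSN, the ROC curve is obtained by letting RSN take 100 values from 0.08 to 8.00 in steps of 0.08 and plotting the resulting (FPR, TPR) pairs in the plane. The ROC curves for PNCR and PSDN are obtained in the same way, with ranges 1≤PNCR≤100 and 0.01≤PSDN≤1.00 and a step of 0.01 in both cases. The larger the area under the ROC curve (AUC), the better the algorithm discriminates between coding and non-coding regions.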
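The construction of an ROC curve by sweeping a threshold, and the AUC by the trapezoid rule, can be sketched as follows. The threshold grid matches the PSDN range quoted above; the anchoring of the curve at (0, 0) and (1, 1) and the function names are assumptions made for illustration, and the sketch requires at least one coding and one non-coding position in the annotation.

```python
def roc_points(scores, actual, thresholds):
    """One (FPR, TPR) pair per threshold; a position is predicted coding
    when its score is at or above the threshold."""
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, a in zip(scores, actual) if a and s >= t)
        fp = sum(1 for s, a in zip(scores, actual) if not a and s >= t)
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve, with the end points (0,0) and (1,1) added."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# PSDN grid from the text: 0.01 to 1.00 in steps of 0.01
psdn_thresholds = [i / 100.0 for i in range(1, 101)]
```

The same two functions apply to the RSN and PNCR sweeps once each threshold value has been converted to a cutoff on the curve.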

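The best prediction results for sequence AB003730 with the three thresholds are given in Table 1. The best prediction result is the one for which, as the threshold varies over a given range, a particular value yields the highest prediction accuracy AC. For the RSN threshold, the recommended selection range is 1

The three thresholds were then applied to the ALLSEQ and HMR195 DNA sequence sets; the results are given in Table 2. Table 2 shows that PSDN as the threshold achieves the highest prediction accuracy on both ALLSEQ and HMR195, and that RSN as the threshold gives better results than PNCR.

RSN as a threshold predicts regions with a strong coding signal as coding regions; it emphasizes the difference in PSD magnitude between the coding and non-coding regions of a DNA sequence. However, it fails to identify correctly both coding regions with a weak coding signal and coding regions with a strong signal that occupy a large fraction of the full sequence length. PNCR as a threshold forces every sequence to have one fixed percentage of coding region, which does not match reality. PSDN as a threshold emphasizes only the strength of the periodicity of the coding regions in a DNA sequence and disregards the contribution of non-coding regions and noise; of the three thresholds, it maximizes the chance that coding regions are identified.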

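3 Summary

The threshold problem in stand-alone gene prediction algorithms was studied and a new threshold, PSDN, was proposed. The results show that PSDN as the threshold gives the best prediction accuracy and simplifies the NPBF prediction algorithm. Compared with prediction using RSN or PNCR as the threshold, it markedly improves the results when coding regions make up a large fraction of the DNA sequence length.

References: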

[1] CHEN B,JI P.Visualization of the protein-coding regions with a self adaptive spectral rotation approach[J]. Nucleic Acids Research,2011,39(1):e3.

[2] MEHER J, MEHER P K,DASH G.Improved comb filter based approach for effective prediction of protein coding regions in DNA sequences[J]. Journal of Signal and Information Processing,2011,2(2):88-99.

[3] MA Y T,CHE J,LU X G,et al. A new algorithm for predicting protein coding regions based on the hybrid threshold [A]. The 2012 5th International Conference on Biomedical Engineering and Informatics[C]. Chongqing:IEEE Engineering in Medicine and Biology Society,2012.846-849.

[4] TIWARI S,RAMACHANDRAN S, BHATTACHARYA A, et al. Prediction of probable genes by Fourier analysis of genomic sequences[J]. Computer Applications in the Biosciences,1997, 13(3):263-270.

[5] MENA-CHALCO J P, CARRER H, ZANA Y, et al. Identification of protein coding regions using the modified Gabor-Wavelet transform[J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics,2008,5(2):198-206.

[6] AMBIKAIRAJAH E, EPPS J,AKHTAR M.Gene and exon prediction using time domain algorithms[A]. IEEE 8th Int Symp Symposium on Proceedings of the Eighth International Signal Processing and its Applications[C]. Sydney:Signal Processing and its Applications,2005.199-202.

[7] AKHTAR M, EPPS J,AMBIKAIRAJAH E. Signal processing in sequence analysis:Advances in Eukaryotic gene prediction [J].IEEE Journal of Selected Topics in Signal Processing,2008, 2(3):310-321.

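[8] MA Y T,CHE J,GUAN X,et al. Protein coding region prediction algorithm using a windowed narrow pass-band filter[J]. Journal of Data Acquisition and Processing,2013,28(2):129-135. (in Chinese)

[9] MA Y T,XUAN X W,CHE J,et al. Gene prediction based on all-phase filtering theory[J]. Journal of Shanghai Jiao Tong University,2013,47(7):1149-1154. (in Chinese)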

[10] ROGIC S,MACKWORTH A K,OUELLETTE B F.Evaluation of gene-finding programs on mammalian sequences[J].Genome Research,2001,11(5):817-832.

[11] BURSET M,GUIGO R.Evaluation of gene structure prediction programs[J].Genomics,1996,34(3):353-367.
