(School of Software, East China Jiaotong University, Nanchang, Jiangxi 330013, China)
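Complex diseases are a class of threats to human health that arise from multiple factors and whose mechanisms of formation remain unclear. Common examples include mental disorders, multiple sclerosis, and tumors, with tumors among the most common of them. According to the latest statistics from the Chinese Cancer Registry Annual Report, about 6 people in China are diagnosed with cancer every minute, and patients are trending younger; tumors therefore pose a major threat to the national quality of life. A single nucleotide polymorphism (SNP) is a genetic variation at the level of the DNA sequence that may substantially alter biomolecules such as regulatory elements, genes, and protein structures, thereby increasing an individual's risk of developing a tumor. Researchers worldwide have carried out genome-wide association studies (GWAS) on different tumors and have identified a number of important SNPs, which are recorded in the GWAS Catalog[1]. Further analysis, however, has shown that traditional GWAS suffers from poorly reproducible results, low interpretability, and missing heritability. The key causes of these shortcomings are an insufficient understanding of the interactions among susceptibility loci (epistasis) and the examination of SNP data in isolation. From a computer science perspective, they can be roughly summarized in three points. First, genome-wide SNP data contain more than a million loci, which places heavy demands on computational methods and hardware resources in bioinformatics and makes in-depth mining difficult[2]. Second, the lack of a systematic and complete understanding of complex diseases such as tumors leaves their definitions vague or even ambiguous, so that case samples exhibit several different genetic architectures (heterogeneity), which to some extent masks the correlation between genetic variants and different tumor subtypes[3]. Third, the onset and progression of tumors involve interactions among many kinds of biomolecules; analyzing only one level of omics data increases the deviation from the true disease model, making it hard to find the real and complete set of risk factors and leading to missing heritability[4].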
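To make the first point concrete, the short sketch below counts how many SNP combinations an exhaustive epistasis scan would have to test. It is only an illustration: the panel size of one million loci is a round number taken from the "more than a million loci" figure above, and the function name `count_snp_combinations` is hypothetical rather than part of any cited tool.

```python
from math import comb


def count_snp_combinations(num_snps: int, order: int) -> int:
    """Return the number of distinct SNP sets of size `order` among `num_snps` loci."""
    return comb(num_snps, order)


# Assumed, illustrative panel size: one million loci.
NUM_SNPS = 1_000_000

if __name__ == "__main__":
    # Order 2 corresponds to pairwise interactions, order 3 to three-locus interactions.
    for order in (2, 3):
        n_tests = count_snp_combinations(NUM_SNPS, order)
        print(f"order {order}: {n_tests:.3e} candidate combinations")
```

The count grows roughly as the panel size raised to the interaction order, which is why exhaustive search beyond pairs quickly becomes impractical without filtering, heuristics, or specialized hardware.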
[1]Welter D,MacArthur J,Morales J,et al.The NHGRI GWAS Catalog,a curated resource of SNP-trait associations[J].Nucleic Acids Research,2014,42(Database issue):1001-6.
[2]Xiong H Y,Alipanahi B,Lee L J,et al.The human splicing code reveals new insights into the genetic determinants of disease[J].Science,2015,347(6218):1254806.
[3]Urbanowicz R J,Andrew A S,Karagas M R,et al.Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome:a learning classifier system approach[J].Journal of the American Medical Informatics Association,2013,20(4):603-612.
[4]Li P,Guo M,Wang C,et al.An overview of SNP interactions in genome-wide association studies[J].Briefings in functional genomics,2014,14(2):143-55.
[5]Zeng T,Zhang W W,Yu X T,et al.Edge biomarkers for classification and prediction of phenotypes[J].Science China Life Sciences,2014,57(11):1103-1114.
[6]Liu R,Wang X,Aihara K,et al.Early diagnosis of complex diseases by molecular biomarkers,network biomarkers,and dynamical network biomarkers[J].Medicinal research reviews,2014,34(3):455-478.
[7]Ritchie M D,Holzinger E R,Li R,et al.Methods of integrating data to uncover genotype-phenotype interactions[J].Nature Reviews Genetics,2015,16(2):85-97.
[8]Patil N,Berno A J,Hinds D A,et al.Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21[J].Science,2001,294(5547):1719-1723.
[9]Ting C K,Lin W T,Huang Y T.Multi-objective tag SNPs selection using evolutionary algorithms[J].Bioinformatics,2010,26(11):1446-1452.
[10]Liao B,Li X,Zhu W,et al.A novel method to select informative SNPs and their application in genetic association studies[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2012,9(5):1529-1534.
[11]Liao B,Li X,Cai L,et al.A Hierarchical Clustering Method of Selecting Kernel SNP to Unify Informative SNP and Tag SNP[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2015,12(1):113-122.
[12]Li X,Liao B,Cai L,et al.Informative SNPs selection based on two-locus and multilocus linkage disequilibrium:Criteria of max-correlation and min-redundancy[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2013,10(3):688-695.
[13]Hung C L,Chen W P,Hua G J,et al.Cloud computing-based tag SNP selection algorithm for Human Genome Data[J].International journal of molecular sciences,2015,16(1):1096-1110.
[14]Wu C,Cui Y.Boosting signals in gene-based association studies via efficient SNP selection[J].Briefings in bioinformatics,2014,15(2):279-291.
[15]Mooney M,Wilmot B,McWeeney S.The GA and the GWAS:using genetic algorithms to search for multilocus associations[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2012,9(3):899-910.
[16]Jia P,Zhao Z.Network-assisted analysis to prioritize GWAS results:principles,methods and perspectives[J].Human genetics,2014,133(2):125-138.
[17]Gibson G.Hints of hidden heritability in GWAS[J].Nature genetics,2010,42(7):558-560.
[18]Jing P J,Shen H B.MACOED:a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies[J].Bioinformatics,2015,31(5):634.
[19]Ritchie M D,Hahn L W,Roodi N,et al.Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer[J].The American Journal of Human Genetics,2001,69(1):138-147.
[20]Yang C H,Lin Y D,Yang C S,et al.An efficiency analysis of high-order combinations of gene-gene interactions using multifactor-dimensionality reduction[J].BMC Genomics,2015,16(1):489.
[21]Hemani G,Theocharidis A,Wei W,et al.EpiGPU:exhaustive pairwise epistasis scans parallelized on consumer level graphics cards[J].Bioinformatics,2011,27(11):1462-1465.
[22]Kam-Thong T,Pütz B,Karbalai N,et al.Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs[J].Bioinformatics,2011,27(13):i214-i221.
[23]Sluga D,Curk T,Zupan B,et al.Heterogeneous computing architecture for fast detection of SNP-SNP interactions[J].BMC bioinformatics,2014,15(1):216.
[24]Kässens J C,Wienbrandt L,González-Domínguez J,et al.High-speed exhaustive 3-locus interaction epistasis analysis on FPGAs[J].Journal of Computational Science,2015,9:131-136.
[25]Guo X,Meng Y,Yu N,et al.Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering[J].BMC bioinformatics,2014,15(1):102.
[26]Zhang Y,Liu J S.Bayesian inference of epistatic interactions in case-control studies[J].Nature genetics,2007,39(9):1167-1173.
[27]Mao W,Lee J.A combinatorial analysis of genetic data for Crohn’s disease[C]//Bioinformatics and Biomedical Engineering,2007.ICBBE 2007.The 1st International Conference on.IEEE,2007:1031-1034.
[28]Wang Y,Liu X,Robbins K,et al.AntEpiSeeker:detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm[J].BMC research notes,2010,3(1):117.
[29]Chattopadhyay A S,Hsiao C L,Chang C C,et al.Summarizing techniques that combine three non-parametric scores to detect disease-associated 2-way SNP-SNP interactions[J].Gene,2014,533(1):304-312.
[30]Ding X,Wang J,Zelikovsky A,et al.Searching high-order SNP combinations for complex diseases based on energy distribution difference[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2015,12(3):695-704.
[31]Zhang F,Boerwinkle E,Xiong M.Epistasis analysis for quantitative traits by functional regression model[J].Genome research,2014,24(6):989-998.
[32]Kam-Thong T,Azencott C A,Cayton L,et al.GLIDE:GPU-based linear regression for detection of epistasis[J].Human heredity,2012,73(4):220-236.
[33]Beam A L,Motsinger-Reif A,Doyle J.Bayesian neural networks for detecting epistasis in genetic association studies[J].BMC bioinformatics,2014,15(1):368.
[34]Lee I,Blom U M,Wang P I,et al.Prioritizing candidate disease genes by network-based boosting of genome-wide association data[J].Genome research,2011,21(7):1109-1121.
[35]Chen L S,Hutter C M,Potter J D,et al.Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data[J].The American Journal of Human Genetics,2010,86(6):860-871.
[36]Braun R,Buetow K.Pathways of distinction analysis:a new technique for multi-SNP analysis of GWAS data[J].PLoS Genetics,2011,7(6):e1002101.
[37]Askland K,Read C,O’Connell C,et al.Ion channels and schizophrenia:a gene set-based analytic approach to GWAS data for biological hypothesis testing[J].Human genetics,2012,131(3):373-391.
[38]Yang C H,Lin Y D,Chaung L Y,et al.Evaluation of breast cancer susceptibility using improved genetic algorithms to generate genotype SNP barcodes[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB),2013,10(2):361-371.
[39]Li X,Liao B,Chen H.A new technique for generating pathogenic barcodes in breast cancer susceptibility analysis[J].Journal of theoretical biology,2015,366:84-90.
[40]Holzinger E R,Ritchie M D.Integrating heterogeneous high-throughput data for meta-dimensional pharmacogenomics and disease-related studies[J].Pharmacogenomics,2012,13(2):213-222.
[41]Huang R S,Duan S,Bleibel W K,et al.A genome-wide approach to identify genetic variants that contribute to etoposide-induced cytotoxicity[J].Proceedings of the National Academy of Sciences,2007,104(23):9758-9763.
[42]Huang Y T,VanderWeele T J,Lin X.Joint analysis of SNP and gene expression data in genetic association studies of complex diseases[J].The annals of applied statistics,2014,8(1):352.
[43]Kang M,Zhang C,Chun H W,et al.eQTL epistasis:detecting epistatic effects and inferring hierarchical relationships of genes in biological pathways[J].Bioinformatics,2015,31(5):656-664.
[44]Shabalin A A.Matrix eQTL:ultra fast eQTL analysis via large matrix operations[J].Bioinformatics,2012,28(10):1353-1358.
[45]Giacalone G,Clarelli F,Osiceanu A M,et al.Analysis of genes,pathways and networks involved in disease severity and age at onset in primary-progressive multiple sclerosis[J].Multiple Sclerosis,2015,21(11).
[46]Wang J G.Molecular network models of complex diseases[J].Scientia Sinica Mathematica (Chinese edition),2014,44(4):317-328.
[47]Fridley B L,Lund S,Jenkins G D,et al.A Bayesian integrative genomic model for pathway analysis of complex traits[J].Genetic epidemiology,2012,36(4):352-359.
[48]Mankoo P K,Shen R,Schultz N,et al.Time to recurrence and survival in serous ovarian tumors predicted from integrated genomic profiles[J].PLoS One,2011,6(11):e24709.
[49]Kim D,Shin H,Song Y S,et al.Synergistic effect of different levels of genomic data for cancer clinical outcome prediction[J].Journal of biomedical informatics,2012,45(6):1191-1198.
[50]Holzinger E R,Dudek S M,Frase A T,et al.ATHENA:the analysis tool for heritable and environmental network associations[J].Bioinformatics,2014,30(5):698-705.
[51]Drăghici S,Potter R B.Predicting HIV drug resistance with neural networks[J].Bioinformatics,2003,19(1):98-107.
[52]Liang M,Li Z,Chen T,et al.Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2014,12(4):928-937.
Methods for mining omics data of complex diseases
LI Xiong
(School of Software,East China Jiaotong University,Nanchang 330013,China)
At present, analyses of a single type of omics data have uncovered some of the genetic and environmental factors truly associated with tumors, but these may be only the tip of the iceberg, with the remainder hidden behind complex genetic mechanisms. The key reason for this limitation may be that the disease models are too simplistic, in that they ignore the interrelationships among multi-level omics data. Studies suggest that a deeper understanding of genome-wide SNP data, further integration of multi-source omics data, and a better grasp of phenomena such as epistasis and heterogeneity would improve cancer risk assessment and help realize the goals of personalized medicine. This paper reviews current data mining methods for complex diseases from the perspectives of SNP data analysis and multi-source data analysis.
SNP; genome-wide association study; systems biology; machine learning