Development of Data Generation Models and Optimized Processing Methods for Next-Generation Sequencing Technologies

2016-05-30 10:23 Li Xuan, Zhu Yan
Science and Technology Innovation Herald (科技创新导报), 2016, Issue 11

Li Xuan, Zhu Yan

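Abstract: To meet the challenges that ultra-large-scale sequencing data pose to processing methods and processing capacity, this project starts from the source of data generation on the various next-generation sequencing platforms, including Illumina/Solexa, Roche/454, AB/SOLiD, and the domestically developed AG-100/200 sequencing systems. It studies the characteristics of the data, experimental design strategies, and data processing methods; develops coding models and high-throughput experimental design theory and methods for next-generation sequencing; studies mathematical models and quality control methods for the data of each sequencing platform; and develops efficient processing methods and workflows for high-throughput sequencing data, together with integrative analysis methods for cross-platform data. Building on mathematical models and quality control methods for next-generation sequencing technologies and sequencing data, the project establishes coding and experimental design theory for next-generation sequencing. These theories and methods provide important guidance for sequencing data processing and will also improve the AG system, a next-generation sequencer developed independently in China. The project will establish next-generation sequencing data processing methods, algorithms, reconfigurable software workflows, and cross-platform integrative analysis methods that suit multiple platforms and multiple applications, and will develop hardware acceleration techniques for processing large volumes of sequence data. Its progress will help bring China's research in bioinformatics and high-throughput sequencing technology to the international forefront. In the first year or so of the project, the participating teams and partner institutions worked actively toward its main goals and achieved notable progress, laying a good foundation for the next stage of work. The main advances are as follows: a new decoding sequencing-by-synthesis technology system was developed; a sequencing error model and raw sequencing data processing algorithms were established; the framework of the data processing software for the AG sequencing system was built and its main modules were completed; for coding theory and methods oriented to multi-sample sequencing experiments, an optimized design method for sequencing sample codes was established and a dual-tag encoding scheme for high-throughput sequencing library preparation was proposed; encoding methods for pooled DNA samples based on group testing theory were studied, and a balanced encoding design algorithm for Pool-Seq experiments and a grouping design algorithm based on hypergeometric distribution calculations were proposed; data characteristics were analyzed across different high-throughput sequencing platforms and different omics applications (genome and transcriptome), and several application cases were completed; an optimized data processing and assembly workflow for high-throughput transcriptome sequencing data (RNA-seq) was completed; and several patent applications were filed, forming a core technology system for next-generation sequencing with Chinese proprietary intellectual property rights.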

Keywords: next-generation sequencing; technology platform; mathematical model; error analysis; experimental design; optimization

Abstract: The challenges that large-scale sequencing data pose to processing methods and processing capacity are a focus of bioinformatics today. Such data are generated by a variety of next-generation sequencing platforms, including Illumina/Solexa, Roche/454, AB/SOLiD, and the domestic AG-100/200 sequencing systems. We develop experimental design strategies and data processing methods, coding models for high-throughput experimental design, and data quality control methods for a variety of sequencing platforms. Building on mathematical models and quality control methods for sequencing data, we establish coding theory and experimental design theory for next-generation sequencing. These theoretical methods give important guidance for sequencing data processing and will improve our self-developed next-generation sequencing AG system. The project will establish next-generation sequencing data processing methods, algorithms, and reconfigurable software workflows that adapt to a variety of platforms and applications, and will develop hardware acceleration technology for sequence data processing. This progress will help bring bioinformatics and sequencing technology research in China to the world's frontier. In more than a year of work toward the main goals of the project, the participating teams and cooperating institutions made notable progress and laid a good foundation for the later stages of the project. Major progress includes a new decoding sequencing-by-synthesis technology system, a sequencing error model and raw sequencing data processing algorithms, and a framework for the data processing software of the AG sequencing system.
For coding theory and methods oriented to multi-sample sequencing experiments, we established an optimized design method for sequencing sample codes and proposed a dual-tag encoding scheme for high-throughput sequencing library preparation. For encoding pooled DNA samples on the basis of group testing, we proposed a balanced encoding design algorithm for Pool-Seq experiments and a grouping design algorithm based on hypergeometric distribution calculations. Data analysis applications for different omics and an optimized assembly workflow were completed, and patent applications were filed. Together these form the core of our own proprietary technology system for next-generation sequencing.

Key Words: Next-generation sequencing; Technology platform; Mathematical model; Error analysis; Experimental design; Optimization

Full-text link (real-name registration required): http://www.nstrs.cn/xiangxiBG.aspx?id=49573&flag=1
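
The abstract mentions a grouping design algorithm for Pool-Seq based on hypergeometric distribution calculations but does not give its details. As a rough illustration only, the sketch below shows the kind of calculation such a design could rest on: the probability that a pool of samples drawn without replacement contains at least one carrier of a variant. The function names, the parameters, and the choice of this particular quantity are assumptions made for this example and are not taken from the report.

```python
from math import comb


def hypergeom_pmf(k: int, total: int, carriers: int, pool_size: int) -> float:
    """P(exactly k carriers in a pool of pool_size drawn from total samples).

    Hypothetical helper: standard hypergeometric probability mass function.
    """
    if k < 0 or k > carriers or k > pool_size:
        return 0.0
    return (comb(carriers, k) * comb(total - carriers, pool_size - k)
            / comb(total, pool_size))


def prob_pool_has_carrier(total: int, carriers: int, pool_size: int) -> float:
    """P(a pool contains at least one carrier) = 1 - P(zero carriers)."""
    if not 0 <= carriers <= total:
        raise ValueError("carriers must be between 0 and total")
    if not 1 <= pool_size <= total:
        raise ValueError("pool_size must be between 1 and total")
    return 1.0 - hypergeom_pmf(0, total, carriers, pool_size)


if __name__ == "__main__":
    # Illustrative numbers only: 10 samples, 1 carrier, pools of 5 samples.
    print(prob_pool_has_carrier(10, 1, 5))
```

With 10 samples, a single carrier, and pools of 5, the probability that a given pool is positive is 0.5. A grouping design could compare such probabilities across candidate pool sizes, although the criterion the authors actually use is not stated in the abstract.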
