宋娜娜,钟金城,柴志欣,汪琦,何世明,吴锦波,蹇尚林,冉强,蒙欣,胡红春
三江黄牛全基因组数据分析
宋娜娜1,2,钟金城1,2,柴志欣1,2,汪琦1,2,何世明3,吴锦波3,蹇尚林4,冉强5,蒙欣5,胡红春4
(1西南民族大学动物遗传育种学国家民委-教育部重点实验室,成都 610041;2西南民族大学青藏高原研究院成都 610041;3阿坝州畜牧科学研究所,四川汶川 623000;4阿坝州畜牧工作站,四川汶川 623000;5汶川县畜牧工作站,四川汶川 623000)
【目的】研究三江黄牛群体遗传多样性,从基因组层面讨论其群体遗传变异情况。【方法】提取50个体基因组总DNA,等浓度等体积混合,构建混合样本DNA池,利用CovarisS2进行随机打断基因组DNA,电泳回收长度500 bp的DNA片段,构建DNA文库。应用Illumina HiSeq 2000测序,最终得到测序数据。利用BWA软件将短序列比对到牛参考基因组(UMD 3.1),来检测三江黄牛基因组突变情况。SAMtools、Picard-tools、GATK、Reseqtools对重测序数据进行分析,Ensembl、DAVID、dbSNP数据库对SNPs和indels进行注释。【结果】全基因组重测序分析共计得到77.8 Gb序列数据,测序深度为25.32×,覆盖率为99.31%。测序得到778 403 444个reads和77 840 344 400个碱基,比对到参考基因组(UMD 3.1)的reads为673 670 505,碱基为67 341 451 555,匹配率分别为86.55%和86.51%,成对比对上的reads数为635 242 898(81.61%),成对比对上的碱基数为63 512 636 924(81.59%);共确定了20 477 130个SNPs位点和1 355 308个indels,其中2 147 988个SNPs(2.4%)和90 180个indels(6.7%)是新发现的。总SNPs中,鉴定出纯合SNPs989 686(4.83%),杂合SNPs19 487 444(95.17%),纯合/杂合SNP比为1﹕19.7。转换数为14 800 438个,颠换为6 680 058个,转换/颠换(TS/TV)为2.215。剪切位点突变SNP727个,开始密码子变非开始密码子SNP117个,提前终止密码子的SNP530个,终止密码子变非终止密码子SNP88个。检测到非同义突变数为57 621,同义突变为83 797,非同义/同义比率为0.69。检测到非同义SNPs分布在9 017个基因上,其中发现567个基因与已报道的重要经济性状相符,肉质、抗病、产奶、生长性状、生殖等相关基因的数量分别为471、77、21、10、8个,其中包括功能相重叠的基因;indels数据中,缺失数量为693 180(51.15%),插入数量为662 148(48.85%),纯合indels数量为161 198(11.89%),杂合indels数量1 194 110(88.11%),大部分的变异都位于基因间隔区和内含子区;三江黄牛全基因组杂合度()、核苷酸多样性()及theta W分别为7.6×10-30.0 0390.0 040,说明其遗传多样性较为丰富。三江黄牛群体Tajima'D为-0.06 832,推测可能由于群体内存在不平衡选择所致。【结论】本研究为进一步分析与经济性状相关的遗传学机制和保护三江黄牛品种遗传多样性提供了基因组数据支持。
三江黄牛;基因组;第二代测序技术;SNP;indel
【研究意义】三江黄牛原产于四川省阿坝藏族自治州汶川县,其中三江、白石、水磨和映秀等乡镇为主产区,在理县、茂汶等县市均有分布[1];三江黄牛具有躯干较长、役用性能良好、肉质好、抗病力和适应性强等特点,是经长期自然选择和人工选育而成,在遗传资源上是一个极为宝贵的基因库,但由于当地经济社会的发展和农业生产方式的改变,以及2008年汶川特大地震的发生导致三江黄牛产区功能的布局空间受限,使三江黄牛养殖规模、种群数量锐减,已濒临灭绝,因此,保护三江黄牛遗传资源显得尤为重要[2]。【前人研究进展】基因组包含了一个物种全部的遗传信息,全基因测序是解读基因组的核心技术,揭示基因组的多样性和信息的复杂性,最早的测序技术源于20世纪60年代中期[3-5],70年代后期第一代测序体系逐渐建立,主要有Sanger等[6]发明的双脱氧链末端终止法、Maxam等[7]发明的化学降解法;随着生物技术的发展,二代测序技术逐渐走入大众视野,即大规模平行测序平台(massively parallel DNA sequeneing platform),主要包括:焦磷酸测序Roche/454 FLX、基于边合成边测序的Illumina/Solexa技术和边连接边测序的SOLID技术。测序技术不断发展,有助于对基因组进行更全面和更深入的解析,使得解析稀有物种的基因组,以及对转录组、表达谱、小RNA等大规模的功能基因组学的研究成为可能。人类基因组计划的完成,开启了不同物种全基因组测序的时代。牛基因组测序和HapMap计划的完成[8-9],鉴定出相当数量的遗传变异,其中单核苷酸多态性(SNP)是研究最为广泛的遗传变异类型,用于鉴定基因与牛表型变异相关的基因组区域,现已测序了多个牛种的全基因组[8,10-17]。利用Illumina HiSeq II平台,Eck等[8]在花斑牛上检测到240万SNPs,利用相同的测序平台,KAWAHARA-MIKI等[10]共鉴定了603万SNPs,采用SOLID技术,STOTHARD等[11]成功比对了黑安格斯和荷斯坦公牛的基因组变异,确定了约700万个SNPs和790拷贝数变异。同时,WGS-SNP位点衍生到全基因组关联研究,能够以更高的精度预测物种的重要经济性状,以及检测整个基因组的重要信息[18-19]。【本研究切入点】近年来对黄牛基因组层面的变异研究较多,但对三江黄牛全基因组研究尚未见报道,对三江黄牛品种的遗传资源研究相对匮乏。【拟解决的关键问题】从基因组层面揭示三江黄牛的变异情况,探讨三江黄牛遗传多样性,为进一步分析与经济性状相关的遗传学机制和保护三江黄牛品种遗传多样性提供基因组数据支持。
1.1 供试材料
样本采集于2015年4月,地点是四川省阿坝州汶川县的三江乡和水磨镇,两乡镇是三江黄牛主要分布区域,选取毛色黄色、体型较大、特征明显的三江黄牛个体,采集其耳组织样50个,75%乙醇保存,带回实验室倒出保存液,-80℃保存备用。
1.2 DNA文库的构建及测序
采用苯酚-氯仿法提取基因组DNA,1.5%琼脂糖凝胶电泳和A260/A280的比值检测DNA的纯度和浓度,将50个样本的基因组DNA等浓度等体积混合,利用CovarisS2进行随机打断,电泳回收所需长度的DNA片段(—500 bp),加上接头,进行cluster制备,最后应用Illumina HiSeq 2000测序仪,Paired-end法对插入片段进行测序,双端测序的长度为150 bp,最终得到测序数据。
1.3 测序数据的质量控制和数据过滤
为保证数据的质量,测序原始数据要经过质量控制控和数据过滤,在信息分析前对数据进行质控,并通过数据过滤来减少数据的噪音。通过分析碱基的组成和质量值可控制原始数据的质量(图1)。由图1-a可以看出测序得到低质量(Q<20)碱基含量较低,图1-b看出A、T曲线重合,G、C曲线重合,说明碱基组成平衡,测序结果较好,可以进一步对数据进行处理分析。将得到的原始测序序列(raw reads)里有部分带接头或低质量的reads进行过滤,得到高质量的净数据(clean data),后续分析都基于净数据。数据过滤主要是去除带接头的成对reads;去除单端read中N碱基(N表示无法确定碱基信息)比例大于10%的成对reads;当单端测序read中含有低质量(质量值Q≤7)碱基数超过该条read碱基总数的30%时,去除此成对reads。
(a)碱基质量值;(b)碱基分布比
1.4 基因组数据组装和测序数据处理
利用BWA[20]软件将序列比对到参考基因组。应用工具包SAMtools、Picard-tools对比对结果进行统计、预处理(排序,去重复等),基因组分析工具包(genome analysis tool kit, GATK)[21]完成SNP/indel检测,即经比对在获得样本所有SNP信息的基础上,将检测到的基因型与参考序列之间存在多态性的位点进行过滤,得到高可信度的SNP/indel数据集,将所得到的SNPs和indels调用为VCF格式,比对到dbSNP数据库,鉴定新的SNPs及indels。Break Dancer[22]工具包分析结构变异(SV),最后应用Reseqtools[23]工具对变异结果进行注释统计作图等。
2.1 数据产出
2.1.1 净数据 测序共获得三江黄牛基因组77.8G净数据(Clean data),将所得到的Clean data进行统计(表1)。以普通牛基因组(UMD 3.1)(GCA_000003055.3)为参考,使用BWA软件[21]将clean reads比对到参考基因组(表2),测序深度为25.32×,覆盖率达99%以上,说明具有较高的单碱基正确性,比对到参考基因组reads和碱基的比率分别为86.55%和86.51%,说明测序样品同参考物种相似度高,亲缘关系较近。
2.1.2 染色体测序深度和覆盖度 对三江黄牛每条染色体测序深度和覆盖度统计。整个基因组测序深度为25.32×,深度最高为14号染色体26.55×,最低为X染色体21.54×。基因组的覆盖率为99.22%,其中X染色体覆盖率97.59%。由图2可知,覆盖上的reads和染色体长度成正比。
2.1.3 GC含量 GC含量对测序有一定的影响,高GC和低GC的区域会使测序的难度加大,导致部分序列无法准确测出,由图3可知,整个GC分布范围内覆盖深度较好(25×),GC含量结果无明显偏向性,说明建库与测序质量良好。
表1 三江黄牛净数据
表2 三江黄牛数据比对到参考基因组
图2 reads覆盖各染色体区域的长度
图3 GC含量和测序深度
2.2 SNP检测
GATK的unifiedGenotyper完成对三江黄牛样品的SNP检测,共检测到20 477 130个SNPs位点,SNP密度确定为大约每131个碱基含有一个突变位点,突变分布能够定位各种与经济性状相关联的候选基因组区域。将SNP比对到dbSNP数据库(图4),数据库中共计90 045 399个SNPs,三江黄牛SNPs与数据库未匹配上的为2 147 988个,说明其为新发现的SNPs,占总SNPs的2.4%。纯合SNPs数为989 686(4.83%),杂合SNPs数为19 487 444(95.17%),纯合/杂合SNP比为1﹕19.7。基因间隔区的SNPs为15 009 500,占总SNPs的73.29%,大多数的SNPs集中在基因间隔区和内含子区,少部分在外显子、剪接位点和非编码区域(表4)。转换TS(transition)/颠换TV(transversion)是检测随机序列误差的重要指标,是对SNP的质量评估,经验值TS/TV>2.1[24],三江黄牛SNP转换数为14 800 438个,颠换为6 680 058个,转换/颠换(TS/TV)为2.215(图4),高于经验值,说明所识别大多数的SNP是准确的。在所有SNP中,由于SNP位点突变导致剪切位点突变和编码氨基酸密码子变化的SNP共1 462个,其中剪切位点突变SNP 727个,开始密码子变为非开始密码子SNP 117个,提前终止密码子SNP 530个,终止密码子变非终止密码子SNP 88个,在染色体上分布情况如图5。
图4 SNPs(indels)比对到dbSNP数据库和转换/颠换比
图5 SNPs位点突变效应在每个染色体上的数量分布
人类和其他动物许多表型都与非同义SNP(non- synonymous SNP, nsSNP)相关[25],SNP注释是提供SNP位点与功能相关联的依据。本研究共检测到57 621个非同义突变,83 797个同义突变,非同义/同义SNP比率为0.69。Ensembl[26]数据库对非同义SNP注释得到9 017个基因(电子附表1,http:// pan.baidu.com/s/1qXN18dA)。DAVID[27]数据库对含nsSNP较多的108个基因进行功能基因富集分析(电子附表2,http://pan.baidu.com/s/1qXN18dA),基因功能注释可分为7类,主要集中在生化、代谢、免疫等过程(电子附表3,http://pan.baidu.com/s/1qXN18dA),其中免疫功能相关的GO:0006955涉及到免疫应答,GO:0019882涉及到抗原加工和呈递,对机体有重要作用,包括、、等基因。同时还分析了nsSNP与肉质、产奶量、生长速度等重要经济性状的相关性,发现567个基因与已报道的重要经济性状相关[10,28-34],并对其基因进行注释(电子附表4,http://pan.baidu.com/s/1qXN18dA),471个与肉质相关基因,77个与抗病相关的基因,21个与产奶性状相关基因,10个与生长性状相关的基因,8个与生殖相关的基因,567个基因中还包括一些功能相重叠的基因,例如同时和肉质、生长性状相关的、基因,与肉质和抗病均相关的、基因,产奶和抗病均相关的基因等。还有一些研究相对较多且机制较为清晰的基因,包括生长激素(),生长激素受体()和瘦肉素受体()催乳素受体()基因[29,32]等。
表3 SNPs和indels的注释
图6 SNPs和indels在每个染色体上数量分布
2.3 indel检测
最近研究中认为高于50 bp的缺失(deletion)和插入(insertion)为结构变异,而低于50 bp的deletion和insertion合称indel[35]。indel是基因组中除SNP数量最多的变异,三江黄牛共检测到1 355 308个indels,缺失和插入数量分别为693 180和662 148个,比例分别为51.15%、48.85%,比对到数据库共发现90 180个新indels,占总indels的6.7%(图4)。纯合indels为161 198(11.89%),杂合indels为1 194 110(88.11%)。indels注释情况(表3),基因间隔区的indels最多为982 443个,占总indels的72.49%。CDS区indels1 545个,外显子indels4 137个,3′端非编码区和5′端非编码区indels数分别为3 606和240。SNPs和indels在每个染色体上数量分布(图6),除11、12、13号染色体上的SNP外,其余每条染色体上的SNP数均与染色体长度相关,随染色体长度的减小而降低。indels在每条染色体上的长度分布随染色体长度减小而降低。插入和缺失在CDS区和全基因组上的分布情况(图7),由图可知,缺失和插入的数量在全基因组上的分布随长度增加而减少,在CDS区未发现类似情况。但由两图都可看出插入和缺失数量集中在1—10 bp,其中1—3 bp最多。基于成对比对上reads的结果,检查插入的长度是否异常,针对缺失部分进行分析,共检测出1 906个结构变异。
2.4 基因组的杂合度和群体核苷酸多样性指数
杂合度()和核苷酸多样性()是反映多态性高低的指标(图8)。将reads比对到参考基因组,识别三江黄牛19 487 444个杂合SNPs,其整个基因组的杂合度为7.6×10-3,说明三江黄牛品种的遗传多样性较高。三江黄牛全基因组为0.0039,说明遗传多样性较为丰富。
2.5 基因组Tajima'D和theta W指数
群体Tajima'D值是目标DNA序列在进化过程中是否遵循中性进化模型,导致D值为负可能是搭载效应[36]。三江黄牛群体Tajima'D为-0.06832(图9),说明群体中存在不平衡的选择。theta W是反映群体多态性的指标,是群体在核苷酸多态性水平上偏离中性进化且处于突变-漂移平衡的理想模型,三江黄牛基因组theta W为0.0040(图9),说明三江黄牛群体遗传多样性较为丰富。
图7 Indels在全基因组上的长度分布
图8 indels在CDS上的长度分布
图9 (a)全基因组的杂合率H;(b)全基因组的多态性指标Pi;(c)全基因组的多态性指标theta W;(d)全基因组的Tajima'D
在研究中,笔者使用Illumina 2000测序平台对濒危三江黄牛品种进行了全基因组测序,三江黄牛品种数量少,选择能够代表品种的个体进行测序尤为重要,为避免个体差异,在较低成本下聚集更多个体的信息来反映三江黄牛品种群体遗传多样性情况,因此将50个体采用混合DNA样本进行测序。测序获得三江黄牛基因组77.8G净数据,86.55%的reads、81.61%的成对reads、86.51%的碱基、81.59%的成对碱基比对到参考基因组,测序深度为25.32×,覆盖率达99%以上,具有较高的单碱基正确性,与先前对普通牛测序的结果相比[8,11],测序深度较高,检测到的变异充分可信[15]。
通过GATK分析,在29条常染色体和X染色体上共发现20 477 130个SNPs位点和1 355 308个indels,三江黄牛种群数量较少,仅2 000多头,SNP数量较多,说明该品种遗传多样性较为丰富。总SNP中,杂合SNPs 19 487 444个(95.17%),纯合SNPs 989 686个(4.83%),与Shin等[37]测序10个韩国公牛所得到2 234 514个(90.3%)杂合SNPs和239 370个(9.7%)的纯合SNPs结果相似。纯合/杂合SNPs的比为1﹕19.7,明显低于Kawahara-Mik等[10]研究的日本牛(1﹕1.2)和Choi等[12]研究的韩牛(1﹕1.92)。从测序角度阐述纯和SNP是该混和样本中的所有样品在这个位点都是同一个碱基型且和参考基因组一致,杂合SNP是这个位点在所有混和样品里有多个碱基型,三江黄牛SNP纯合比率低,杂合度高,推测可能测序的50个体之间变异差异较大。还可能是三江黄牛选育程度低,近亲繁殖的概率低,与其他牛之间的基因交流较多,本身特异的功能基因正在丢失,保护三江黄牛品种显得尤为重要。测序所得indels大约占总变异(indels和SNPs)的5.3%,稍高于Kawahara-Mik等[10]和Choi等[12]研究的结果。三江黄牛转换/颠换值为2.21,高于Choi等[16]研究韩国牛牛种所得到2.1。将SNPs和indels比对到数据库发现2 147 988个新的SNPs和90 180个新的indels,分别占总SNPs的2.4%和总indels的6.7%,推测可能由于近年来随着基因组测序的发展,发现了较多新的SNPs及indels,数据库越来越完善,使比对上的比例明显增大,新发现的逐渐变少。大多数indels的长度较短,缺失的长度范围在1—29 kb,插入的范围在1—44 kb,缺失和插入的数量集中在1—10 bp,其中1—3 bp最多,从人类基因组数据上也观察到类似现象[38]。三江黄牛数据中,接近84.7%插入和79.6%缺失长度小于3 kb。29个常染色体上检测到的SNPs和indels与染色体长度成正比,结果符合预期,其中X染色体突变率最低,为4.33%,在小种群研究上,X染色体相比常染色体有较低的突变率[39]。
通过Ensemble数据库对nsSNP注释得到9 017个基因,与Eck等[8]报道的结果相一致,高于Kawahara-Mik等[10]研究的日本牛和Lee等[14]研究的韩国牛。非同义SNP注释发现567个与经济性状相关的基因,其中肉质、抗病、产奶、生长、生殖等相关的基因分别为471、77、21、10、8个。三江黄牛主要是供耕地役用,近年来逐渐向役肉兼用方向发展,研究中一些肉质相关基因中在其他黄牛品种上已有报道,例如脂肪酸结合蛋白4(FABP4)的突变与棕榈油酸肌肉内脂肪含量相关[40],加压素Ⅱ受体(UTS2R)突变与骨骼肌脂肪堆积相关[41],钙蛋白酶1(CAPN1)突变与阿伯丁安格斯牛肉嫩度有关[42],和基因被发现与内洛尔、荷斯坦黑白花、安格斯、夏洛来、海福特、西门塔尔牛牛肉的脂肪含量有关[34,43],肌联蛋白基因()也被发现是影响肉质重要的候选基因[44],在本研究中、等肉质相关基因上分布了较多nsSNP(>10)。还有一些与发育和疾病相关的基因,如和基因与荷斯坦牛育种值相关[45],蛋白激酶()基因与早期胚胎发育相关[46],Y连锁肽重复序列结构域()在公牛精子发生过程中发挥关键作用[47],性别决定区Y()是检测公牛精子质量和生育能力的重要候选标记[48]。、和是与瑞士褐牛韦弗综合征疾病相关的重要候选基因[49]。哺乳动物中的色素沉淀是由于毛发或皮肤缺乏或存在黑色素引起的,影响色素合成的主要基因有酪氨酸酶蛋白1()、多巴色素互变异构酶(、丝氨酸肽酶()、黑皮素受体1()、酪氨酸酶()、信号蛋白()。在啮齿动物,黑色和黄色间的变化是由和拮抗剂所引起的。MITF基因能够调节,和基因的表达。三江黄牛毛色以黄色为主,其次为黑色和草黑色,在其他牛种上很多与颜色相关的基因在三江黄牛上也发现了,如CORIN基因发现与韩牛黄毛色有关[37],推测可能也是影响三江黄牛黄色毛的重要基因。、、、、等基因发现能够导致头发毛囊表型变化[50]。
本研究通过对三江黄牛全基因组测序得到77.8 Gb净数据,鉴定出大量的遗传变异,说明三江黄牛遗传多样性较为丰富,为进一步研究三江黄牛品种的遗传特性提供基因组数据支持。
[1] 孙福勇, 刘君. 三江黄牛的生态分布及其品种特点. 草业与畜牧, 2009(9): 51-52.
SUN F Y, LIU J. Distribution and ecological features of Sanjiang cattle breed., 2009(9): 51-52.(in Chinese)
[2] 陈智华, 顾磊, 钟金城. 三江黄牛 Bola-DRB3 基因第二外显子的 PCR-RFLP 多态性研究. 西南民族大学学报(自然科学版), 2008, 33(4): 782-787.
CHEN Z H, GU L, ZHONG J C. Study on the polymorphism of the Bola-DRB3 gene exon 2 in the Sanjiang Cattle by PCR-RFLP method.(), 2008, 33(4): 782-787. (in Chinese)
[3] HOLLEY R W, EVERETT G A, MADISON J T, ZAMIR A. Nucleotide sequences in the yeast alanine transfer ribonucleic acid., 1965, 240(5): 2122-2128.
[4] FRESCO J R, ADAMS A, ASCIONE R, HENLEY D, LINDAHL T. Tertiary structure in transfer ribonucleic acids//Cold Spring Harbor Laboratory Press, 1966, 31: 527-537.
[5] CELANDER D W, CECH T R. Visualizing the higher order folding of a catalytic RNA molecule., 1991, 251(4992): 401-407.
[6] SANGER F, NICKLEN S, COULSON A R. DNA sequencing with chain-terminating inhibitors., 1977, 74(12): 5463-5467.
[7] MAXAM A M, GILBERT W. Sequencing end-labeled DNA with base-specific chemical cleavages., 1979, 65(1): 499-560.
[8] ECK S H, BENET-PAGÈS A, FLISIKOWSKI K, MEITINGER T, FRIES R, STROM T M. Whole genome sequencing of a single Bos taurus animal for single nucleotide polymorphism discovery., 2009, 10(8): R82.
[9] GIBBS R A, BELMONT J W, Hardenbol P, WILLIS T D, YU F L, YANG H M, CHANG L Y, HUANG W, LIU B, SHEN Y, et al. The international HapMap project., 2003, 426(6968): 789-796.
[10] KAWAHARA-MIKI R, TSUDA K, Shiwa Y, ARAI-KICHISE Y, MATSUMOTO T, KANESAKI Y, ODA S, EBIHARA S, YAJIMA S, YOSHIKAWA H, KONO T. Whole-genome resequencing shows numerous genes with nonsynonymous SNPs in the Japanese native cattle Kuchinoshima-Ushi., 2011, 12(1): 103.
[11] STOTHARD P, CHOI J W, BASU U, SUMNER-THOMSON J M, MENG Y, LIAO X, MOORE S S. Whole genome resequencing of black Angus and Holstein cattle for SNP and CNV discovery., 2011, 12(1): 1.
[12] CHOI J W, CHUNG W H, LEE K T, LEE K T, CHOI J W, JUNG K S, CHO Y, KIM N, KIM T H. Whole genome resequencing of Heugu (Korean Black Cattle) for the genome-wide SNP discovery., 2013, 33(6): 715-722.
[13] CHOI J W, LIAO X, PARK S, JEON H J, CHUNG W H, STOTHARD P, PARK Y S, LEE J K, LEE K T, KIM S H, OH J D, KIM N, KIM T H, LEE H K, LEE S J. Massively parallel sequencing of Chikso (Korean brindle cattle) to discover genome-wide SNPs and InDels., 2013, 36(3): 203-211.
[14] LEE K T, CHUNG W H, LEE S Y, CHOI J W, KIM J, LIM D, LEE S,JANG G W, KIM B, CHOY Y H, LIAO X, STOTHARD P, MOORE S S, LEE S H, AHN S, KIM N, KIM T H. Whole-genome resequencing of Hanwoo (Korean cattle) and insight into regions of homozygosity., 2013, 14(1): 519.
[15] CHOI J W, LIAO X, STOTHARD P, CHUNG W H, JEON H J, MILLER S P, CHOI S Y, LEE J K, YANG B, LEE K T, HAN K J, KIM H C, JEONG D, OH J D, KIM N, KIM T H, LEE H K, LEE S J. Whole-genome analyses of Korean native and Holstein cattle breeds by massively parallel sequencing., 2014, 9(7): e101127.
[16] CHOI J W, CHOI B H, LEE S H, LEE S S, KIM H C, YU D, CHUNG W H, LEE K T, CHAI H H, CHO Y M, LIM D. Whole-genome resequencing analysis of hanwoo and yanbian cattle to identify genome-wide SNPs and signatures of selection., 2015, 38(5): 466.
[17] SASAKI S, WATANABE T, NISHIMURA S, SUGIMOTO Y. Genome-wide identification of copy number variation using high-density single-nucleotide polymorphism array in Japanese Black cattle., 2016, 17(1): 1.
[18] DAETWYLER H D, CAPITAN A, PAUSCH H, STOTHARD P, VAN BINSBERGEN R, BRØNDUM R F, LIAO X, DJARI A, RODRIGUEZ S C, GROHS C, ESQUERRÉ D, BOUCHEZ O, ROSSIGNOL M N, KLOPP C, ROCHA D, FRITZ S, EGGEN A, BOWMAN P J, COOTE D, CHAMBERLAIN A J, ANDERSON C, VANTASSELL C P, HULSEGGE I, GODDARD M E, GULDBRANDTSEN B, LUND M S, VEERKAMP R F, BOICHARD D A, FRIES R, HAYES B J. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle., 2014, 46(8): 858-865.
[19] QANBARI S, PAUSCH H, JANSEN S, SOMEL M, STROM T M, FRIES R, NIELSEN R, SIMIANER H. Classic selective sweeps revealed by massive sequencing in cattle., 2014, 10(2): e1004148.
[20] LI H, DURBIN R. Fast and accurate short read alignment with Burrows-Wheeler transform., 2009, 25(14): 1754-1760.
[21] MCKENNA A, HANNA M, BANKS E, SIVACHENKO A, CIBULSKIS K, KERNYTSKY A, GARIMELLA K, ALTSHULER D, GABRIEL S, DALY M, DEPRISTO M A. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data., 2010, 20(9): 1297-1303.
[22] CHEN K, WALLIS J W, MCLELLAN M D LARSON D E, KALICKI J M, POHL C S, MCGRATH S D, WENDL M C, ZHANG Q, LOCKE D P, SHI X, FULTON R S, LEY T J, WILSON R K, DING L, MARDIS E R. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation., 2009, 6: 677-681.
[23] HE W, ZHAO S, LIU X, DONG S, LV J, LIU D, WANG J, MENG Z. ReSeqTools: an integrated toolkit for large-scale next-generation sequencing based resequencing analysis., 2013, 12(4): 6275-6283.
[24] 1000 genomes project consortium, ABECASIS G R, AUTON A, BRODKS L D, DEPRISTO M A, DURBIN R M, HANDSAKER R E, KANG H M, MARTH G T, MCVEAN G A. an integrated map of genetic variation from 1,092 human genomes., 2012, 491(7422): 56-65.
[25] STENSON P D, BALL E V, MORT M, PHILLIPS A D, SHIEL J A, THOMAS N S, ABEYSINGHE S, KRAWCZAK M, COOPER D N. Human gene mutation database (HGMD®): 2003 update., 2003, 21(6): 577-581.
[26] FLICEK P, AMODE M R, BARRELL D, BEAL K, BRENT S, CARVALHOSILVA D, CLAPHAM P, COATES G, FAIRLEY S, FITZGERALD S, GIL L, GORDON L, HENDRIX M, HOURLIER T, JOHNSON N, KÄHÄRI A K, KEEFE D, KEENAN S, KINSELLA R, KOMOROWSKA M, KOSCIELNY G, KULESHA E, LARSSON P, LONGDEN L, MCLAREN W, MUFFATO M, OVERDUIN B, PIGNATELLI M, PRITCHARD B, RIAT H S, RITCHIE G R S, RUFFIER M, SCHUSTER M, SOBRAL D, TANG Y A, TAYLOR K, TREVANION S, VANDROVCOVA J, WHITE S, WILSON M, WILDER S P, AKEN B L, BIRNEY E, CUNNINGHAM F, DUNHAM L, DURBIN R, FERNÁNDEZ, SUAREZ X M, HARROW J, HERRERO J, HUBBARD T J P, PARKER A, PROCTOR G, SPUDICH G, VOGEL J, YATES A, ZADISSA A, SEARLE S M J. Ensembl 2012., 2011: gkr991.
[27] HUANG D W, SHERMAN B T, LEMPICKI R A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources., 2009, 4(1): 44-57.
[28] HIENDLEDER S, THOMSEN H, REINSCH N, BENNEWITZ J, LEYHE-HORN B, LOOFT C, XU N, MEDJUGORAC I, RUSS I, KÜHN C, BROCKMANN G A, BLÜMEL J, BRENIG B, REINHARDT F, REENTS R, AVERDUNK G, SCHWERIN M, FÖRSTER M, KALM E, ERHARDT G. Mapping of QTL for body conformation and behavior in cattle., 2003, 94(6): 496-506.
[29] NKRUMAH J D, LI C, YU J, HANSEN C, KEISLER D H, MOORE S S. Polymorphisms in the bovine leptin promoter associated with serum leptin concentration, growth, feed intake, feeding behavior, and measures of carcass merit., 2005, 83(1): 20-28.
[30] HU Z L, FRITZ E R, REECY J M. AnimalQTLdb: a livestock QTL database tool set for positional QTL information mining and beyond., 2007, 35(suppl. 1): D604-D609.
[31] HU Z L, REECY J M. Animal QTLdb: beyond a repository. A public platform for QTL comparisons and integration with diverse types of structural genomic information., 2007, 18(1): 1-4.
[32] THOMAS M G, ENNS R M, SHIRLEY K L, GARCIA M D, GARRETT A J, SILVER G A. Associations of DNA polymorphisms in growth hormone and its transcriptional regulators with growth and carcass traits in two populations of Brangus bulls., 2007, 6(1): 222-237.
[33] BAGNATO A, SCHIAVINI F, ROSSONI A, MALTECCA C, DOLEZAL M, MEDUGORAC I, SÖLKNER J, RUSSO V, FONTANESI L, FRIEDMANN A, SOLLER M, LIPKIN E. Quantitative trait loci affecting milk yield and protein percentage in a three-country Brown Swiss population., 2008, 91(2): 767-783.
[34] FERRAZ J B S, PINTO L F B, MEIRELLES F V, ELER J P, DE REZENDE F M, OLIVEIRA E C, ALMEIDA H B, WOODWARD B, NKRUMAH D. Association of single nucleotide polymorphisms with carcass traits in Nellore cattle., 2009, 8(4): 1360-1366.
[35] ALBERS C A, LUNTER G, MACARTHUR D G, MCVEAN G, OUWEHAND W H, DURBIN R. Dindel: accurate indel calls from short-read data., 2011, 21(6): 961-973.
[36] NIELSEN R. Molecular signatures of natural selection., 2005, 39: 197-218.
[37] Shin Y, Jung H J, Jung M, Yoo S I, Subramaniyam S, Markkandan K, Kang J M, Rai R, Park J, Kim J J. Discovery of gene sources for economic traits in Hanwoo by whole- genome resequencing., 2016, 29(9): 1353-1362.
[38] FUJIMOTO A, NAKAGAWA H, HOSONO N, NAKANO K, ABE T, BOROEVICH K A, NAGASAKI M, YAMAGUCHI R, SHIBUYA T, KUBO M, MIYANO S, NAKAMURA Y, TSUNODA T. Whole- genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing., 2010, 42(11): 931-936.
[39] MAKOVA K D, LI W H. Strong male-driven evolution of DNA sequences in humans and apes., 2002, 416(6881): 624-626.
[40] HOASHI S, HINENOYA T, TANAKA A, OHSAKI H, SASAZAKI S, TANIGUCHI M, OYAMA K, MUKAI F, MANNEN H. Association between fatty acid compositions and genotypes of FABP4 and LXR-alpha in Japanese Black cattle., 2008, 9(1): 1.
[41] JIANG Z, MICHAL J J, TOBEY D J, WANG Z, MACNEIL M D, MAGNUSON N S. Comparative understanding of UTS2 and UTS2R genes for their involvement in type 2 diabetes mellitus., 2008, 4(2): 96-102.
[42] GILL J L, BISHOP S C, MCCORQUODALE C, WILLIAMS J L, WIENER P. Association of selected SNP with carcass and taste panel assessed meat quality traits in a commercial population of Aberdeen Angus-sired beef cattle., 2009, 41(1): 36.
[43] BUCHANAN F C, FITZSIMMONS C J, VAN KESSEL A G, Thue T D, Winkelman-Sim D C, Schmutz S M. Association of a missense mutation in the bovine leptin gene with carcass fat content and leptin mRNA levels., 2002, 34(1): 105-116.
[44] Watanabe N, Satoh Y, Fujita T, Ohta T, Kose H, Muramatsu Y, Yamamoto T, Yamada T. Distribution of allele frequencies atand*between Japanese Black and four other cattle breeds with differing historical selection for marbling., 2011 4: 10.
[45] LIEFERS S C, VEERKAMP R F, PAS M F W, DELAVAUD C, CHILLIARD Y, LENDE T A. missense mutation in the bovine leptin receptor gene is associated with leptin concentrations during late pregnancy., 2004, 35(2): 138-141.
[46] YANG Q E, OZAWA M, ZHANG K, JOHNSON S E, EALY A D. The requirement for protein kinase C delta (PRKCD) during preimplantationbovine embryo development., 2014, 28(4): 482-490.
[47] LIU Y, QIN X, SONG X Z, JIANG H Y, SHEN Y F, DURBIN K J, LIEN S, KENT M P, SODELAND M, REN Y R, ZHANG L, SODERGREN E, HAVLAK P, WORLEY K C, WEINSTOCK G M, GIBBS R A. Bos taurus genome assembly., 2009, 10(1): 1.
[48] MISHRA C, PALAI T K, SARANGI L N, PRUSTY B R, MAHARANA B R. Candidate gene markers for sperm quality and fertility in bulls., 2013, 6: 905-910.
[49] McClure M, Kim E, Bickhart D, Null D, Cooper T, Cole J, Wiggans J, Ajmone-Marsan P, Colli L, Santus E, Liu G, Schroeder S, Matukumalli L, Tassell C V, Sonstegard T. Fine mapping for Weaver syndrome in Brown Swiss cattle and the identification of 41 concordant mutations across NRCAM, PNPLA8 and CTTNBP2., 2013, 8(3): e59251.
[50] Van den Bossche J, Malissen B, Mantovani A, De Baetselier P J A, Ginderachter V. Regulation and function of the E-cadherin/catenin complex in cells of the monocyte- macrophage lineage and DCs., 2012, 119(7): 1623-1633.
(责任编辑 林鉴非)
The Whole Genome Data Analysis of Sanjiang Cattle
SONG Nana1,2, ZHONG Jincheng1,2, CHAI Zhixin1,2, WANG Qi1,2, HE Shiming3,WU Jinbo3, JIAN Shanglin4, RAN Qiang5, MENG Xin5, HU Hongchun4
(1Key Laboratory of Animal Genetics and Breeding of State Ethnic Affairs Commission and Ministry of Education, Southwest University for Nationalities, Chengdu 610041;2Institute of Tibetan Plateau Research, Southwest University for Nationalities, Chengdu 610041;3Animal Husbandry Science Institute of ABa Autonomous Prefecture, Wenchuan 623000, Sichuan;4Animal Husbandry and Veterinary Station of Aba Autonomous Prefecture, Wenchuan 623000, Sichuan;5Animal Husbandry and Veterinary Station of Wenchuan, Wenchuan 623000, Sichuan)
【Objective】The objective of this paper is to study the genetic diversity of Sanjiang cattle group and discuss its genetic variation at the genome level.【Method】Fifty individual genomic DNA were extracted and mixed with isocratic and equal volumes, then the DNA pool of the mixed samples were constructed. Genomic DNA was interrupted randomly by using CovarisS2 and the DNA fragments of 500 bp were recovered by electrophoresis, and DNA library was constructed at last. Finally, the sequencing data were obtained through the Illumina HiSeq 2000. The short reads were mapped to bovine reference genome (UMD 3.1) to detect the genomic mutations of Sanjiang cattle using BWA software. The analysis of the re-sequencing data was implemented using SAMtools, Picard-tools, GATK, Reseqtools, the SNPs and indels were annotated based on the Ensembl, DAVID and dbSNP database. 【Result】A total of 77.8 Gb of sequence data were generated by whole-genome sequencing analysis, 99.31% of the reference genome sequence was covered with a mapping depth of 25.32-fold, 778 403 444 reads and 77 840 344 400 bases were obtained, of which 673 670 505 reads and 67 341 451 555 bases covered 86.55% and 86.51% of bovine reference genomes (UMD 3.1) respectively, paired-end reads mapping were 635 242 898 (81.61%), paired-end bases mapping were 63 512 636 924 (81.59%). A total of 20 477 130 SNPs and 1 355 308 small indels were identified, of which 2 147 988 SNPs (2.4%) and 90 180 (6.7%) indels were found to be new. Of the total number of SNPs, 989 686 (4.83%) homozygous SNPs and 19 487 444 (95.17%) heterozygous SNPs were discovered, homozygous/heterozygous SNPs was 1﹕19.7. Transitions were 14 800 438, transversions were 6 680 058, transition/transversion (TS/TV) was 2.215. SNPs of splice site mutations were 727,the number of SNPs which the start codon converts into no stop codon were 117, SNPs of premature stop codon were 530, the number of SNPs which stop codon converts into no stop codon were 88. A total of 57 621 non-synonymous SNPs and 83 797 synonymous SNPs were detected, the ratio was 0.69. Non-synonymous SNPs were detected in 9 017 genes, 567 genes were assigned as trait-associated genes, which included meat quality, disease resistance, milk production, growth rate, fecundity with the number of 471, 77, 21, 10, and 8 respectively, the function of some genes were overlap. In detection of indels, 693 180 (51.15%) were deletions and 662 148 (48.85%) were insertions, 161 198 (11.89%) were homozygous and 1 194 110 (88.11%) were heterozygous. Most variations were located in intergenic regions and introns. Heterozygosity (), nucleotide diversity () and theta W of Sanjiang cattle genome-wide were 7.6×10-3, 0.0039, 0.0040, respectively, which indicated that Sanjiang cattle have an abundant genetic diversity. The Tajima'D of Sanjiang cattle population was -0.06832, which speculated that the population exists an unbalanced selection.【Conclusion】Results of this research will provide valuable genomic data for further investigations of the genetic mechanisms underlying traits of interest and protection of Sanjiang cattle breeds genetic diversity.
Sanjiang cattle; genome; next generation sequencing; SNP; indel
2016-06-12;接受日期:2016-11-07
四川省科技厅项目(2015JY0248)、中央高校服务民族地区发展项目(2015NFW01)
宋娜娜,Tel:13688499824;E-mail:songnana28@126.com。通信作者钟金城,E-mail:zhongjincheng518@126.com