Ruilin Li,Xiaoyu He,Chuangchuang Dai,Haidong Zhu ,Xianyu Lang ,Wei Chen ,Xiaodong Li,Dan Zhao ,Yu Zhang ,Xinyin Han ,Tie Niu ,Yi Zhao ,Rongqiang Cao ,Rong He,Zhonghua Lu ,Xuebin Chi,Weizhong Li*,Beifang Niu *
1 Computer Network Information Center,Chinese Academy of Sciences,Beijing 100190,China
2 University of Chinese Academy of Sciences,Beijing 100190,China
3 Center of Scientific Computing Applications&Research,Chinese Academy of Sciences,Beijing 100190,China
4 Guizhou University School of Medicine,Guiyang 550025,China
5 J.Craig Venter Institute,La Jolla,CA 92037,USA
Abstract The accelerating growth of the public microbial genomic data imposes substantial burden on the research community that uses such resources.Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential.However,existing clustering algorithms perform poorly on long genomic sequences.In this article,we present Gclust,a parallel program for clustering complete or draft genomic sequences,where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays(SSAs).Moreover,genome identity measures between two sequences are calculated based on their maximal exact matches(MEMs).In this paper,we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets.Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust.We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
KEYWORDS Microbial genomeclustering;Parallelization;Sparse suffix array;Maximal exact match;Segment extension
The first complete bacterial genome was published more than 20 years ago[1].During the last decade,the number of sequenced genomes has been growing very rapidly,mainly due to the development of low cost and high throughput DNA sequencing technologies[2].As of the beginning of 2018,the Genomes OnLine Database(GOLD;https://gold.jgi.doe.gov/)has collected data from more than 180 thousand sequencing projects.Most genomic studies have been focusing on microbial species,especially bacteria.Thus,the growth of publically availablebacterial genomeshavebecomesubstantial and the amount of such data pose significant challenges for researchers interested in using these resources efficiently.In addition,these databases host a large portion of redundant genomes from the same or closely related species and the redundancy has to be reduced.
Clustering algorithms are key for redundancy reduction and there have been many of them available including CDHIT[3],UCLUST[4],DNACLUST[5],canopies[6],Linclust[7],CLOSET[8],and SynerClust[9],among others.Most of them are efficient at clustering DNA sequences from hundreds to a few thousands of basepairs,including expressed sequence tags(ESTs),short reads from the next generation sequencers,and amplicon sequences,but less efficient on longer sequences.In fact,these programs are not able to handle typical bacterial genomes of mega basepairs in size.The performances and features of these clustering programs have been reviewed in many publications,such as this recent report[10].
BLASTclust from the BLAST package[11]can be used for clustering long sequences,but it is too slow to process largescale genomic sequences.Other genome alignment tools,such as MUMmer[12],BLASTZ[13],and Mauve[14],are also incapable of clustering large-scale genomic sequences,because they were originally designed to assess genomic variation and rearrangements by pairwise or multiple alignment of a small number of genomes.
Sincesequenceclusteringistime-consuming,most clustering programs use different algorithms to improve performance.For example,CD-HIT[3]uses a heuristic based on short word filtering to reducecomputational load.Besideshort word index tables or hash tables,suffix trees and suffix arrays have also been widely used for sequence comparison.For example,Maldeet al.[15]introduced an ESTclusteringalgorithm,where sub-quadratic time complexity was achieved by using suffix arrays.Another strategy to reduce the overall computational time in clustering a large dataset is through parallelization.For example,a multi-threaded function was introduced in an enhanced version of CD-HIT[16],and this was able to achieve quasi-linear speedup when using up to 8 cores.
In this article,we introduce Gclust,a fast program for clustering microbial genomic sequences.A key algorithm in Gclust for sequence comparison is based on sparse suffix arrays(SSAs).Our method has several key features.First,it is specially designed for clustering very long sequences,of up to typical prokaryotic genomes.Genomic sequences are compared using extended maximal exact matches(MEMs),which are used to calculate genome sequence identity.Second,a fast algorithm was implemented in building SSAs and querying SSAs to identify MEMs.Third,Gclust supports multithreaded parallel computing.
We used four genomic sequence datasets(Table 1)from NCBI to test the performance of Gclust.The first three datasets contain viral,archaeal,and fungal genomic data(ftp://ftp.ncbi.nlm.nih.gov/refseq/release/).The bacterial genomes in the fourth dataset were selected from the NCBI RefSeq genome list(ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_sum mary_refseq.txt)according to the following criteria:(1)genomes assembled at only contig level were excluded;(2)all the NCBI reference genomes and representative genomes were included;and(3)the remaining genomes were included if assembled to complete genomes or chromosomes.
Table 1 Detailed information for the four genomic datasets analyzed using Gclust
Gclust is implemented in C and C++and POSIX threads programming(https://computing.llnl.gov/tutorials/pthreads/)isused for parallelization.Wealso used the SeqAn[17]and libdivsufsort libraries(https://github.com/y-256/libdivsufsort)in the implementation.
A key problem in sequence comparison is pattern matching between sequences.Similar sequences can be detected by common fragments.MEMs are exact matches between two strings that cannot be extended without a gap[18].The classical approach to find MEMs between a pair of sequences is to use suffix trees and search for maximal matching blocks[19].However,suffix trees require about ten to twenty times the memory of the source text, even in optimal implementations.
In order to reduce the memory cost in finding MEMs,Manber and Myers[20]adopted suffix arrays(SAs),a data structure that is a sorted list of all the suffixes of a large text.Later,enhanced suffix arrays(ESAs)replaced suffix trees,since the use of suffix trees often bottlenecked large scale applications[21].Khan et al.[18]introduced another method,where SSAs were used to find MEMs.Recently,another SSA-based tool,essaMEM,has been reported[22].Compared to full-text suffix arrays,sparse suffix arrays store everyK-th suffix of the text and occupy much less memory.
The variable declaration is as follows:d:[s...e]andq:[l...r]are the intervals of query sequenceP.SA(i)is thei-th value of the suffix array.sn(p)is the serial number ofP.Location(SA(i))is the serial number of the sequence which includes thei-th suffixSA(i).LCPmeans the longest common prefix,andLCP[i]is thei-th value of theLCParray.
In order to find MEMs using SSAs,we adapted the method suggested by Khan et al.[18].MEMs are found according to two intervals(d:[s...e]andq:[l...r],whereq:[l...r]is a subinterval ofd:[s...e])and is obtained by a top-down binary search.There are two cases whereby MEMs betweenSandPcan be found(whereSis the reference sequence andPis the query sequence).The first case is when at mostL-(K-1)characters in length are found and the match can be recovered by scanning the region between sparsely indexed suffix positions.HereKis the sparse step of the suffix array andLis the only constraint of the minimum length of MEMs.The other case is when at leastd≥L-(K-1)matched characters are found,and the two intervals are used to determine the length and position of MEMs.
Gclust is a greedy incremental clustering algorithm for genomic sequences.The algorithm is explained with pseudo-codes(Table 2).
The main Gclust parallel algorithm includes(1)sorting the input genome sequences from long to short and(2)dividing the input genome sequences into blocks based on the memory occupied by suffix arrays and process these blocks one after another.
For each block,the following steps are performed.(a)one suffix array isconstructed beforeclustering using therepresentative sequences.The longest sequence is automatically classif ied as the first representative sequence within the block.(b)Each remaining query sequence is searched in the suffix array and iscompared to theexisting representative sequenceslonger than it.The comparison is made by attempting to extend MEMs.If the MEM-based sequence identity satisf ies the user-specified clustering threshold,the query sequence is considered redundant,or is otherwise a new representative sequence.(c)A new suffix array is reconstructed using all the representative sequences found in this block.This new suffix array is used in comparing sequences in the remaining blocks to therepresentativesequencesin thisblock in parallel to identify redundant sequences.(d)The main loop of the algorithm processes the next block with steps(a)through(c)until all blocks are processed.
Given sequencesAandBand a setΣof matched segments between them,the matched sequence problem is to compute a set of non-intersecting matchesΣ’that are all sub matches ofΣ,that maximize the amount of sequence covered by thematched segments.Halpern et al.[23]introduced an efficient method for ref ining a set of matched segments in which the projections of resulting segments onto each sequence were disjoint or identical.However,the method is time-consuming.Since a MEM spans the same length on the two sequences being compared,it is less complicated to refine the MEMs.Deloger et al.[24]designed an approximate solution for computing the maximal unique matches index(MUMi).Here,we used a similar solution to refine MEMs and to compute the sequence identity using refined MEMs(Figure 1).
Table 2 Pseudo-codes of the Gclust algorithm
The procedures are as follows.(1)MEMs whose coordinates on are presentative sequence(or query sequence)are completely included in a larger MEM are removed,e.g.,MEM 1 and MEM 2 in Figure 1A.(2)MEMs whose coordinates on a representative sequence(or query sequence)are completely included in two neighboring MEMs are removed,e.g.,MEM 2 in Figure 1B.(3)The remaining MEMs of a representative sequence(or query sequence)that exhibit partial overlap are trimmed.To do this,MEMs are sorted according to their beginning positions on a representative sequence(or query sequence).Starting from the last element of thelist,each MEM is compared to its leftward neighbor.In cases of overlap,the left end of the current MEM is trimmed,e.g.,MEM 1 in Figure 1C,i.e.,its end coordinates on both the representative and query sequences are shifted rightward so that no overlap exists on the representative sequence(or query sequence)(Figure 1C).(4)The MEMs retained are extended after refinement using the given score matrix(Figure 1D).Whilecomputing the MEM extension,the score matrix isused to give a reward or penalty.We determine the identity between two sequences based on the extended MEMs(eMEMs).This eMEM identity(eMEMi)is calculated using the following formula:
Nmatchis the number of matched nucleotides within extended MEMs andLqueryis the length of shorter sequences.The lengths of the representative sequences are always longer than that of the query sequences,because the sequences are sorted by length in descending order.Thus,eMEMi is used to measure the identity between the representative and the query sequence.Formula(1)relies on the choice ofminlen,which is the minimal size of the exact matches to be included in MEMs.We extend the MEMs by using a function from the SeqAn[17]library.In SeqAn,alignments allow the insertion of gaps into sequences through extension.SeqAn uses a seed-and-extend algorithm to realize extension.In the ungapped cases,matches and mismatches are assigned with scores;these scores are then summed up and the running total will drop when one or more mismatches occur.In the gapped cases,gaps will be created with negative scores(http://seqan.readthedocs.io/en/master/Tutoral/Algorithms/Seed Extension.html#tutorial-algorithms-seed-extension).While computing the MEM extension,the score matrix is used to give a reward or penalty.Theminlenvalue is determined empirically.It has been reported that under the uniform Bernoulli model,no maximum unique matches(MUMs)longer than 21 are expected by chance in 1.7-Mbp random genomes[25].This suggests that aminlenvalue of 21 can avoid many spurious matches.
For a given sparse stepK,there are two major drawbacks in finding MEMs using sparse suffix arrays:(1)the need to run the search procedureKtimes;and(2)a complicated search procedure is required when the MEM is shorter thanK.In Gclust,we only useK=1 for small MEMs to decrease the cost of repeated searches.For longer MEMs,we useK=2—4.Since MEMs shorter than 21 are unlikely to find redundant sequences,our choice ofKavoids the second drawback.For a large MEM or a clustering of 100%identity,a largerKwill shorten the time in constructing the suffix array with little impact on the efficiency of MEM searching.
However,unlike other mapping programs in which the suffix arrays for reference genomes are constructed only once prior to mapping,in Gclust,the suffix arrays for each block areconstructed in real time.Therefore,it isimportant to acceleratethe sorting process of suffix arrayswithin theblock,especially when clustering at 100%identity,when the construction of suffix arrays becomes the most time-consuming step.
According to the greedy incremental clustering algorithm,a sequenceSonly need to be searched against the sequences longer thanSin the pre-constructed the suffix array for that block.Here we implemented a modif ied MEM filtering algorithm(collectMEMs).This approach avoids the need to scan up toPcharacters to the left of the match and then discarding the MEMs located in those sequences that are shorter thanS(Table S1).
The algorithm to find MEMs using sparse suffix arrays by Khan et al.[18]relies on a traverse algorithm to match up toL-(K-1)characters and to find the longest match.If a match of length≥L-(K-1)characters can be obtained,the suffix array interval d:[s...e]corresponding to matches of length≥L-(K-1 )and the interval q:[l...r]corresponding to the longest match are used by the collectMEMs algorithm to find right maximal matches.Each right maximal match must be verif ied for the left by scanning up to K characters using the findL algorithm to the left of the match.In Gclust,for the sorted sequences S:[1...N]in one block,given the query sequence P,we modif ied the collectMEMs algorithm to discard the MEMs located on thesequences shorter than S.
Parallelization techniques used
Three different explicit parallel extensions to the C language are the Message-Passing Interface(MPI),POSIX threads(Pthreads),and OpenMP[26].MPI is used for distributedmemory programming.While OpenMPand Pthreadsare both APIs for shared-memory programming,Pthreads is more f lexible than OpenMP.Due to the advantages of using shared memory,in Gclust we adapted Pthreads to facilitate parallel processing of clustering.
The major part of the Gclust algorithm(Table 2)includes two primary alignment processes(intra-and inter-block).The main computation involves finding MEMs.Multiple query sequences need to be searched in the same suffix array.In Gclust,these are distributed to individual processors or cores.
Results and discussion
The greedy incremental clustering algorithm introduced by the enhanced version of CD-HIT[16]was implemented in Gclust for clustering genomic sequences.In Gclust,genome identity measures of two sequences are calculated based on the extension of their MEMs.We implemented an improved SSA algorithm to find these MEMs.
We tested the performance of Gclust using four Ref Seq genome datasets(viral,archaeal,fungal and bacterial genome data;Table 1).Tests were done on an Era supercomputer with a 24-core Intel(R)Xeon(R)CPU E5-2680 v3@2.50 GHz with 256-GB RAM.
Clustering performance
The cluster results for the four datasets are shown in Table 3.Genomes were clustered at 90%eMEMi.For theviral dataset,9578 sequences were clustered into 9101 clusters.38,381 archaea sequences were reduced to 16,064 non-redundant sequences,a reduction of 58%.The fungal and bacterial datasets were reduced by 13%and 6%respectively.It took Gclust less than two hours to cluster the 2-GB archaeal dataset.For the largest bacterial dataset,Gclust took 138.11 h on a single computer with 16 threads.
A comparison between Gclust and BLASTclust is shown in Table 4.Four smaller datasets,which contained subsets of the viral,archaeal,fungal and bacterial genomic data,were used to test the efficiency of Gclust relative to BLASTclust.Smaller datasets were used for this comparison to accommodate the long running time of BLASTclust.Our tests showed that Gclust was more than 35 times faster in the viral subset,more than 40 times faster in the archaeal subset and more than 300 times faster in the bacterial subset,and generated fewer clusters in all subsets except for the fungal subset.BLASTclust could not process the fungal subset since the longest sequence was beyond the limit of BLASTclust(Table 4).
Gclust applies a parallel strategy that is similar to that introduced by the multi-threaded version of CD-HIT[16].Using the viral and archaeal genomic datasets,we tested the parallelization of Gclust when using multiple compute cores(Figure 2).The greedy incremental clustering procedure used by Gclust(see Method)is intrinsically sequential,so it is not feasible to reach linear speedup with parallelization.Here,Gclust is able to achieve an eightfold speedup with 16 cores.
Minimal MEM length is a key parameter in Gclust and affects both running time and the number of clusters found.Thedefault minimal MEM length in Gclust is21.Theselectionof thisdefault valueis described in the Method.Here,using the viral genomic dataset,we tested different MEM lengths from 13 to 40 in gapped and non-gapped extension cases(Figure 3).
Table 3 Clustering results and performance of Gclust at 90%eMEMi using 16 threads
Table 4 Comparison of BLASTclust and Gclust
Figure 2 The speedup of parallel Gclust for the viral and archaeal genome datasets
In non-gapped extension cases,a sequence is rejected if its alignment score is very low,and this is much faster than gapped extension(Figure 3A).Given the same minimum MEM length,the number of clusters in gapped extension cases is always smaller than in non-gapped extension cases since the algorithm identifies more redundant sequences with gapped extension(Figure 3B).
In Gclust,finding MEMs is the most time-consuming step.We therefore adapted a fast,lightweight suffix array sorting algorithm and modif ied the search algorithm to find MEMs.To evaluate its effectiveness,we compared Gclust and MUMmer3 in finding MEMs.In all test cases,Gclust was considerably faster than MUMmer3(Table S2).
When suffix array requires too much memory,sparse suffix arrays(that use a sparse stepK)are usually used to reduce memory demand by sacrificing the accuracy of clustering.With a higherK,some redundant sequences might be missed.
However,for larger MEMs,especially given a higher clustering threshold(e.g.,100%eMEMi),sparsestepssignificantly reduce the total clustering time without sacrificing accuracy.Wetested theperformanceof Gclust with different sparsesteps at 100%eMEMi using viral and fungal genomes(Table S3)and observed shorter runtime with largerK(≤4).The number of clusters was consistent across all sparse steps.
In thispaper,wepresent an open sourceprogram for clustering microbial genomic sequences.This algorithm provides many optionsfor usersto control theclustering process,for example,theoptimal sparsestep parameterK.Weshow that our method is efficient for large-scale genomic sequences with high accuracy.We expect that our parallelization strategy can be further optimized and improved to achieve better scalability.
Figure 3 Comparison of running time and the number of clusters with different minimal MEMs
Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust.A web server for clustering useruploaded genomes isavailable at http://niulab.scgrid.cn/gclust.The four datasets for viral,archaea,fungi,and bacteria were deposited in Ref Seq of NCBI and can be accessed at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/.
BN and WL conceived the idea and designed the study.RL performed data analysis.BN and WL analyzed the results and drafted the manuscript.CD built the webserver.RL and HZ contributed the code debugging.XH,XYL,WC,XL,DZ,YZ,ZL,and XC edited and revised the manuscript.TN,YZ,RC,and RH provided technical support for the test environment.XH designed and drew the figures.All authors read and approved the final manuscript.
The authors declare that no competing interests exist.
This work was supported by the National Key R&D Program of China(Grant Nos.2018YFB0203903,2016YFC0503607,and 2016YFB0200300),the National Natural Science Foundation of China(Grant Nos.31771466 and 61702476),and the Transformation Project in Scientific and Technological Achievements of Qinghai Province,China(Grant No.2016-SF-127).This study was also supported by the Special Project of Informatization(Grant No.XXH13504-08),the Strategic Pilot Science and Technology Project (Grant No.XDA12010000),and the 100-Talents Program(awarded to BN)of the Chinese Academy of Sciences,China.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2018.10.008.
Genomics,Proteomics & Bioinformatics2019年5期