MakeHub: Fully Automated Generation of UCSC Genome Browser Assembly Hubs

2019-03-07 07:27:40 Katharina Jasmin Ho

Genomics, Proteomics & Bioinformatics, 2019, Issue 5

Katharina Jasmin Hoあ

1 University of Greifswald, Institute for Mathematics and Computer Science, 17489 Greifswald, Germany

2 University of Greifswald, Center for Functional Genomics of Microbes, 17489 Greifswald, Germany

KEYWORDS Genome annotation; Annotation visualization; RNA-seq; Genome browser

Abstract Novel genomes are today often annotated by small consortia or individuals whose background is not in bioinformatics. This audience requires tools that are easy to use. This need has been addressed by several genome annotation tools and pipelines. Visualizing the resulting annotation is a crucial step of quality control. The UCSC Genome Browser is a powerful and popular genome visualization tool. Assembly Hubs, which can be hosted on any publicly available web server, allow browsing genomes via UCSC Genome Browser servers. The steps for creating custom Assembly Hubs are well documented and the required tools are publicly available. However, the number of steps for creating a novel Assembly Hub is large. In some cases, the format of input files needs to be adapted, which is a difficult task for scientists without a programming background. Here, we describe MakeHub, a novel command line tool that generates Assembly Hubs for the UCSC Genome Browser in a fully automated fashion. The pipeline also allows extending previously created Hubs with additional tracks. MakeHub is freely available for downloading at https://github.com/Gaius-Augustus/MakeHub.

Introduction

With decreasing sequencing costs, sequencing the genomes of non-model organisms that are of interest to individuals or small research consortia has become affordable. Pipelines and tools that enable scientists with diverse backgrounds to easily annotate protein-coding genes in novel genomes have been developed and are frequently used, for example, the tools AUGUSTUS [1-5], GeneMark-ES/ET [6-8], GlimmerHMM [9], SNAP [10], and GeMoMa [11-13], as well as the pipelines BRAKER [14,15], WebAUGUSTUS [16], and MAKER [17,18]. The output of such gene prediction tools and pipelines is a table-like text file in gene transfer format (GTF) or general feature format 3 (GFF3). Visualization of predicted gene structures in context with available extrinsic evidence is a crucial step of quality control in any genome annotation project [19]. A number of genome browsers are available for this task, for example, the UCSC Genome Browser [20], JBrowse [21], and GBrowse2 [22]. While JBrowse and GBrowse2 require installation of the browser software on a server or on a local computer, the UCSC Genome Browser bypasses the requirement for software installation. Instead, it offers the opportunity of visualizing any genome through so-called locally hosted ‘Assembly Hubs’ combined with existing UCSC Genome Browser servers (e.g., at https://genome.ucsc.edu) [23].

An Assembly Hub is simply a directory that contains configuration files required by the UCSC Genome Browser as well as track data files with the data to be visualized. The steps for creating custom Assembly Hubs are well documented (http://genomewiki.ucsc.edu/index.php/Assembly_Hubs) and the required tools are publicly available. An experienced bioinformatician is able to create Assembly Hubs with ease. However, a scientist with limited programming background may find it troublesome to manually create the required configuration files, to adapt the output of gene prediction pipelines to the demands of UCSC tools for creating data tracks, and to run all required tools in the correct order.

Recently, the workflow G-OnRamp for fully automated generation of UCSC Assembly Hubs (and JBrowse instances) through the usage of Galaxy web forms became available [24]. The tool that generates UCSC Assembly Hubs within G-OnRamp is called Hub Archive Creator and works seamlessly with genome annotation output files in the G-OnRamp Galaxy framework. However, Hub Archive Creator is difficult to use as a stand-alone command line tool outside of G-OnRamp, because it relies on strict input file format consistency, which is ensured when bioinformatics tools are called inside of G-OnRamp but is not guaranteed when using the same tools in their original release form on the command line.

Therefore, we describe here the novel command line tool MakeHub for the fully automated generation of UCSC Assembly Hubs from the output of BRAKER, MAKER, GlimmerHMM, SNAP, and GeMoMa for genomes of single species.

Implementation

The MakeHub pipeline is implemented in Python and is compatible with Linux and Mac OS X x86_64 computers. The pipeline is illustrated in Figure 1. It provides a command line interface for creating fully functional Assembly Hubs from a genome file (and optionally files with gene and evidence information) and for adding tracks to existing hubs.

Genome files in FASTA format are converted to twoBit format using UCSC’s faToTwoBit. Chromosome (or contig) sizes that serve as input to tools for the later creation of bigWig and bigBed files are extracted using UCSC’s twoBitInfo. GC content information is collected from the genome with UCSC’s hgGcPercent and written into wiggle (WIG) format. Subsequently, the WIG file is converted to bigWig format using UCSC’s wigToBigWig [25]. A cytoband track that shows the location of the browser window in a target sequence is generated using the same twoBitInfo output and bedToBigBed with an automatically obtained cytoBand AutoSQL file. The genome is screened for softmasked repeat information by MakeHub and repeats are written to browser extensible data (BED) format, sorted, and then converted to bigBed format with UCSC’s bedToBigBed.
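The screening for softmasked repeats described above can be illustrated with a minimal Python sketch. It assumes that repeats are marked as lowercase letters in the FASTA file and reports every lowercase run as a BED3 interval (0-based start, exclusive end). The function names are hypothetical and the code is not taken from MakeHub itself.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def softmasked_to_bed(lines: Iterable[str]) -> List[str]:
    """Return one BED3 line per run of lowercase (softmasked) bases."""
    bed = []
    for name, seq in read_fasta(lines):
        for match in re.finditer(r"[a-z]+", seq):
            # match.start() is 0-based, match.end() is exclusive: BED coordinates.
            bed.append(f"{name}\t{match.start()}\t{match.end()}")
    return bed


example = [">chr1 test", "ACGTacgtNN", "acGT", ">chr2", "ggggAAAA"]
bed_lines = softmasked_to_bed(example)
print("\n".join(bed_lines))
```

Because the intervals are produced in sequence order, they are already sorted by start position within each sequence. Sorting by sequence name is still needed before passing the file to bedToBigBed.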

Hub configuration files, most prominently the hub/hub.txt file, are created and initialized upon completion of genome processing.
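To give an impression of what such configuration files contain, the following Python sketch renders a hub.txt and a genomes.txt stanza. The setting names (hub, genomesFile, trackDb, twoBitPath, and so on) are standard UCSC hub settings. The directory layout, labels, e-mail address, and function names are assumptions made for illustration and do not necessarily match what MakeHub writes.

```python
def render_stanza(settings):
    """Render one UCSC hub stanza as 'key value' lines, in the given order."""
    return "".join(f"{key} {value}\n" for key, value in settings)


def hub_txt(hub_name, email):
    """Top-level hub description (hypothetical labels and file names)."""
    return render_stanza([
        ("hub", hub_name),
        ("shortLabel", hub_name),
        ("longLabel", f"Assembly Hub for {hub_name}"),
        ("genomesFile", "genomes.txt"),
        ("email", email),
        ("descriptionUrl", "aboutHub.html"),
    ])


def genomes_txt(hub_name, default_pos):
    """Genome stanza pointing at the twoBit file and track database."""
    return render_stanza([
        ("genome", hub_name),
        ("trackDb", f"{hub_name}/trackDb.txt"),
        ("twoBitPath", f"{hub_name}/{hub_name}.2bit"),
        ("groups", f"{hub_name}/groups.txt"),
        ("description", hub_name),
        ("organism", hub_name),
        ("defaultPos", default_pos),
    ])


hub_text = hub_txt("flyHub", "someone@example.org")
genomes_text = genomes_txt("flyHub", "chr1:1-10000")
print(hub_text)
print(genomes_text)
```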

Figure 1 Illustration of the MakeHub pipeline

Transcriptome alignment files in binary alignment map (BAM) format can be visualized in two ways. By default, BAM files are sorted with SAMtools [26] and converted to WIG format. This conversion step can be performed either with the AUGUSTUS tool bam2wig [27] or, in its absence, with SAMtools and built-in MakeHub functionality. WIG files are subsequently converted to bigWig format as described above. This generates tracks that allow for an intuitive interpretation of gene structures in context with RNA-seq coverage information.
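One possible way to obtain WIG data from per-base coverage is sketched below in Python. It assumes that coverage has already been exported as tab-separated rows of sequence name, 1-based position, and depth (the layout printed by `samtools depth`) and writes variableStep WIG records. This is an illustration of the format conversion only, and it is not claimed to be the procedure that MakeHub or bam2wig uses internally.

```python
def depth_to_wig(depth_lines):
    """Convert 'chrom<TAB>pos<TAB>depth' rows (1-based) to variableStep WIG."""
    wig = []
    current_chrom = None
    for row in depth_lines:
        chrom, pos, depth = row.rstrip("\n").split("\t")
        if int(depth) == 0:
            continue  # positions without coverage are simply left out
        if chrom != current_chrom:
            # A new declaration line starts the data block of each sequence.
            wig.append(f"variableStep chrom={chrom}")
            current_chrom = chrom
        wig.append(f"{pos} {depth}")
    return wig


rows = ["chr1\t1\t0", "chr1\t2\t3", "chr1\t3\t5", "chr2\t10\t1"]
wig_lines = depth_to_wig(rows)
print("\n".join(wig_lines))
```

No track line is written, since a WIG file that is to be converted with wigToBigWig should contain only data.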

Optionally, BAM files can be displayed in native BAM format. The required BAM index is automatically generated with SAMtools. Viewing native BAM files gives immediate access to alignment quality information of single reads.
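For native BAM display, the browser needs the BAM file, its index, and a track entry of type bam. The Python sketch below only assembles the `samtools index` command and a matching track stanza. The track name, label, and visibility are hypothetical, and the command is built but not executed.

```python
def bam_index_command(bam_path):
    """Command that writes the index next to the BAM file (<bam_path>.bai)."""
    return ["samtools", "index", bam_path]


def bam_track_stanza(track, bam_file, label):
    """trackDb entry for a natively displayed BAM file (illustrative values)."""
    return (
        f"track {track}\n"
        f"bigDataUrl {bam_file}\n"
        f"shortLabel {label}\n"
        f"longLabel {label}\n"
        "type bam\n"
        "visibility hide\n"
    )


index_cmd = bam_index_command("rnaseq.sorted.bam")
stanza = bam_track_stanza("rnaseq_bam", "rnaseq.sorted.bam", "RNA-seq reads")
print(" ".join(index_cmd))
print(stanza)
```

The command list can be passed to `subprocess.run(index_cmd, check=True)` on a system where SAMtools is installed.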

MakeHub seamlessly integrates with output files of the popular genome annotation tools and pipelines AUGUSTUS, GeneMark-ES/ET, GlimmerHMM, SNAP, BRAKER, MAKER, and GeMoMa. GTF and GFF3 files of these tools and pipelines are standardized to a UCSC-compatible GTF format by MakeHub. Subsequently, UCSC’s gtfToGenePred is used to convert the GTF file to GenePred data, which is checked for consistency by UCSC’s genePredCheck and passed to genePredToBigGenePred and bedToBigBed, generating tracks that allow browsing predicted proteins at the amino acid level. If not available, genePredToBigGenePred is automatically replaced by genePredToBed, generating a track that does not allow browsing amino acids in predicted genes but still visualizes gene structures. In both cases, the final output is in bigBed format.
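The order of tool calls for a gene track, including the fallback to genePredToBed, can be sketched in Python as a list of commands. The tool names come from the text above, and the options follow the usage documented by UCSC for bigGenePred files. The intermediate file names, the explicit sort step, and the function name are assumptions, and the commands are only assembled here, not run.

```python
def gene_track_commands(gtf, chrom_sizes, out_bb,
                        have_big_gene_pred=True, autosql="bigGenePred.as"):
    """Return the command sequence that turns a GTF file into a bigBed track."""
    gp = "genes.gp"                    # hypothetical intermediate file names
    unsorted_bed = "genes.unsorted.bed"
    sorted_bed = "genes.sorted.bed"
    commands = [
        ["gtfToGenePred", "-genePredExt", gtf, gp],
        ["genePredCheck", gp],
    ]
    if have_big_gene_pred:
        # bigGenePred keeps the information needed to browse amino acids.
        commands.append(["genePredToBigGenePred", gp, unsorted_bed])
        to_bigbed = ["bedToBigBed", "-type=bed12+8", "-tab", f"-as={autosql}",
                     sorted_bed, chrom_sizes, out_bb]
    else:
        # Fallback: plain BED, gene structures only.
        commands.append(["genePredToBed", gp, unsorted_bed])
        to_bigbed = ["bedToBigBed", sorted_bed, chrom_sizes, out_bb]
    # bedToBigBed expects input sorted by sequence name and start position.
    commands.append(["sort", "-k1,1", "-k2,2n", "-o", sorted_bed, unsorted_bed])
    commands.append(to_bigbed)
    return commands


full = gene_track_commands("genes.gtf", "chrom.sizes", "genes.bb")
fallback = gene_track_commands("genes.gtf", "chrom.sizes", "genes.bb",
                               have_big_gene_pred=False)
for command in full:
    print(" ".join(command))
```

In practice, the `have_big_gene_pred` flag could be set with `shutil.which("genePredToBigGenePred") is not None`, and each command could be executed with `subprocess.run(command, check=True)`.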

MakeHub accepts the output directory of a BRAKER run as an input argument and automatically identifies the gene prediction files for visualization in that directory (alternatively, AUGUSTUS and GeneMark-ES/ET predictions can be passed as arguments to separate options). MakeHub automatically extracts gene models from the MAKER GFF3 output file (it usually contains evidence for gene models as well). MakeHub accepts the native GFF3 output file of GeMoMa and GlimmerHMM, and the output of SNAP’s zff2gff3.pl script.

Visualization of the evidence that goes into gene model inference is crucial. This evidence often exceeds the information that can be seen in an RNA-seq WIG or BAM track. For example, annotators are often interested in viewing splice junctions from RNA-seq and/or protein alignments with coverage and strand information in a concise overview. On the other hand, alignments from cDNA, assembled transcriptomes, and proteins need to be visualized in a gene-structure-like fashion. MakeHub automatically generates suitable tracks with evidence from MAKER output and from BRAKER hint files in GFF format. Gene-structure-like evidence (e.g., full length protein alignments) is visualized similarly to gene models, while other evidence, such as splice junctions, is visualized as segments. All resulting evidence track data files are in the indexed bigBed format.
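As an illustration of how segment-like evidence such as splice junctions might be turned into BED records, the Python sketch below converts intron hints from a GFF hints file into BED6 lines. It assumes the AUGUSTUS-style hint layout with a `mult=` attribute for the number of supporting alignments. The naming of the BED items and the use of the multiplicity as score are choices made here, not a description of MakeHub's actual output.

```python
def hints_to_bed(gff_lines, feature="intron"):
    """Convert hints of one feature type to BED6 (0-based, half-open)."""
    bed = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        seqname, strand = cols[0], cols[6]
        start, end = int(cols[3]), int(cols[4])  # GFF: 1-based, inclusive
        attrs = dict(item.split("=", 1)
                     for item in cols[8].split(";") if "=" in item)
        mult = int(attrs.get("mult", 1))  # assumed: missing mult means 1
        score = min(mult, 1000)           # BED scores range from 0 to 1000
        bed.append(
            f"{seqname}\t{start - 1}\t{end}\t{feature}_x{mult}\t{score}\t{strand}"
        )
    return bed


hints = [
    "# example hints",
    "chr1\tb2h\tintron\t101\t200\t0\t+\t.\tmult=12;pri=4;src=E",
    "chr1\tb2h\tCDSpart\t1\t50\t0\t+\t0\tsrc=P",
]
junctions = hints_to_bed(hints)
print("\n".join(junctions))
```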

MakeHub automatically generates HTML template files for describing a hub and its tracks. These files are required for public hubs (http://genomewiki.ucsc.edu/index.php/Public_Hub_Guidelines, February 10th 2019). These pages should be edited to appropriately describe genome projects and individual tracks before adding a hub to the list of public hubs at UCSC.

Automatically generated Assembly Hubs must be copied to a publicly available web server for deployment. The hyperlink to hub.txt can be provided to the UCSC Genome Browser for data visualization (see instructions in Figure 1).

Conclusion

In summary, MakeHub is a command line tool that enables scientists with little experience in bioinformatics to generate Assembly Hubs of their genome data and annotations of interest with ease.

Availability

MakeHub is freely available for downloading at https://github.com/Gaius-Augustus/MakeHub.

Author’s contributions

KJH developed the idea, implemented the software, wrote the manuscript, and approved the final version.

Competing interests

The author has declared no competing interests.

Acknowledgments

The international collaboration between the groups of Mark Borodovsky and Mario Stanke, supported by US National Institutes of Health (Grant No. HG000783), gave rise to the development of MakeHub. This project was funded by Universität Greifswald, Germany. I thank Mario Stanke, Matthis Ebel, and Malte Wellnitz for proofreading.
