Jian Zhao, Yan Li, Cong Wang, Haotian Zhang, Hao Zhang, Bin Jiang, Xuejiang Guo*, Xiaofeng Song
1 Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 State Key Laboratory of Reproductive Medicine, Department of Histology and Embryology, Nanjing Medical University, Nanjing 211166, China
3 Center of Pathology and Clinical Laboratory, Sir Run Run Hospital, Nanjing Medical University, Nanjing 211166, China
4 College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
KEYWORDS IRES; Circular RNA; Long non-coding RNA; Eukaryotic IRES; Viral IRES
Abstract Internal ribosome entry sites (IRESs) are functional RNA elements that can directly recruit ribosomes to an internal position of the mRNA in a cap-independent manner to initiate translation. Recently, IRES elements have attracted much attention for their critical roles in various processes, including translation initiation of a new type of RNA, circular RNA (circRNA), which has no 5′ cap to support classical cap-dependent translation. Thus, an integrative data resource of IRES elements with experimental evidence will be useful for further studies. In this study, we present IRESbase, a comprehensive database of IRESs, built by curating the experimentally validated functional minimal IRES elements from the literature and annotating their host linear and circular RNAs. The current version of IRESbase contains 1328 IRESs, including 774 eukaryotic IRESs and 554 viral IRESs from 11 eukaryotic organisms and 198 viruses, respectively. As IRESbase collects only IRESs of minimal length with functional evidence, the median length of IRESs in IRESbase is 174 nucleotides. By mapping IRESs to human circRNAs and long non-coding RNAs (lncRNAs), 2191 circRNAs and 168 lncRNAs were found to contain at least one entire or partial IRES sequence. IRESbase is available at http://reprod.njmu.edu.cn/cgi-bin/iresbase/index.php.
Translation initiation is a crucial and highly regulated step during protein synthesis in eukaryotes. Generally, eukaryotic mRNAs recruit ribosomes to initiate translation by a canonical cap-dependent scanning mechanism, which requires a methylated cap structure at the 5′ end and is mediated by the eukaryotic translation initiation factors (eIFs). Instead of using the cap-dependent mechanism, a subset of eukaryotic mRNAs can initiate translation internally through specialized sequences, called internal ribosome entry sites (IRESs), which can directly recruit 40S ribosomal subunits to initiate translation.
The IRES elements were first discovered about 30 years ago in picornaviruses [1], followed by the discovery of novel IRESs in many other pathogenic viruses, e.g., encephalomyocarditis virus, foot-and-mouth disease virus, and human immunodeficiency virus [2–4]. The IRES elements ensure efficient viral translation when the canonical eIFs necessary for mRNA recruitment are inhibited in infected cells. Besides viral mRNAs, IRESs were also found in a subset of eukaryotic mRNAs, e.g., XIAP, TP53, EIF4G2, and APAF1 [5–8]. The IRESs in eukaryotic mRNAs allow their translation to continue under many physiologically stressful conditions when cap-dependent translation is suppressed. In addition, for those eukaryotic IRES-containing mRNAs with long and highly structured 5′ UTRs (incompatible with the cap-dependent scanning mechanism), the IRES elements ensure their efficient translation under normal physiological conditions even when cap-dependent translation is fully active [9]. The increasing number of discovered eukaryotic IRESs suggests that the IRES-driven initiation mechanism accounts for a significant proportion of eukaryotic mRNA translation. Although studies of IRESs have been limited, eukaryotic mRNAs bearing IRESs seem to play important roles in diverse processes, including stress response, cell survival, apoptosis, mitosis, tumorigenesis, and tumor progression [10–14].
Two IRES databases, IRESdb and IRESite, were built in 2002 and 2005, respectively [15,16]. In addition, the Rfam database collected IRESs as a type of cis-regulatory RNA element [17]. However, these databases only include IRESs in mRNAs. Recently, IRES elements have also been found in circular RNAs (circRNAs) and long non-coding RNAs (lncRNAs) [18]. The circRNAs circ-FBXW7, circ-ZNF609, circ-SHPRH, circPINTexon2, and circβ-catenin were found to be translated into polypeptides, mediated by IRES elements [19–23]. The translation of two small open reading frames (ORFs; 120 nucleotides and 141 nucleotides long) within the lncRNA meloe was achieved through an IRES-dependent mechanism [24]. The identification of IRES-driven translation of circRNAs and lncRNAs has attracted increasing attention recently.
The IRESdb database only has 30 viral IRESs and 50 eukaryotic IRESs, and the IRESite database contains 125 IRESs in total from 43 viruses and 70 eukaryotic mRNAs. The Rfam database builds 32 RNA families with about 11 viral IRESs and 21 eukaryotic IRESs. In this study, by manual curation, we developed a novel non-redundant public database and named it IRESbase. This database contains updated experimentally validated IRESs, including 554 viral IRESs, 691 human IRESs, and 83 IRESs from other eukaryotic organisms.
To facilitate the development of new studies on IRES-dependent functional regulation, IRESbase collected rich information about human IRESs. This database includes information regarding the genomic positions, sequence conservation, single nucleotide polymorphisms (SNPs), nucleotide modifications, targeting microRNAs (miRNAs), host genes, host transcripts (mRNA, circRNA, and lncRNA), GO terms (biological process, molecular function, and cellular component), KEGG pathway annotations, and validation assay information. In the database, 2191 circRNAs and 168 lncRNAs contain at least one entire or partial IRES element. Among these transcripts, 1012 circRNAs contain an ORF longer than 300 nucleotides, and all 168 lncRNAs contain at least two small ORFs (ranging from 60 to 300 nucleotides in length). IRESbase allows users to search, browse, download, and submit experimentally validated IRESs.
To construct IRESbase with high-quality information, the experimentally validated IRES sequences were manually searched from literature published before Oct. 14, 2019 using PubMed with ten keywords in the titles or abstracts (Figure S1). These ten keywords are “IRES”, “IRESs”, “internal ribosome entry site”, “internal ribosome entry sites”, “internal ribosomal entry site”, “internal ribosomal entry sites”, “internal ribosome entry segment”, “internal ribosomal entry segment”, “internal ribosome entry sequence”, and “internal ribosomal entry sequence”. We manually checked the abstracts and the corresponding full texts of the retrieved literature to see whether a novel IRES was discovered and validated experimentally. For the literature concerning previously discovered IRESs, we further retrieved the original publications through their citations. For published reviews and databases of IRESs, we checked all the cited references of their collected IRESs and located the original publications that discovered and validated these IRESs. The sequences of experimentally validated IRESs, together with the experiment information essential for the validation, were extracted. The validation experiment information includes bicistronic backbone, positive and negative controls, second cistron expression, tested cell types, methods to analyze RNA structure, and in vitro translation experiments. Instead of collecting all the truncated 5′ UTR sequences experimentally validated to contain IRES activity, as IRESite did, only the shortest one was preserved in IRESbase to reduce data redundancy (Figure S2). In total, 1328 experimentally validated IRES sequences were collected from 236 publications, including 554 viral IRESs, 691 human IRESs, and 83 IRESs from other eukaryotic organisms (mouse, fly, yeast, etc.).
As the available IRES sequence-related information varied across publications, we manually extracted the IRES sequences using different methods (Figure S1). These methods can be roughly classified into three categories: (1) extracting the IRES sequence directly from the figure of its two-dimensional (2D) structure; (2) extracting the IRES sequence indirectly through the information of its host transcript and position; and (3) extracting the IRES sequence by using the NCBI BLAST tool with the forward and reverse primers reported in the literature. Note that if more than one of the overlapping sequences were experimentally validated to contain IRES activity in the literature, only the minimal functional sequence among them was selected (Figure S2). In addition, the IRES sequences derived from different publications were compared pairwise, and if the sequence identity was over 90%, only the shortest one was selected.
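The pairwise redundancy filter described above can be illustrated with a minimal Python sketch. The text does not state how sequence identity was computed, so the `identity` function below (matched characters from `difflib.SequenceMatcher` divided by the length of the shorter sequence) is an assumption made for illustration only, and the names `identity` and `remove_redundant` are hypothetical rather than part of IRESbase.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Approximate identity: matched characters / length of the shorter sequence.

    This is a stand-in measure (an assumption); a real pipeline would
    normally derive identity from a proper pairwise alignment.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def remove_redundant(seqs: dict[str, str], cutoff: float = 0.90) -> dict[str, str]:
    """Keep only the shortest sequence of every group sharing > cutoff identity."""
    kept: dict[str, str] = {}
    # Visit sequences from shortest to longest, so that the shortest member
    # of a similar group is the one that is retained.
    for name, seq in sorted(seqs.items(), key=lambda item: len(item[1])):
        if all(identity(seq, other) <= cutoff for other in kept.values()):
            kept[name] = seq
    return kept


# Made-up toy sequences: "b" is "a" plus two extra nucleotides.
toy = {
    "a": "ACGUACGUACGUACGUACGU",
    "b": "ACGUACGUACGUACGUACGUAA",
    "c": "GGGGGCCCCCUUUUUAAAAA",
}
non_redundant = remove_redundant(toy)
```

In this toy input, the longer near-duplicate "b" is discarded in favour of "a", while the unrelated sequence "c" is kept.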
We searched and collected the verified 2D structures of IRESs (or IRES host 5′ UTRs) from the literature. The 3D structures of IRESs were manually searched and collected from the PDB, PDBj, and PDBe databases [25–27]. The sequences in the collected IRES 2D structures were manually extracted from the corresponding figures, and then the IRES-matched regions were calculated by sequence alignment. In addition, for all IRESs, the 2D structure and the minimal free energy (MFE) were also calculated using RNAfold of the Vienna RNA package [28].
For human IRESs, genomic position, sequence conservation, SNPs, RNA modification sites, potential host circular RNA, and miRNA–IRES interaction data were provided. As shown in Figure S3, the IRES genomic positions were obtained by mapping IRES sequences to their host transcripts, and then chromosomal positions were extracted from NCBI transcript annotation files (GRCh37.p13 and GRCh38.p10) [29]. Based on the genomic positions, the phyloP scores (hg38 phyloP100way) were extracted from the UCSC database [30]. The NCBI dbSNP database (release 150) was searched for the common SNPs, and SNPs within the IRES sequences were flagged [31]. The RMBase v2.0 database was searched for RNA modification sites within the human IRES regions, including N6-methyladenosine (m6A) sites, N1-methyladenosine (m1A) sites, pseudouridine (ψ) sites, and 2′-O-methylation (2′-O-Me) sites [32]. The potential miRNA–IRES interactions were predicted by miRanda (version: aug2010) with a score threshold of 150 [33].
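As a rough illustration of the position-based annotation described above, the following Python sketch flags SNPs that fall within an IRES genomic interval and summarises per-base conservation scores over the same interval. The function name `annotate_ires`, the in-memory data structures, the use of a mean as the summary, and all coordinates and identifiers are assumptions for illustration; they do not reflect the actual IRESbase pipeline or real dbSNP or phyloP data.

```python
from statistics import mean
from typing import Optional


def annotate_ires(
    chrom: str,
    start: int,
    end: int,
    phylop: dict[tuple[str, int], float],
    snps: list[tuple[str, str, int]],
) -> dict[str, object]:
    """Annotate one IRES interval (1-based, inclusive coordinates).

    phylop maps (chromosome, position) to a per-base score;
    snps is a list of (snp_id, chromosome, position) tuples.
    """
    # Collect the conservation scores available inside the interval.
    scores = [
        phylop[(chrom, pos)]
        for pos in range(start, end + 1)
        if (chrom, pos) in phylop
    ]
    mean_score: Optional[float] = mean(scores) if scores else None
    # Flag every SNP whose position lies inside the IRES interval.
    flagged = [
        snp_id for snp_id, snp_chrom, pos in snps
        if snp_chrom == chrom and start <= pos <= end
    ]
    return {"mean_phylop": mean_score, "snps": flagged}


# Made-up scores and SNP identifiers, for demonstration only.
demo_scores = {("chr1", 100): 1.0, ("chr1", 101): 3.0, ("chr1", 200): -2.0}
demo_snps = [("snpA", "chr1", 101), ("snpB", "chr1", 150), ("snpC", "chr2", 100)]
annotation = annotate_ires("chr1", 100, 101, demo_scores, demo_snps)
```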
The host genes and transcripts of IRESs were collected from the literature, if available. For IRES sequences without detailed host gene or transcript information in the literature, nucleotide BLAST in NCBI was performed to search for their host genes and transcripts with a sequence identity of at least 95% [34]. For viral IRESs, the closest gene with a coding sequence (CDS) downstream of an IRES element was regarded as its host gene. For human IRESs, besides mRNAs, we also aligned IRESs to lncRNAs and circRNAs, and only sequences containing an IRES fragment of at least 30 nucleotides were considered as candidate IRES host transcripts. The cutoff value (30 nucleotides) was defined according to the minimum length of known IRES elements.
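The 30-nucleotide cutoff can be illustrated with a simplified Python sketch. It uses the longest exactly shared substring between an IRES and a transcript as a proxy for the aligned fragment, which is an assumption: the actual mapping relied on sequence alignment, which tolerates mismatches and gaps. The names `longest_shared_fragment` and `is_candidate_host` are hypothetical.

```python
from difflib import SequenceMatcher

MIN_FRAGMENT = 30  # cutoff in nucleotides, as stated in the text


def longest_shared_fragment(ires: str, transcript: str) -> int:
    """Length of the longest substring shared exactly by both sequences."""
    matcher = SequenceMatcher(None, ires, transcript, autojunk=False)
    match = matcher.find_longest_match(0, len(ires), 0, len(transcript))
    return match.size


def is_candidate_host(ires: str, transcript: str, min_fragment: int = MIN_FRAGMENT) -> bool:
    """True if the transcript carries an IRES fragment of at least min_fragment nt."""
    return longest_shared_fragment(ires, transcript) >= min_fragment


# Made-up sequences: one transcript carries 33 nt of the IRES, the other only 20 nt.
demo_ires = "ACGU" * 10
transcript_long = "GGGG" + demo_ires[5:38] + "CCCC"
transcript_short = "GGGG" + demo_ires[:20] + "CCCC"
```

With these toy inputs, only `transcript_long` would be reported as a candidate host transcript.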
For each IRES entry, mRNA and lncRNA information was extracted from the NCBI nucleotide database, and the circRNA information was extracted from circBase, circRNADb, and circAtlas [35–37]. The ORF (<300 nucleotides) and its corresponding peptide sequence were predicted for each lncRNA, and the ORF (>300 nucleotides) and its corresponding protein sequence were predicted for each circRNA. The coding potential score of ORFs was calculated by CPAT [38]. The sequence conservation of ORFs in lncRNAs was assessed and indicated using the mean phyloP score [30]. Gene-level functional information, i.e., OMIM and Gene Ontology (GO) terms of biological process, molecular function, and cellular component, was extracted from the NCBI Gene database. The pathway information was retrieved from the KEGG PATHWAY database [39].
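The ORF prediction step is not described in detail, so the following Python sketch only illustrates one common approach and should not be read as the method used for IRESbase. It assumes that an ORF runs from an ATG to the first in-frame stop codon, that the reported length includes the stop codon, and that a circRNA can be scanned on a doubled copy of its sequence so that an ORF may cross the back-splice junction. The function name `find_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, circular: bool = False) -> list[tuple[int, int]]:
    """Return (start, length_nt) for every ATG...stop ORF on the forward strand.

    start is 0-based; length_nt includes the stop codon. Nested ORFs that
    share a stop codon are all reported. ORFs without any in-frame stop
    (possible on a circular template) are not reported by this sketch.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    # For a circular RNA, scanning a doubled copy lets an ORF that starts
    # near the 3' end continue across the back-splice junction.
    scan = seq * 2 if circular else seq
    orfs = []
    for start in range(n):
        if scan[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(scan) - 2, 3):
            if scan[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3 - start))
                break
    return orfs


# Made-up toy sequences.
linear_demo = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
circular_demo = "GCTGCTTAACCATG"  # the ATG sits at the end and needs the junction
```

Under these assumptions, the size classes mentioned in the text could be applied afterwards, e.g. keeping ORFs of 60 to 300 nucleotides for lncRNAs and ORFs longer than 300 nucleotides for circRNAs.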
The IRESbase database was constructed by assembling all curated minimal IRES sequences and structure information, related transcript and gene information, as well as annotations. IRESbase was configured in a typical XAMPP (cross-platform + Apache + MySQL + PHP + Perl) integrated environment.
IRESbase is intended to be a comprehensive database that includes both eukaryotic and viral IRESs with experimental evidence and covers not only the information on IRES sequences and structures, but also the information on IRES host transcripts and genes. The database currently contains a total of 1328 IRES entries, including 774 eukaryotic IRES entries and 554 viral IRES entries. Each eukaryotic IRES entry contains three basic types of records (IRES, gene, and transcript records, if available), whereas each viral IRES entry contains the IRES and transcript records. In total, 1328 IRES records, 725 eukaryotic gene records, 3633 eukaryotic transcript records (1274 mRNAs, 168 lncRNAs, and 2191 circRNAs), and 588 viral transcript records are included in IRESbase. The data sources and the structure of IRESbase are shown in Figure 1.
The types of data available varied across different organisms. Human IRES records have the largest number of attributes (32 in total, if available); the IRES records for other eukaryotic organisms and viruses have 20 and 24 attributes, respectively (Figure 1). All IRES records share 18 attributes, including IRESbase ID, organism, sequence, length, host gene ID and symbol, validation assay information, 2D structure, MFE, PubMed ID, and reference. The other attributes for human records are host gene synonym, host transcript ID and alignment result, genomic positions (location and strand), sequence conservation, SNP, modification sites, and miRNA–IRES interaction. Host gene synonym, host transcript ID, and alignment result are also contained in the IRES records for other eukaryotic organisms. The viral IRES record contains six other attributes, including viral genome ID (e.g., NC_006273), genome shape, virus type, genomic position, CDS of regulation, and 3D structure.
Figure 1 Data sources and the IRESbase structure
As shown in Figure 1, the eukaryotic gene records commonly have nine attributes, which are gene ID, gene symbol, gene synonym, gene note, GO biological process, GO molecular function, GO cellular component, KEGG pathway, and IRESbase ID. For human, besides the nine attributes mentioned above, three other attributes (OMIM ID, OMIM phenotype, and HGNC ID) are included. The eukaryotic transcript records for IRES host mRNAs, lncRNAs, and circRNAs contain 12, 13, and 11 attributes, respectively. The 12 attributes of eukaryotic mRNA records are transcript ID, name, version, sequence, length, gene symbol, IRES alignment result, CDS region, protein ID, name, sequence, and CCDS ID. For lncRNA records, besides the first seven attributes in the mRNA record, the other six attributes are gene ID, ORF, ORF’s coding potential score, ORF’s sequence conservation, predicted protein sequence, and length. For circRNA records, the 11 attributes are circRNA ID (circBase, circRNADb, and circAtlas ID), genomic location, strand, gene symbol, best transcript, IRES alignment result, ORF, coding potential score, predicted protein sequence, protein length, and reference. The viral transcript records have 21 attributes, including NCBI locus, accession, version, virus name, organism, type, group, transcript sequence, length, shape, gene ID, name, tag, function, CDS region, protein ID, name, sequence, IRESbase ID, IRES location, and strand.
Simple, advanced, and BLAST search
IRESbase was developed in a user-friendly mode, providing a search engine to find the IRESs of interest. The homepage (Figure S4) offers a search box for querying IRESbase by IRES ID, gene ID, transcript ID, protein ID, PubMed ID, chromosomal location, organism name, gene name, protein name, or a substring of any of the aforementioned keywords. If a keyword (e.g., FUBP1 or HCMV) is entered into the search bar, the results will be shown in a tabular format. By clicking on the IRES ID (e.g., hsa_ires_00089.1 or vir_ires_00484.1), the detailed information is shown in three parts: the IRES information (Figure 2), its host gene information, and its host transcript information (Figure 3). Furthermore, terms in the page are cross-linked to several external databases, including NCBI, UCSC, AmiGO, KEGG, HGNC, RMBase, circBase, circRNADb, and circAtlas [30–32,35–37,39–41].
In addition, IRESbase also provides three advanced search options for users, including (1) BLAST search, (2) Advanced search, and (3) Chromosomal location.
(1) BLAST search: this option was designed to find whether there is a similar IRES in the RNA sequence of interest. It provides four IRES sequence datasets (human, other eukaryotic organisms, virus, and all organisms) for users. Users can input more than one query sequence in FASTA format. The search results are summarized in a table containing job title, IRES ID, score, expect, identity, gaps, and strand (Figure 4). By clicking the ‘Job title’, the alignment detail is shown at the nucleotide level.
(2) Advanced search: in this option, users can use one or more query fields together to retrieve the IRESs of interest. Different query fields are provided for users to search IRESs of different organisms, including virus, human, and other eukaryotic organisms (Figure 1).
(3) Chromosomal location: this option was designed for users to find whether a verified IRES is located in the circRNAs of interest by entering a specific chromosomal region in the human genome (hg19 or hg38).
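The chromosomal location search in option (3) amounts to an interval-overlap query. The Python sketch below shows one way such a query could work; the accepted region syntax (`chr1:1000-2000`), the record identifiers, and the coordinates are all hypothetical, and the sketch is not the code behind the IRESbase web interface.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_region(text: str) -> Region:
    """Parse a region string such as 'chr1:1,000-2,000' (format is an assumption)."""
    match = re.fullmatch(r"(chr\w+):(\d[\d,]*)-(\d[\d,]*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised region: {text!r}")
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return Region(match.group(1), start, end)


def overlaps(a: Region, b: Region) -> bool:
    """Two closed intervals overlap if each starts before the other ends."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def search_by_location(query: str, index: dict[str, Region]) -> list[str]:
    """Return the IDs of all indexed records that overlap the queried region."""
    region = parse_region(query)
    return sorted(rec_id for rec_id, rec in index.items() if overlaps(region, rec))


# Made-up records and coordinates, for demonstration only.
demo_index = {
    "demo_ires_1": Region("chr1", 1500, 1700),
    "demo_ires_2": Region("chr1", 5000, 5200),
    "demo_ires_3": Region("chr2", 1500, 1700),
}
hits = search_by_location("chr1:1,000-2,000", demo_index)
```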
Data browse, download, and submission
As an alternative to searching for a specific IRES, all entries of IRESbase are grouped by organism, GO & KEGG, chromosome, and transcript type. Users can browse the IRES group of interest by clicking the count on the left of the corresponding page. Unlike other existing IRES databases, IRESbase allows users to download all the IRES records together in a simple tab-delimited format or to download each IRES record separately in a standard PageTab (page layout tabulation) format. The PageTab format file only contains the basic IRES information and its validation assay information. In addition, users can also select the IRES attributes of interest to download in an Excel-compatible format.
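A tab-delimited download of this kind can be processed with standard tools. The Python sketch below reads such a file and computes the median IRES length per organism; the column names (`ires_id`, `organism`, `sequence`) and the example rows are assumptions made for illustration, as the exact layout of the downloaded file is not described here.

```python
import csv
import io
from statistics import median


def median_length_by_organism(handle) -> dict[str, float]:
    """Compute the median sequence length per organism from a tab-delimited file.

    The file is assumed to have a header row with 'organism' and
    'sequence' columns (hypothetical column names).
    """
    lengths: dict[str, list[int]] = {}
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        lengths.setdefault(row["organism"], []).append(len(row["sequence"]))
    return {organism: median(values) for organism, values in lengths.items()}


# Made-up rows standing in for a downloaded file.
example_file = io.StringIO(
    "ires_id\torganism\tsequence\n"
    "demo_1\thuman\tACGU\n"
    "demo_2\thuman\tACGUAC\n"
    "demo_3\tvirus\tACGUACGUAC\n"
)
medians = median_length_by_organism(example_file)
```

For a real download, the `io.StringIO` object would be replaced by an opened file, e.g. `open(path, newline="")`, with the column names adjusted to match the file header.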
IRESbase allows and encourages users to submit novel experimentally validated IRESs. For the submission, users should provide the basic IRES information (e.g., IRES sequence, organism, host gene ID, host gene name, and host transcript ID) and validation assay information, including bicistronic backbone, positive and negative controls, second cistron expression, tested cell types, methods to analyze RNA structure, in vitro translation experiments, and PubMed ID. In addition, users can also submit a novel verified IRES by emailing a PageTab format file, in which all the essential information should be assembled. The submitted record will be included in IRESbase after a manual check by our curators.
The current version of IRESbase contains the largest collection of experimentally validated IRESs in 11 eukaryotic organisms and 198 viruses. When compared with the existing IRES databases, IRESbase contains the largest number of eukaryotic IRESs and viral IRESs, including 1328 IRESs, while Rfam, IRESdb, and IRESite only contain about 32, 80, and 135 IRESs, respectively (Figure 5A). In particular, IRESbase is the first database containing IRESs discovered in circRNAs and lncRNAs (Figure 5C). Moreover, IRESbase has minimal data redundancy. Unlike the IRESite database, which collects all the truncated overlapping 5′ UTR sequences validated to contain IRES activity, IRESbase only stores the shortest functional sequences as the core IRES region. The median length of IRESs in IRESbase is 174 nucleotides, which is much shorter than that in IRESite (442 nucleotides; Figure 5B).
In addition, IRESbase provides richer information and more useful functions than other IRES databases. Besides the experimentally verified 2D structures of 56 IRESs, IRESbase also collected the 3D structures of 12 IRESs from the literature. Moreover, IRESbase predicted the 2D structure for all of the IRESs with the Vienna RNA package [28]. IRESbase also provides more annotations, including genomic positions, sequence conservation, SNPs, nucleotide modifications, and targeting miRNAs for human IRESs. Apart from the IRES host transcripts collected from the literature, IRESbase also provides many other potential IRES host transcripts (mRNAs, lncRNAs, and circRNAs) predicted by sequence identity. Compared with existing databases, IRESbase provides more query fields (e.g., genomic location and gene synonym) and search methods (fuzzy, advanced, and BLAST search). Additionally, the entire IRESbase dataset or the matched specific terms can be easily batch downloaded from IRESbase.
Figure 2 Entry page for two IRES examples (hsa_ires_00089.1 and vir_ires_484.1)
In addition to 5′ UTRs, IRESs are also found in CDS regions [42]. Of the 691 human IRESs, 60% are located only in the 5′ UTR regions, 27% are located across the start codon, and 4% are found only in CDS regions (Figure 5D). To investigate the functions of human IRESs, gene set enrichment analysis was performed for the human IRES host genes using Metascape (www.metascape.org) [43]. As shown in Figure 5E, the most significant term is ‘hsa05200: Pathways in cancer’. The enriched terms indicate that human IRESs may play important roles in cancer, cellular stress response, virus infection, and cell signaling.
Figure 3 Gene and transcript records for a human IRES entry (hsa_ires_00089.1)
As early as 1995, IRESs in mRNAs were shown to be effective in initiating the protein synthesis of artificial circRNAs [44]. In recent years, IRESs in circRNAs have also been demonstrated to initiate the translation of linear transcripts (e.g., bicistronic transcripts) [19–23]. These studies indicate that the activity of an IRES is not RNA type-specific and that an IRES can function in any kind of host RNA. Currently, only nine circRNAs and one lncRNA have been found to be translated via an IRES. Therefore, in order to identify more potential coding circRNAs and lncRNAs, mRNA-derived IRESs were mapped to human circRNAs and lncRNAs by pairwise sequence alignment. The result shows that 371 circRNAs and 83 lncRNAs contain full IRES sequences; 1777 circRNAs and 77 lncRNAs contain partial IRES sequences (at least 30 nucleotides). Furthermore, circRNA- and lncRNA-derived IRESs were also mapped to mRNAs. Consequently, eight novel mRNAs were discovered to contain IRESs.
Figure 4 Result page of BLAST search exemplified with hsa_ires_00089.1
In order to investigate the potential coding circRNAs and lncRNAs, ORFs were predicted for the circRNAs and lncRNAs containing a full or partial IRES sequence. Of the circRNAs with an entire mRNA-derived IRES element, 261 contain an ORF (>300 nucleotides); of the circRNAs with a partial mRNA-derived IRES element, 722 contain an ORF (>300 nucleotides). All of the lncRNAs were predicted to contain small ORFs (longer than 60 nucleotides and shorter than 300 nucleotides). In 61 lncRNAs, the IRESs were found to be located between two small ORFs, which indicates that these lncRNAs may encode multiple small peptides. In addition, the coding potential score was calculated for the circRNAs and lncRNAs, and the results suggest that 779 circRNAs and three lncRNAs are able to encode proteins and small peptides, respectively.
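The check for an IRES located between two small ORFs can be expressed as a simple interval test, sketched below in Python. The exact criterion used in the analysis is not given, so this sketch assumes that "between" means at least one ORF ends at or before the IRES start and at least one ORF begins at or after the IRES end, with no overlap; the function name `ires_between_orfs` and the coordinates are hypothetical.

```python
def ires_between_orfs(ires: tuple[int, int], orfs: list[tuple[int, int]]) -> bool:
    """Check whether an IRES lies between an upstream and a downstream ORF.

    All intervals are 0-based, half-open (start, end) positions on the
    same transcript. Overlap between the IRES and an ORF does not count.
    """
    ires_start, ires_end = ires
    has_upstream = any(end <= ires_start for _, end in orfs)
    has_downstream = any(start >= ires_end for start, _ in orfs)
    return has_upstream and has_downstream


# Made-up transcript coordinates, for demonstration only.
demo_ires_interval = (200, 350)
flanking_orfs = [(50, 170), (400, 520)]
overlapping_orfs = [(50, 250), (400, 520)]
```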
In recent years, as a functional RNA element that recruits ribosomes to initiate translation in a cap-independent manner, the IRES has been shown to mediate circRNA and lncRNA translation [18,24,45]. Although only a few peptides encoded by circRNAs have been discovered, these peptides play important roles in various biological processes [19–23]. Besides the presence of an ORF, the coding ability of circRNAs further depends on the presence of an IRES element. However, so far only a few IRESs are available in existing IRES databases, and the translation initiation mechanism via IRES is still elusive, especially for cellular IRESs [46]. Therefore, there is an urgent need for a comprehensive curated dataset of all experimentally validated IRES elements, which could be a rich resource for functional studies and provide clues to the protein-coding ability of circRNAs and lncRNAs.
Figure 5 Statistics and analysis of human IRESs
In the near future, we will collect and update more IRES-related information, including regulatory proteins, critical sites or regions for their activity, and experimentally verified biological functions. We also plan to expand the database with experimentally validated sequences without IRES activity, which can not only help prevent repeated validations, but also facilitate the development of bioinformatics tools for the identification of IRES elements. Finally, as IRESbase is an open-access database, we hope researchers will not only use its content but also help us update the database with information on new IRESs and provide us with feedback.
IRESbase is freely accessible at http://reprod.njmu.edu.cn/cgi-bin/iresbase/index.php.
Jian Zhao: Methodology, Data Curation, Formal analysis, Visualization, Writing - Original Draft. Yan Li: Investigation, Data Curation. Cong Wang: Methodology, Software. Haotian Zhang: Methodology, Software. Hao Zhang: Investigation. Bin Jiang: Resources. Xuejiang Guo: Conceptualization, Supervision, Writing - Review and Editing, Funding acquisition. Xiaofeng Song: Conceptualization, Supervision, Writing - Review and Editing, Funding acquisition. All authors read and approved the final manuscript.
The authors have declared no competing interests.
The study was supported by the National Natural Science Foundation of China (Grant No. 61973155), the National Key R&D Program of China (Grant No. 2016YFA0503300), the National Natural Science Foundation of China (Grant Nos. 61571223, 81971439, and 81771641), the Program for Distinguished Talents of Six Domains in Jiangsu Province (Grant No. YY-019), the Fok Ying Tung Education Foundation (Grant No. 161037), the Fundamental Research Funds for the Central Universities (Grant No. NP2018109), the Natural Science Foundation of the Jiangsu Higher Education Institutions (Grant No. 18KJB310006), and the Scientific Research Foundation of Nanjing Medical University (Grant No. 2242019K3DN02), China.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2020.03.001.
0000-0001-7857-4766 (Zhao, J)
0000-0003-0420-4222 (Li, Y)
0000-0003-4669-9702 (Wang, C)
0000-0002-7181-0962 (Zhang, H)
0000-0001-7043-5438 (Zhang, H)
0000-0002-1156-2557 (Jiang, B)
0000-0002-0475-5705 (Guo, X)
0000-0001-7445-4302 (Song, X)
Genomics, Proteomics & Bioinformatics, 2020, Issue 2