Characterizing RNA Pseudouridylation by Convolutional Neural Networks

2021-06-07 07:44:38XuanHeSaiZhangYanqingZhangZhixinLeiTaoJiangJianyangZeng
Genomics,Proteomics & Bioinformatics 2021年5期

Xuan He, Sai Zhang, Yanqing Zhang, Zhixin Lei, Tao Jiang,Jianyang Zeng,7,*

1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China

2 Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China

3 Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China

4 Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA

5 MOE Key Lab of Bioinformatics and Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

6 Institute of Integrative Genome Biology, University of California, Riverside, CA 92521, USA

7 MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China

KEYWORDS Pseudouridylation;Convolution neural network;Sequence motif;Translation;RNA stability

Abstract Pseudouridine (Ψ) is the most prevalent post-transcriptional RNA modification and is widespread in small cellular RNAs and mRNAs.However,the functions,mechanisms,and precise distribution of Ψs (especially in mRNAs) still remain largely unclear. The landscape of Ψs across the transcriptome has not yet been fully delineated.Here,we present a highly effective model based on a convolutional neural network (CNN), called PseudoUridyLation Site Estimator (PULSE), to analyze large-scale profiling data of Ψ sites and characterize the contextual sequence features of pseudouridylation. PULSE, consisting of two alternatively-stacked convolution and pooling layers followed by a fully-connected neural network,can automatically learn the hidden patterns of pseudouridylation from the local sequence information. Extensive validation tests demonstrated that PULSE can outperform other state-of-the-art prediction methods and achieve high prediction accuracy, thus enabling us to further characterize the transcriptome-wide landscape of Ψ sites. We further showed that the prediction results derived from PULSE can provide novel insights into understanding the functional roles of pseudouridylation,such as the regulations of RNA secondary structure,codon usage,translation,and RNA stability,and the connection to single nucleotide variants. The source code and final model for PULSE are available at https://github.com/mlcb-thu/PULSE.

Introduction

Pseudouridine(Ψ)is known as the most abundant and earliest discovered modified ribonucleoside among more than 100 types of RNA post-transcriptional modifications that have been identified so far[1–3].Because of its prevalence in cellular RNAs, it has also been considered as the fifth ribonucleoside[1]. The properties, chemical structure, and distribution of a Ψ are quite different from those of its parental base (i.e., uridine) [4]. Compared to a uridine, the chemical conformation of a Ψ allows the formation of an extra hydrogen bond at its non-Waston-Crick edge [5]. This fact indicates that a Ψ can form a more stable base stacking conformation [6], which is believed to play an important role in stabilizing RNA structure [7]. In ribosomal RNAs (rRNAs), it has been found that Ψs are required for the maintenance of ribosome–ligand interactions and translational fidelity[8].In addition,Ψs in transfer RNAs(tRNAs)can reduce the conformational mobility of the structural elements around the modified sites and thus affect the amino acid transfer efficiency[9].Such a stabilization function of Ψs in anticodon stem-loop of tRNALys,3has also been validated by Nuclear Magnetic Resonance (NMR) spectroscopy [10]. Moreover, Ψs in spliceosomal RNAs can be involved in the ribonucleoprotein (RNP) assembling and premRNA splicing processes [11]. The aforementioned findings indicate that most likely Ψs are tightly related to RNA structure stabilization, translation process, and RNA stability.Despite these observations, the underlying mechanisms of pseudouridylation in the aforementioned processes still remain to be further explored.

The conversion from a uridine to a Ψ is catalyzed by Ψ synthases (PUSs) through two distinct processes, including RNA-dependent and RNA-independent operations [12]. The RNA-dependent pseudouridylation process relies on the box H/ACA RNPs, which consist of a small box H/ACA RNA and four core proteins, including centromere-binding factor 5(Cbf5; also known as dyskerin in mammals), non-histone protein 2 (Nhp2), glycine-arginine-rich protein 1 (Gar1), and nucleolar protein 10 (Nop10), to form a pseudouridylation pocket for substrate recognition and catalytic activity [13]. In the RNA-independent pseudouridylation process,a single synthase protein, such as PUS7, is responsible for both substrate recognition and pseudouridylation catalysis[12].So far,about 13 types of PUSs in human have been identified,and generally it is difficult to unveil the consensus catalytic laws of pseudouridylation [14]. Moreover, it has been shown that pseudouridylation in RNAs is highly dynamic and inducible [12],which makes it even harder to characterize the properties of pseudouridylation. On the other hand, RNA modification is mostly a sequence pattern recognition process, as it is heavily dependent on the sequence binding preferences of catalytic proteins[15].From this point of view,it is reasonable to speculate that RNA pseudouridylation is determined by the sequence contexts of the sites being modified.

To characterize the intrinsic properties of Ψs, we need to develop efficient methods to accurately identify Ψ sites at single-base resolution and obtain a transcriptome-wide map of Ψs. Traditional Ψ detection methods are mainly based on the N3-CMC labeling and gel electrophoresis experiments[16],which are often laborious and time-consuming.Recently,several high-throughput profiling techniques, including Pseudo-seq [17], Ψ-seq [18], pseudouridine site identification sequencing (PSI-seq) [19], and N3-CMC-enriched pseudouridine sequencing (CeU-seq) [20], have been proposed to map RNA Ψ sites to reference transcriptomes. These highthroughput experiments typically combine CMC derivatives with next-generation sequencing or deep sequencing techniques to detect Ψ sites on a transcriptome-wide scale. However, these experiments are generally costly and often require tremendous time and effort in deriving the positions of Ψ sites.In addition,although recent high-throughput sequencing techniques,such as Ψ-seq and CeU-seq,have been able to identify large-scale Ψ sites in mRNAs, they may still miss numerous modification sites due to their intrinsic limitations (e.g., the incompleteness of CMC-labeling or the read mappability issue). Therefore, efficient computational approaches to identify Ψs accurately are especially needed to complement these experimental methods and facilitate the studies of pseudouridylation. The computational prediction of transcriptome-wide Ψ sites and characterization of their sequence contexts may also provide important hints in understanding the functional roles of pseudouridylation in RNA regulation. Although several computational approaches and web servers, such as PPUS [21] and iRNA-PseU [22], have been developed to predict novel Ψ sites, they either can only be applied to predict PUS-specific sites (i.e., can only predict PUS4-specific sites for human) or need to use the handcrafted features derived from the chemical properties of nucleotides.

Recently, deep learning techniques, especially convolutional neural networks (CNNs), have been widely used in genomic data analyses for extracting accurate sequence features [23–25]. CNNs were first developed for handwriting recognition and face identification [26], and have become one of the most famous and powerful learning models in the fields of computer vision, speech recognition, and natural language processing [27–29]. Despite its powerful predictive capacity,it remains unknown whether a CNN model can be used to effectively capture the contextual sequence features of pseudouridylation and accurately predict new Ψ sites.

In this study, we have developed a computational framework, called PseudoUridyLation Site Estimator (PULSE), to predict novel Ψ sites from large-scale profiling data of Ψs based on the sequence contexts of target sites. To our knowledge, our study is the first deep learning-based attempt to characterize the contextual sequence features of pseudouridylation by fully exploiting the currently available large-scale profiling data of Ψs. PULSE employs a CNN model, which contains two alternately-stacked convolution and pooling layers responsible for local feature extraction from the input contextual sequences and two fully-connected layers responsible for feature integration and estimation of the pseudouridylation potential of a candidate site. Tests on both human and mouse data have demonstrated that PULSE can achieve high prediction accuracy and significantly outperform other state-of-the-art prediction approaches. The new sequence features captured by PULSE are not only consistent with the recognized motifs of known PUSs,but also match the binding patterns of several nucleotide-binding proteins, which may provide useful hints for discovering new potential PUSs or associating proteins. In addition, the underlying sequence contexts of Ψs detected by PULSE offer an effective indicator to investigate the functional effects of single nucleotide variants(SNVs)on pseudouridylation,which may help reveal possible associations between pseudouridylation and complex diseases. Moreover, the trained PULSE model allows us to unveil the transcriptome-level characteristics of pseudouridylation. The prediction results of PULSE provide several new insights about the functional roles of pseudouridylation. For example, pseudouridylation is codon biased and rare codons are more likely to be pseudouridylated to achieve optimal mRNA translation. Also, our integrative analysis of ribosome profiling data demonstrated that pseudouridylation is involved in modulating the translation initiation and elongation processes. These results indicated that the predictions of PULSE may shed light on the underlying mechanisms and functional roles of pseudouridylation in post-transcriptional regulation.

Method

Data collection and pre-processing

The Ψ modification site data were downloaded from the RMBase database [30] which includes the high-throughput profiling data of Ψs collected from three recent experimental studies [17,18,20]. All the labeled Ψ sites were separated into a human dataset and a mouse dataset.In addition,the overlap between human and mouse datasets which represents the conserved Ψ sites was considered as a relatively reliable dataset for further model validation. Moreover, the Ψ sites which were identified by SCARLET from the recent experimental study[20] were also used for assessing the prediction accuracy of PULSE. All of the aforementioned modification sites were mapped to the reference genome (human: hg19; mouse:mm10)and those that cannot be mapped to thymines were discarded.A sequence of 101-nt length that covers the Ψ site and has a 50-nt window flanking on its both sides was labeled as a positive sample, while the sequence of the same length that is centered at a thymine that is closest to a corresponding Ψ site and does not have any overlap with any positive sample was labeled as a negative sample. In the end, we collected 7720 and 6166 samples in total for human and mouse, respectively.The ratio of positive and negative samples was close to 1(human: 3901 positive samplesvs. 3819 negative samples;mouse: 3057 positive samplesvs. 3109 negative samples). The sequence samples were then encoded into binary matrices as the input to our model using the one-hot encoding scheme.For the imbalanced testing datasets with 1:npositive-tonegative ratio(PNR),the positive samples were collected using the same way as we described above, while the negative samples were collected from the nearestnthymine sites that have no overlap with any positive samples.

Model design

We have designed a computational pipeline to fully characterize pseudouridylation (Figure 1A). To encode the contextual sequence features of a potential Ψ site of interest, we first extend the target site both upstream and downstream by 50 nt and then use a simple four-dimensional binary vector to encode each nucleotide (Figure 1A; File S1). Then, the encoded matrix of an input contextual sequence is fed into a particularly-designed CNN model to capture the latent features of the potential sequence determinants of a Ψ site. Our CNN model consists of two alternately-stacked convolution and pooling layers followed by a two-layer fully-connected network (Figure 1B; File S1). In particular, the convolution kernels from the convolution layers scan the input matrix that encodes the input sequence profiles and capture intrinsic hidden features about the local contextual patterns of the target site. The last fully-connected layer (also called the output layer)employs a softmax function to perform the classification task.

Figure 1 Overview of the PULSE pipeline

Overall, for a given sequencel, the information flow of PULSE can be abstracted into the following equation:

wherelPPS(l)represents the final prediction score of the target site,convi(),actii(),pooli(), andneti()stand for thei-th convolution, neuron activation, pooling, and full-connection operations, respectively. In our framework, the length of the input sequencelis set to 101,as we extend the target site both upstream and downstream by 50 nt.We used grid search with a cross-validation procedure [31] to calibrate the hyperparameters of our CNN model (see the ‘‘Training procedure and model evaluation” section).

Training procedure and model evaluation

The core of PULSE is a CNN consists of two alternatelystacked convolution and pooling layers and two fullyconnected layers. During convolution, the input matrices of dimensionL×4 (whereLstands for the length of the input sequences) are first cross-correlated with several convolution filters and then the convolved outputs are rectified by a Parametric Rectified Linear Unit(PReLU)activation function[32].In the pooling stage, the pooling operators are applied to the previous convolution and activation results for further motif extraction.After that,the pooled results are flattened to a vector which is then fed to a fully-connected neural network for final classification. In the final setting of PULSE, the sizes of the first and the second convolution layers are set to 4×8 and 1×8, respectively, and the sizes of both pooling layers are set to 1×2. The numbers of convolution operators in the first and second layers are set to 64 and 32, respectively,and the numbers of units in the hidden layers of the fullyconnected neural network are set to 64-64-1. We apply a 10-fold cross-validation strategy to determine the best values of hyperparameters, including the filter sizes, the filter numbers,the learning rate, the dropout probability, and the number of training epoches, and evaluate the prediction performance of our model. In particular, we first randomly separate entire data into 10 folds. For each fold, the held-out 1/10 dataset is used as testing data and the remaining 9/10 is used as training data. Meanwhile, in each fold, we further divide the training data into 10 subfolds, and run another (nested) 10-fold cross-validation procedure to choose the optimal settings of hyperparameters based on the trained model over 9/10 subsets and the evaluation result on the held-out 1/10 subset of the training data. After the optimal hyperparameters are determined using this nested cross-validation procedure, all the training data are merged and used to train the model and then evaluate the prediction performance on the held-out testing data in each fold. All the hyperparameters except the number of training epoches are computed through grid search. The prediction results over the 10 folds are collected together as the final prediction results. The human and mouse datasets are used independently to train two separate PULSE models(i.e., hPULSE and mPULSE). In our model, the hyperparameters of hPULSE and mPULSE are almost the same except the number of training epoches.

PULSE is implemented based on the Keras library(https://keras.io) in Python. Back propagation is applied in the training process for efficiently updating the parameters.In addition,several optimization techniques, including stochastic gradient descent (SGD), dropout, batch normalization, early stopping,and momentum,are used to improve the training process(e.g.,reducing the likelihood of overfitting).

Motif visualization and analysis

We apply the filters embedded in the first convolution layer of our CNN model to generate the sequence motifs of pseudouridylation, using the same strategy as previously described[23,33].More specifically,we use a window of the size equal to the length of the filters (i.e., 8) to scan the flanking regions on both sides of a Ψ site. During this scanning process, those sequence segments (of length 8) with activation values more than half of the maximum score are output. Then these detected sequence segments are converted into the position weight matrix (PWM) form to generate the corresponding motifs representing the sequence contexts of pseudouridylation. To compare these obtained motifs to those known binding patterns of RNA-binding proteins (RBPs) and transcription factors (TFs), we search over the CIS-BP [34]and HOCOMOCO [35] databases (version 2016 for both)using the Tomtom tool [36], respectively, and then cluster all the motifs using RSAT [37] with default parameter settings.The final sequence motifs are visualized using Seq2Logo [38].We also sort out all the generated motifs and perform a clustering analysis on them (File S1).

Transcriptome-wide detection of Ψ sites

We further apply PULSE to detect potential Ψ sites on each transcript along the genome.All the RNA sequences of human and mouse were downloaded from Ensembl by Biomart under references hg19 and mm10, respectively. For each transcript,every thymine site and the flanking 50-nt regions on its both sides are extracted as the input sequence profile to PULSE(‘N’s are padded if the flanking windows are out of the transcripts). Then PULSE computes the local pseudouridylation potential score (lPPS) for each thymine, which measures its pseudouridylation probability.

Transcript pseudouridylation potential score

To evaluate the pseudouridylation potential of a particular transcript,i.e.,the estimation of the overall pseudouridylation level of a complete transcript, we defined a new metric called the transcript pseudouridylation potential score (tPPS). In particular, for a transcripts, its tPPS value can be defined as follows:

in whichnum() represents a count function,I() represents a binary indicator function,lPPSkstands for the lPPS of thek-th Ψ site ins,Krepresents the total number of thymines ins,andLstands for the length ofs.In the aforementioned equation, the numerator represents the ratio between Ψs and thymines, which thus measures the relative abundance of possible Ψs in a transcript.However,this value may bias to those uridine-enriched transcripts. In order to eliminate such bias,the ratio in the numerator is further normalized by the abundance of both thymines and Ψs in the transcript.

Results

Model validation

To evaluate the prediction performance of PULSE,we applied a 10-fold cross-validation procedure on both human and mouse data (see Method). In our training process, the Ψ sites identified from high-throughput profiling experiments and the corresponding flanking regions of 50 nt on both sides of individual Ψ sites were considered positive samples, while uridine sites with flanking windows of 50 nt on both sides that are the closest to some Ψ sites and do not have any overlap with the positive samples were considered as the negative samples. We trained PULSE on human and mouse datasets separately,resulting in two trained models called hPULSE and mPULSE,respectively. We also compared the prediction performance of PULSE to that of a baseline approach, called gkm-SVM,which is a widely-used SVM-based classifier based on gappedk-mers that also uses only sequence information [39]. Our 10-fold cross-validation tests showed that both hPULSE and mPULSE can achieve high prediction accuracy, with the area under the receiver operating characteristic curve(AUC)scores 0.86 and 0.84,respectively,which were significantly better than those of gkm-SVM (Figure 2A and B, Figure S1A and B).We further validated PULSE on several small but reliable datasets,which also displayed high prediction accuracy (File S1).

Previous studies showed that the pseudouridylation profiles of transcriptome across human and mouse were conserved to some extent despite the possible difference in the underlying mechanisms of pseudouridylation [20]. Thus, we performed a cross-species test between human and mouse datasets, that is,we used the PULSE model trained from the human(mouse)pseudouridylation profiles to test the mouse(human)data.As expected, such a cross-species test demonstrated a strong conservation relationship between human and mouse in pseudouridylation (Figure 2C, Figure S1C). In addition, this cross-species validation test also implied an impressive generalization capacity of PULSE in predicting new Ψ sites.

Figure 2 Performance evaluation of PULSE

So far, our models have shown good performances on balanced datasets (i.e., PNR is 1:1). However, in the real world,usually there are much more unmodified or undetected uridines in RNA transcripts than the modified ones, which means that the previous balanced tests may overestimate the precision of the models.To solve this problem,we retrained our models with the same hyperparameters as we searched before and then tested several imbalanced testing sets with different PNRs (including 1:1,1:5,1:10,and 1:20).The tests on these imbalanced datasets also showed competitive performances of our models(Figure 2D and E, Figure S1D and E).

Since the Ψ sites in our training data integrated from several different studies only showed small overlap between each other (Table S1), the trained model may be biased to specific data sources or experiments.To investigate this potential problem and further verify the generalization capacity of our model, we separated all the Ψ sites into independent datasets according to their sources(i.e.,GSE58200[40],GSE60047[18],GSE63655 [20], snoRNABase [41], and Modomics [2]). Then we held out individual datasets as testing data and retrained our model on the remaining datasets. This additional test on individual held-out datasets also showed a high prediction accuracy of our model except on the Modomics of mouse(Figure 2F and G, Figure S1F and G). The reason that held-out Modomics data as testing data in mouse did not yield good prediction performance was probably due to the lack of Ψ sites in small RNAs in the corresponding training data (Table S2).

We further tested PULSE on three small but relatively more reliable datasets(Tables S3 and S4),which also displayed similar high prediction performance (File S1). In addition, we compared PULSE to other state-of-the-art methods,including PPUS [21] and iRNA-PseU [22]. Specifically, we first directly compared the cross-validation results of our models to those of PPUS and iRNA-PseU evaluated on the whole balanced dataset. The results showed that our models performed much better than the others (File S1). To further compare our models to PPUS and iRNA-PseU on imbalanced datasets, we retrained our models and tested them on four imbalanced datasets (i.e., with PNRs 1:1, 1:5, 1:10, and 1:20, respectively)which did not overlap with the training data.Expectedly,these comparison results also showed that our models performed much better than the others (Table 1). In addition, we further evaluated the performances of PPUS and iRNA-PseU on the aforementioned three reliable datasets, which also supported the superior predictive power of our models over the others(File S1).

Table 1 Comparison between PULSE and the previous methods (PPUS and iRNA-PseU) on independent imbalanced datasets

In summary, the aforementioned validation tests implied that PULSE can effectively recognize the underlying latent features of the sequence contexts of pseudouridylation and thus yield accurate prediction of Ψ sites.

The sequence contexts of pseudouridylation captured by PULSE

After PULSE was evaluated, we analyzed and visualized the motifs of the sequence contexts of pseudouridylation captured by the filters employed in the first convolution layer of PULSE, using the same strategy as in the previous studies[23,33]. In particular, we focused on those high-confident motifs that covered more than 1% (about 50) of the positive samples in the training data. As a consequence, we obtained 300 and 272 sequence patterns identified by PULSE for human and mouse,respectively(see Method).These sequence patterns were then clustered into tens of clusters which may imply different subtypes of pseudouridylation (Figure S2; File S1).

As expected, we found that the previously known sequence recognition motifs of PUS4 and PUS7 [20,40],i.e.,‘GUΨCNA’ and ‘UGΨAG’, appeared repetitively in the sequence patterns identified by the filters of our CNN model for both human and mouse (Figure 3A and B). Intriguingly,several novel motifs also appeared repetitively in our visualization results. We hypothesized that these motifs may correspond to other PUSs or recognition proteins. Thus, we mapped our discovered motifs to the known binding motifs of RBPs from the CIS-BP database [34] and TFs from the HOCOMOCO database[35](see Method).As a result,several of these newly discovered sequence motifs of pseudouridylation significantly matched the known binding motifs of nucleotide-binding proteins for both human and mouse(P<1×10-3;Figure 3C and D).We found that these matching motifs were highly related to important RBPs and TFs,e.g., PCBP1 (an RBP involved in the regulation of alternative splicing[42])and FOXO3(a TF acting as a trigger of apoptosis[43]).Moreover,our model also captured the RNA-binding motif of U2AF, which has been reported to lead to a splicingdefect when failing to recognize the pseudouridylated polypyrimidine tract [44]. These discovered novel motifs that matched the known binding sequence patterns of RBPs implied that the corresponding RBPs may play important functional roles in the pseudouridylation process, which thus may also provide new candidate molecules of PUSs for further experimental studies. Since previous studies have shown that RNAs can also be co-transcriptionally modified [45], the TFs with the matching sequence motifs may be related to the triggers of pseudouridylation during RNA transcription.

Figure 3 Examples of the sequence motifs of pseudouridylation identified by PULSE

The transcriptome landscape of Ψs characterized by PULSE

Each uridine in a transcript can be characterized by an lPPS derived from the trained PULSE model based on its corresponding sequence context.Basically,this lPPS value measures the probability that a uridine can be converted into a Ψ.Based on the distribution of uridines on a transcript and the corresponding lPPS profiles, we derived a new metric, called tPPS(see Method), to estimate the overall pseudouridylation level of this transcript.Based on the lPPS and tPPS profiles derived from PULSE, we are able to study the nucleotide- and transcript-level landscapes of pseudouridylation, respectively.

To examine the transcriptome-wide distribution of pseudouridylation (see Method for transcriptome-wide detection of Ψ sites) across different genomic regions, we compared the percentages of pseudouridylation predicted by PULSE among different types of regions, including 5′UTRs, CDS regions,and 3′UTRs.Our comparison showed that Ψs appear primarily in the CDS regions (~50%) and the 3′UTRs(~40%) of both human and mouse mRNAs (Figure 4A),which was consistent with the previous reports[20,40].Considering that the region length and the uridine content may affect the distributions of Ψs in different mRNA regions, we also normalized the aforementioned proportions by the region length and corresponding Ψ ratio (Ψ/T), and observed the similar trend (Figure S3). As the 3′UTRs of mRNAs are tightly associated with RNA stability and translational control[46,47], it is reasonable to hypothesize that the pseudouridylation activities in the 3′UTRs are involved in RNA stability modulation and translational regulation.

Pseudouridylation in RNAs has been considered to play important roles in secondary structure stabilization [4,6]. The structural functions of Ψs in rRNAs and tRNAs have already been relatively well studied[10,48–50],and a noticeable observation in the previous studies was that the experimentally detected Ψs are largely located in the loop regions of an RNA secondary structure. However, the functional roles of Ψs in the structures of other types of RNAs remain poorly understood. Here, we are interested in whether Ψs in other types of RNAs play the same roles in the regulation of RNA secondary structures. We first used the RNAfold software[51] to predict the RNA secondary structures of individual sequences centered at putative Ψ sites of the transcripts with high tPPS values(top 500)predicted by PULSE(see Method).We then compared lPPS profiles between single-strand (SS)and double-strand(DS)regions over different types of RNAs,including lincRNAs, pre-miRNAs, and mRNAs, whose structures are generally more diverse than those of tRNAs and rRNAs. We found that Ψs prefer to occur in SS regions over DS regions in all three types of RNAs for both human and mouse [Figure 4B and C;P<1×10-200, rank-sum test;n(human,lincRNA,SS)=20,918,n(human,lincRNA,DS)=37,852,n(human,pre-miRNA,SS)=3302,n(human,pre-miRNA,DS)=7597,n(human,mRNA,SS)=16,717,n(human,mRNA,DS)=34,357,n(mouse,lincRNA,SS)=28,260,n(mouse,lincRNA,DS)=48,402,n(mouse,pre-miRNA,SS)=3355,n(mouse,pre-miRNA,DS)=7194,n(mouse,mRNA,SS)=18,193,n(mouse,mRNA,DS)=32,270]. In addition, we applied our model to predict the profiles of tRNAs annotated by tRNAscan-SE [52], and found that Ψs also prefer to occur in SS regions of human tRNAs[Figure S4;P=4.30×10-31, rank-sum test;n(human,SS)=6357,n(human,DS)=4513]. For mouse tRNAs, we did not observe the same trend [Figure S4;n(mouse,SS)=4704,n(mouse,DS)=3555], which was probably due to the lack of Ψ sites on small RNAs in the mouse training data (Table S2). We also looked into the predicted Ψ sites of a tRNA corresponding to alanine(tRNAdb ID: tdbR00000017) and compared them to those experimentally validated sites. We found that our model exactly detected two experimentally reported Ψ sites and one potential novel site (Figure S5). Overall, our analysis results were mostly in line with the previous known distributions of Ψs in tRNAs [8,10]. Such similar patterns of Ψ distributions in RNA secondary structures implied that most likely the functional roles of pseudouridylation in regulating RNA structures are generic across all types of RNAs.

To explain the discrepancy in the distributions of Ψs in different types of RNA secondary structures,we hypothesize that Ψs may play an important role in stabilizing or rigidifying RNA secondary structures (Figure 4D). Such a hypothesis is also supported by the previous studies [4,6,7,10] which have shown that Ψs in SS RNAs may interact relatively more easily with other nucleotides to constrain the corresponding loop regions and form more stable conformations. More specially,Ψs in an inner loop (e.g., hairpin loop or internal loop) may pair with nearby nucleotides in space to help stabilize the loop structure,while Ψs in external or flanking SS regions may contribute to supporting and immobilizing their proximal inner loops. Although we cannot rule out that this phenomenon may be caused by the bias of the antibodies used in Ψ identification experiments, which may have different binding affinities to the Ψ sites between RNA SS and DS regions, our new analysis results covered pre-miRNAs,lincRNAs,and mRNAs,and previous known distributions in tRNAs and rRNAs strongly supported our hypothesis.

Figure 4 The transcriptome-level characteristics of pseudouridylation predicted by PULSE

Moreover, the Gene Ontology (GO) enrichment analyses for the top 500 genes with the highest tPPS values for both human and mouse showed that genes with high tPPS values predicted by PULSE are mainly distributed in the nucleus and contribute to DNA or RNA binding (Figure 4E and F;File S1). This phenomenon implied that Ψs in mRNAs may also enhance the bindings between nucleotide-binding proteins and RNAs by increasing their interaction strength and forming more stable complex conformations,which was consistent with the previous results about the potential functions of pseudouridylation in RNA secondary structure and translational regulation [4,6,53].

Pseudouridylation serves as an additional factor in controlling mRNA stability

Previous studies have suggested that pseudouridylation may play an important role in enhancing mRNA stability [4,18],which is probably modulated through the 3′UTRs of mRNAs.To examine more details of this issue, we analyzed the potential relationships between the predicted pseudouridylation potentials of the 3′UTRs of mRNA transcripts and their half-lives. In particular, we first applied PULSE to compute the tPPS values of the 3′UTRs for those transcripts with known half-life information (File S1). We then divided these transcripts into two groups, with the tPPS values of 3′UTRs greater or less than the average level,respectively.The comparison between these two groups showed that mRNAs with higher tPPS values of 3′UTRs tend to have relatively longer half-lives [Figure 5A;P=4.71×10-7, rank-sum test;n(HightPPS)=3215,n(LowtPPS)=4260]. We also performed a similar analysis on the relationships between the tPPS values of CDS regions and mRNA half-lives, but did not observe any significant effect of Ψs in the CDS regions on mRNA half-lives [Figure 5B;P=0.17, rank-sum test;n(HightPPS)=3264,n(LowtPPS)=4217]. These results indicated that Ψs in the 3′UTRs of mRNAs may improve their stability, which was also supported by the previous study [53]. On the other hand, previous studies also reported that the length and GCcontent of the 3′UTR of an mRNA can affect its stability,that is, short 3′UTR and high GC-content can promote mRNA decay through distinct mechanisms [54,55]. To decouple the effects of these two factors and pseudouridylation on RNA stability, we also investigated the relationships between the tPPS values of 3ʹUTRs and their lengths and GC-contents.As a result, we observed that those 3′UTRs with larger tPPS values are significantly shorter and tend to have higher GC-contents [Figure 5C and D;P=3.07×10-53for the length of 3ʹUTR,P=1.71×10-267for the GC-content of 3ʹUTR,rank-sum test;n(HightPPS)=3215,n(LowtPPS)=4260],which implied that Ψs in 3′UTRs can compensate for the down-regulation effects of RNA stability caused by their short lengths and high GC-contents. To further verify this relationship, we performed additional analyses on several other curated RNA half-life datasets.The additional analysis results suggested that the relationship mentioned above is probably robust to different cell lines (Figure S6; File S1).

Pseudouridylation fine-tunes the effects of codon bias to maintain translation efficiency

It has been reported that Ψs in mRNAs may affect their translation fidelity[12].Moreover,when uridines in stop codons are pseudouridylated,ribosomes may read through without translation termination,that is,tRNAs can also bind to these modified stop codons and continue the translation process [56].These previous studies highlighted the potential regulatory functions of Ψs in translation through changing the properties of the corresponding codons.The distinct distributions of lPPS values of uridines in different codons at individual positions may also support the aforementioned potential regulatory functions of pseudouridylation (Figures S7–S9). Here, we investigated the relationships between the lPPS values of individual uridine-containing codons and different indices of codon usage bias, including the tRNA adaptation index(tAI) [57], the codon adaptation index (CAI) [58], and the%MinMax metric (Table S5; File S1). In general, RNA regions with loose structures where ribosomes can move forward relatively more rapidly tend to have higher tAI and CAI values. On the other hand, pseudouridylation may act as a stumbling block to impede ribosome movement by increasing the rigidity of local conformations during a translation process. Thus, we speculate that pseudouridylation is more likely to happen in codons with relatively lower tAI and CAI values. To validate this speculation, we divided all codons into two groups according to their lPPS values,i.e.,based on whether their lPPS values were greater than the average level or not.We then compared the tAI and CAI values of the codons between these two groups,respectively.As a result,codons with higher lPPS values displayed significantly lower tAI and CAI values for both human and mouse (Figure 5E and F;P<1×10-200, rank-sum test). These results implied that pseudouridylation prefers to occur in relatively rare codons or codons with lower supply of their corresponding tRNAs. Furthermore, for codon rareness, we considered the%MinMax metric that has been developed to evaluate the relative rareness of codons in a coding sequence [59].We performed the same analysis for %MinMax as we conducted for tAI and CAI. The comparison of the %MinMax values between codons with high and low lPPS values demonstrated that Ψs also prefer to occur in relatively rare codons(Figure 5E and F;P<1×10-200, rank-sum test), which was consistent with our previous results. In summary, our analysis results showed that pseudouridylation prefers to occur in rare codons,which suggested that pseudouridylation may be involved in controlling the rhythm of translation and perhaps the co-translational folding of the nascent peptide chains.

Figure 5 The functional roles of pseudouridylation inferred by PULSE

Pseudouridylation modulates the translation initiation and elongation processes

It has been revealed that pseudouridylation in mRNAs can enhance the translational capacity [53], which implys that Ψs may play an important role in mRNA translation. Previously,we showed that pseudouridylation is codon biased and may be involved in translation regulation through codon fine-tuning.To further investigate whether pseudouridylation may participate in the modulation of translation rates,we also performed an integrative analysis by combining PULSE prediction results with ribosome profiling data that describe the translation initiation and elongation processes (File S1).

To reveal the functional roles of pseudouridylation in translation initiation,we first collected the human ribosome profiles of translation initiation positions (File S1) and selected initiation codon sites near uridines (i.e., within the range of +/-1 codon). After that, we ran the PULSE model to calculate the lPPS values of the flanking sequences centered at the uridines that were closest to the selected initiation codon sites.Next,we extracted two groups of codons from these selected initiation positions, which had lower than 25% quantile (termed by‘< 1st Qu.’) and greater than 75% quantile (termed by‘> 3rd Qu.’) of lPPS values, respectively (File S1). We then compared the normalized ribosome densities (NRDs) of the codon sites in these two groups. The comparison showed that the Ψ sites with higher lPPS values are more likely to be located in regions with lower initiation ribosome densities(Figure 5G;P<1×10-200, rank-sum test). This result suggested that Ψs in the translation initiation regions may help reduce the accumulation of ribosomes.We speculated that this may be realized by attracting the rRNAs in the small subunits(SSUs)of ribosomes through the formation of extra hydrogen bonds,which can thus accelerate the installing process of initiating ribosomes (Figure 5J).

To decipher the impact of pseudouridylation on the translation elongation process, we explored the human ribosome profiles of translation elongation obtained from a previous study [60], and integrated them with the PULSE prediction results. We first processed the ribosome profiling data using the same strategy as we conducted in the analysis of translation initiation,except that here we mainly focused on the codons in CDS regions. Unsurprisingly, we found that pseudouridylation prefers to occur in the regions with relatively slow elongation rates (Figure 5H;P<1×10-200, rank-sum test). This result implied that Ψs in mRNAs may help modulate the translation elongation rates.We speculated that Ψs in mRNAs may serve to affect the elongation process by dragging the aminoacyl tRNAs to the A sites of ribosomes during elongation through the force of extra hydrogen bonds (Figure 5J).

To investigate more details about the relationships between pseudouridylation and translation, we further compared the ribosome densities at different lPPS thresholds ranging from 0.1 to 1.0 with an increment of 0.1. We found that the ribosome densities significantly decreased along the lPPS values from 0.1 to 1.0 (Figure 5I). This result demonstrated the robustness of the relationships between pseudouridylation and translation that we previously claimed.

Figure 6 Relationships between pseudouridylation and SNVs

In summary,Ψs in mRNAs are involved in modulating the translation process, including both initiation and elongation,probably by strengthening the bindings of ribosomes or tRNAs to mRNAs. Of course, unveiling the detailed underlying mechanisms will still require further extensive experimental studies.

Relationships between pseudouridylation and SNVs

The sequence contexts of Ψs captured by PULSE can enable us to investigate the functional effects of SNVs on pseudouridylation. To demonstrate this point, we first applied PULSE to predict the lPPS profiles of the major alleles and the corresponding minor alleles for SNVs that have been annotated by the current genome-wide association studies(GWAS)and validated by 1000Genomes (File S1). Next, we calculated the log2fold change of lPPS values between major and minor alleles, which was termed as the allele fold change of pseudouridylation potential (AFCP; File S1). Interestingly, we found that when the T allele was replaced by another allele(i.e., A, C, or G), the corresponding lPPS values dropped significantly. On the other hand, when the other alleles were replaced by the T allele, the corresponding lPPS values increased significantly (Figure 6A). This result implied that, a uridine site with a T allele in its contextual sequence is more likely to be pseudouridylated than another one with other alleles in its contextual sequence.

Previous studies showed that SNVs in RNAs may affect stabilities of their secondary structures[61–63].To investigate the relationships between pseudouridylation and the effects of SNVs on RNA structure stability, we first used RNAfold [51]to estimate the minimum free energy (MFE) values of the sequences of both major and minor alleles and calculated the change of MFE(denoted by ΔMFE;File S1).We then compared the minor allele frequency(MAF)values between two groups of SNVs, which had ΔMFE greater than 1.0 kCal/Mol and less than -1.0 kCal/Mol, respectively. These two thresholds were chosen according to the 80% and 20% quantiles of all the ΔMFE values.The comparison result showed that alleles with high MAF values are more likely to occur in relatively unstable RNA regions,i.e., with relatively high MFE, which generally correspond to flexible RNA regions,e.g.,SS regions[Figure 6B;P=4.87×10-9, rank-sum test;n(ΔMFE>1.0)=3248,n(ΔMFE<-1.0)=3347]. From this observation, we speculate that SNVs may act on RNA structures through affecting the pseudouridylation profiles of their corresponding sequences.To study this issue,we first compared the MAF values between two groups of SNVs,which had AFCP values greater than 1.0 and less than-1.0,respectively.This comparison showed that SNVs with larger MAF values are more likely to be pseudouridylated[Figure 6C;P=4.85×10-6,rank-sum test;n(AFCP>1.0)=2207,n(AFCP<-1.0)=2568]. In addition, we compared the ΔMFE values between two groups of SNVs with AFCP values greater than 0.0 and less than 0.0,respectively.As a result,we found that alleles with larger predicted pseudouridylation potentials are associated with relatively higher MFE values [Figure 6D;P=1.23×10-33, rank-sum test;n(AFCP>0.0)=7853,n(AFCP<0.0)=8661]. Note that here,although the energy gap(ΔMFE)between major and minor alleles was small(i.e.,within a range of ~3 kCal/Mol),it can lead to a dramatic transformation of RNA secondary structure(Figure S10).In addition to the aforementioned results,we also performed similar analyses based on the prediction results from remuRNA[64]and RNAsnp[65],which predict an ensemble of RNA secondary structures,andobserved similar trends(Figures S11 and S12). Overall, our analyses indicated that SNVs can affect the pseudouridylation potentials of RNA sequences to change the stability of their secondary structures.

We also performed an enrichment analysis of the variants with relatively high AFCP values(i.e.,|AFCP|>2.0)over different types of genomic regions. The enrichment analysis results showed that SNVs with high AFCP values are significantly enriched in the splice donor regions (Figure 6E;P=7.04×10-3, hypergeometric test) and depleted in the 3′UTRs (Figure 6E;P=5.34×10-3, hypergeometric test),when compared to the background (i.e., the set of all SNVs used in the analysis).These results implied that pseudouridylation may be relatively more sensitive to SNVs in the contextual sequences that are related to RNA splicing and regulatory functions of 3′UTRs. We also combined the traits associated to the SNVs with the predicted lPPS values to illustrate latent relations between pseudouridylation and important phenotypes,such as complex diseases.In particular, we first selected the top 10 variants with the highest absolute AFCP values,including rs7190997, rs11073328, rs4846923, rs10007186,rs11635553, rs72807343, rs12150660, rs7286917, rs7089227,and rs1934179. We then investigated the disease traits associated with these 10 selected SNVs. Interestingly, we found that these 10 variants were mainly associated with immune system lesions (Figure 6F). For example, rs11073328 and rs11635553 have been reported to be related to rheumatoid arthritis [66]and IgG glycosylation [67], respectively. This result implied that pseudouridylation may play an important role in the immune system.

Discussion

Based on PULSE we showed that pseudouridylation is codon biased and closely related to RNA translation. We believed that these relationships were constructed during a long evolution process of epigenome. Actually, it has been widely believed that rare codons appear mainly due to the codon usage bias resulting from mutation and natural translation selection[68].Codon usage plays an essential role in regulating multiple levels of cellular processes, such as translation and protein folding. During the translation process, codons are carefully selected to achieve accurate translation and thus optimal cellular fitness to a certain context,e.g., expression of a certain gene in a certain organism or under certain conditions[69]. Despite that rare codons in an mRNA transcript may decrease its translation efficiency,they are generally important for regulating protein folding and RNA stability [70,71].Although codon usage under selection pressure during the evolution process can fine-tune gene expression,it may not be able to have a quick response to a sudden change caused by environmental stimulation. On the other hand, pseudouridylation can provide an additional factor to further fine-tune the translation process. Under a certain cellular condition, pseudouridylation can increase the translation speed of original rare codons probably through strengthening the binding of ribosomes or tRNAs to mRNAs, to ensure the efficient translation of functionally important residues in proteins, when responding to dramatic environmental changes. Those rare codons without such a fine-tuning function may die out during molecular evolution and their functions in the control of translation may be lost. Thus, from an evolutionary point of view,selecting rare codons for pseudouridylation may promote the conservation of certain rare codons in the genome.

Our analyses indicated that pseudouridylation can fine-tune RNA stability. Note that there are many potential biological factors that can affect RNA stability, such as the regulation of RBPs, RNA modification, polyadenylation, and miRNAmediated regulation. Thus, most likely the half-life of an RNA is influenced by a mixed effect of all these factors. Our results indicated that although with a small effect(Figure 4A),pseudouridylation can significantly contribute to the regulation of RNA stability.Such a finding has an important biological implication. In fact, in the literature, similar phenomena have also been observed. For example, it has been found thatN6-methyladenosine can significantly modulate mRNA translation efficiency, although only with a marginal effect [72].

RNA pseudouridylation is obviously crucial to RNA regulation simply by its prevalence in transcriptome and its high conservation across different species.Therefore,a comprehensive understanding of RNA pseudouridylation will be conducive to the consummate studies of RNA modifications and RNA epigenetics.The studies of RNA pseudouridylation especially in mRNAs may help understand its functional roles in post-transcriptional regulation. Given the complication of the underlying pseudouridylation mechanisms (e.g., there are at least 13 types of PUSs)and the limitations of current experimental profiling techniques, it is generally difficult to explore the biological functions of pseudouridylation through conventional experimental methods (e.g., gene knockdown experiment). This challenge can also partially explain why pseudouridylation is relatively less studied compared to other RNA modifications.Here,our proposed model provides a natural method to unify all current available Ψ profiling data and fully exploit the underlying contextual sequence features of pseudouridylation. Such a strategy can take advantage of the learning and predictive power of the CNN model to reveal the characteristics and potential functional roles of pseudouridylation by connecting the prediction results to the profiles of other biological factors or processes (e.g., RNA secondary structure, translation initiation and elongation),which can provide useful hints into understanding the underlying mechanisms of pseudouridylation.

Conclusion

In this study, we developed an effective CNN model to detect Ψ sites, based on which we further analyzed the landscape of Ψs across the human and mouse transcriptomes. Our model can not only capture the known motifs of pseudouridylation that were consistent with previous studies, but also reveal novel sequence patterns that may help uncover potential new PUSs. The analysis of the associations between SNVs and the changes of pseudouridylation potentials based on the sequence contexts captured by our model showed that pseudouridylation may be involved in several complex diseases,such as rheumatoid arthritis (associate trait of the rs11073328 SNV) and Alzheimer’s disease (associate trait of the rs72807343 SNV) [73]. Our extensive analysis on the relationships between predicted pseudouridylation scores and different types of RNA secondary structures showed that Ψs are more likely to occur in SS RNA regions rather than DS RNA regions, which led to a speculation that Ψs may act as an anchor to stabilize or rigidify RNA structures. Comparison of half-lives between mRNA transcripts with high and low tPPS values of their 3′UTRs derived from the prediction results showed that Ψs in the 3′UTRs of mRNA transcripts may enhance their stability. Also, the GO enrichment analysis of genes with high pseudouridylation scores predicted by our model may provide useful hints for understanding the biological functions of pseudouridylation.We also showed that pseudouridylation is codon biased and uridines in rare codons are more likely to be pseudouridylated, which may serve as an important regulatory strategy for achieving optimal mRNA translation. Comparisons of ribosome occupancy densities between positions with high and low pseudouridylation potentials predicted by our model for both translation initiation and elongation showed that pseudouridylation often occurs in the ribosome sparse regions, which implied that Ψs may promote the translation process by enhancing the interactions between ribosomes and RNAs. We believe that these results can provide novel insights into the studies of pseudouridylation and our computational framework can also inspire studies on other types of RNA modifications.

Code availability

The source code, the data files used in the analyses, and the final PULSE model can be downloaded from https://github.com/mlcb-thu/PULSE.

CRediT author statement

Xuan He:Conceptualization, Methodology, Software,Formal analysis,Investigation,Data curation,Writing-original draft,Visualization.Sai Zhang:Methodology, Supervision.Yanqing Zhang:Data curation.Zhixin Lei:Validation.Tao Jiang:Supervision, Writing - review & editing, Funding acquisition.Jianyang Zeng:Conceptualization, Methodology, Writing -review&editing,Supervision,Project administration,Funding acquisition. All authors have read and approved the final manuscript.

Competing interests

The authors have declared no competing interests.

Acknowledgments

We are grateful to Fangping Wan,Hailin Hu,and Guangxiang Zhu of Jianyang Zeng’s lab for their helpful suggestions about this work. We thank Prof. Chengqi Yi (Peking University,China) for the helpful discussions and suggestions on our work.This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61472205 and 81630103), the US National Science Foundation (Grant Nos.DBI-1262107 and IIS-1646333), the China’s Youth 1000-Talent Program, and the Beijing Advanced Innovation Center for Structural Biology.

Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2019.11.015.

ORCID

ORCID 0000-0003-1121-9449 (Xuan He)

ORCID 0000-0001-5996-6086 (Sai Zhang)

ORCID 0000-0001-9849-8422 (Yanqing Zhang)

ORCID 0000-0002-1221-487X (Zhixin Lei)

ORCID 0000-0003-3833-4498 (Tao Jiang)

ORCID 0000-0003-0950-7716 (Jianyang Zeng)