Xudong Zou, Xin Gao*, Wei Chen*
1 Department of Biology, Southern University of Science and Technology, Shenzhen 518055, China
2 Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
The ever-increasing volume and dimensionality of genomics data on the one hand challenge traditional data analysis approaches, and on the other hand provide ample opportunities for developing novel analytic strategies. In recent years, deep learning has been driving the next wave of artificial intelligence and machine learning. Now, Yi Xing's lab has reported DARTS [1], a novel computational framework that leverages the power of both deep learning and a Bayesian hierarchical framework for differential alternative splicing (AS) analysis. Trained on the huge volume of publicly available RNA-seq datasets, DARTS could largely increase the accuracy of AS analysis, in particular for datasets with low sequencing depth, by taking both genomic features and expression levels of RNA-binding proteins (RBPs) into consideration.
In higher eukaryotes, the vast majority of protein-coding genes are transcribed into precursor mRNA (pre-mRNA) containing exons and introns; the introns need to be removed by the splicing machinery to generate mature mRNA. Often the transcript can be spliced in different ways, leading to different combinations of exons. AS contributes to the variety of the cellular proteome as well as to the fine-tuning of gene expression at the post-transcriptional level. The regulation of AS is mediated by the interaction between cis-elements around the splice sites, in both exonic and intronic regions, and trans-acting factors that bind to specific cis-elements. It has been shown that AS plays critical roles in a variety of physio-pathological processes.
In the last decade, the amount of RNA-seq data has soared, providing valuable resources for extensive studies of transcriptional and post-transcriptional regulation. In addition to providing information on RNA abundance, RNA-seq data can also be used to infer the AS pattern, and more often to identify differential AS between different samples, such as those from different developmental stages, normal vs. disease, as well as control vs. treatment. For the latter, many computational methods have been developed. The common strategy underlying these methods is to use the number of RNA-seq reads exclusively supporting either isoform to estimate an abundance ratio between the two spliced isoforms, i.e., the inclusion and exclusion isoforms, and then perform a statistical test to determine whether the splicing pattern between the two samples is significantly different or not. Although implemented with different statistical frameworks, all these methods encounter high uncertainty for splicing events sampled with low sequencing coverage. Therefore, the sensitivity is rather limited in detecting differential AS for lowly expressed genes. Moreover, many currently available RNA-seq datasets, originally designed only for differential analysis of gene expression, are often of low sequencing depth, which is insufficient for AS analysis even for moderately expressed genes. Harnessing these valuable resources for AS studies warrants novel analytic strategies.
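The coverage problem described above can be illustrated with a minimal sketch. It estimates the inclusion ratio from read counts and attaches a Wilson score interval to it. The function names are hypothetical, and the estimate is deliberately simplified: it assumes raw read counts only and ignores the isoform effective-length normalization that dedicated tools typically apply.

```python
import math


def psi_estimate(inclusion_reads, exclusion_reads):
    """Naive inclusion ratio: fraction of isoform-specific reads supporting inclusion.

    Simplifying assumption: no correction for isoform effective length.
    """
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no isoform-specific reads for this event")
    return inclusion_reads / total


def wilson_interval(inclusion_reads, exclusion_reads, z=1.96):
    """Approximate 95% Wilson score interval for the inclusion ratio."""
    n = inclusion_reads + exclusion_reads
    if n == 0:
        raise ValueError("no isoform-specific reads for this event")
    p = inclusion_reads / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same point estimate, very different certainty.
low_cov = wilson_interval(8, 2)       # 10 reads in total
high_cov = wilson_interval(800, 200)  # 1000 reads in total
```

Both events have the same point estimate of 0.8, but the interval for the 10-read event is far wider than for the 1000-read event, which is why a test based on read counts alone loses sensitivity at low depth.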
Deep learning has recently reemerged in various fields (e.g., image recognition and language processing) with great success. Unlike traditional machine learning algorithms, deep learning trains both the feature extractor and the classifier simultaneously. The high model complexity caused by the extraordinarily large number of parameters makes deep learning models data-hungry. Such high model flexibility, on the other hand, together with powerful optimization algorithms, enables deep learning to achieve state-of-the-art performance on a wide spectrum of applications where large datasets are available, such as computer vision, natural language processing, and genomics.
The seminal work on developing deep learning methods to decipher the splicing code was done by Leung and colleagues [2]. They studied the tissue-specific splicing code of five tissues in mice. For each exon, their model takes 1393 manually extracted features as inputs, including those from the exon, neighboring introns, and adjacent exons, as well as tissue type indicators, and predicts the range (low, medium, or high) of the percentage of isoforms including that exon (the percent spliced in (PSI) value) and the ΔPSI between two tissues. In a follow-up study, Xiong et al. improved the model to predict the exact value of PSI by using the same set of features and applied the model to detect splicing-affecting variants that are associated with human diseases [3]. Recently, Bretschneider et al. further developed four different deep learning models to predict alternative acceptor sites and alternative donor sites [4]. In contrast to the previous work from the same lab, Bretschneider et al. leveraged the power of deep learning to automatically extract important features from raw DNA sequences and built models to simultaneously predict the PSI values of all the alternative sites with an accuracy of 70%. More recently, Jaganathan et al. also developed a deep learning method to predict whether each position in the transcript could function as a splice donor, a splice acceptor, or neither [5]. Compared to previous methods that relied on human-designed features or considered only short nucleotide windows adjoining exon-intron boundaries, this method learns splicing determinants from 10,000 nucleotides around each candidate position, with a 95% top-k accuracy. Nonetheless, although AS is regulated by the interplay between cis-regulatory elements and trans-acting factors, these deep learning models mostly focused only on the contribution of cis sequence features and largely ignored the trans environment. As a result, they could not, for instance, detect any differential AS between two samples with the same genomic sequences but under different conditions.
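Models that learn from raw DNA sequence, as mentioned above, need the nucleotides in numeric form. A common choice is one-hot encoding, sketched below. This is a generic illustration, not the preprocessing of any of the cited methods; the function name and the all-zero treatment of ambiguous bases are assumptions.

```python
def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Assumption: any base other than A/C/G/T (e.g. N) becomes an all-zero row.
    """
    table = {
        "A": (1, 0, 0, 0),
        "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
    }
    unknown = (0, 0, 0, 0)
    # list(...) gives each position its own row so rows can be edited independently.
    return [list(table.get(base, unknown)) for base in sequence.upper()]


# A short window around a hypothetical splice site.
encoded = one_hot_encode("caGGTaagn")
```

Because such an encoding depends only on the sequence, two samples with identical genomes yield identical inputs, which is the limitation of cis-only models noted above.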
The new tool, DARTS, mainly consists of two components, i.e., a deep learning model (DARTS DNN) that estimates the prior probability and a likelihood estimator (DARTS BHT) that is based on the prior probability as well as RNA-seq read counts. Before training DARTS DNN, large-scale RNA-seq data are first analyzed by DARTS BHT with an uninformative prior to generate a high-confidence labeled training dataset that contains both differential and unchanged splicing events between conditions. The labeled training dataset is then used for training DARTS DNN. In contrast to the aforementioned deep learning-based AS methods, this deep learning module incorporates not only the cis-elements from the primary genomic sequences but also the trans-elements, represented by the expression levels of 1498 splicing-relevant RBPs. Zhang et al. first evaluated the performance of DARTS on the test data corresponding to leave-out RBPs and showed that DARTS outperformed the baseline methods. They then applied DARTS to two cell lines to infer cell-type-specific splicing events and found that the performance of DARTS BHT with an informative prior probability is better than that without the prior, demonstrating that incorporating the DNN prediction as an informative prior improves the performance of DARTS BHT in detecting differential splicing. To further demonstrate the power of DARTS DNN on other RNA-seq datasets, they trained three DARTS DNN models using ENCODE data only, Roadmap data only, and their combination, respectively.
They found that the model trained on ENCODE data has high predictive power for the ENCODE leave-out datasets but modest predictive power for the Roadmap leave-out datasets, and vice versa, while the model trained on the combination of both datasets has the best performance. Furthermore, they extended DARTS DNN to other types of AS events, i.e., alternative 5′ or 3′ splice sites and retained introns, and also achieved a high prediction accuracy. Finally, they applied DARTS to investigate the change of AS pattern during the epithelial-mesenchymal transition (EMT) using a previously published RNA-seq dataset [6]. Using DARTS, they were not only able to predict high-confidence differential versus unchanged splicing events during the EMT, but also uncovered differential AS events from lowly expressed genes. Importantly, the latter could be successfully validated experimentally, again demonstrating the improved accuracy of DARTS on AS events with lower sampling depth.
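The input design described above, in which cis features are combined with RBP expression levels, can be sketched as a simple concatenation. The layout is an assumption made for illustration (cis features first, then RBP expression in each of the two conditions), the function name is hypothetical, and the toy vectors are far shorter than the 1498 RBPs used by DARTS.

```python
def build_input_vector(cis_features, rbp_expr_cond1, rbp_expr_cond2):
    """Concatenate event-level cis features with RBP expression in two conditions.

    Assumed layout: [cis features | RBP expression, condition 1 | RBP expression, condition 2].
    """
    if len(rbp_expr_cond1) != len(rbp_expr_cond2):
        raise ValueError("both conditions must report the same set of RBPs")
    return list(cis_features) + list(rbp_expr_cond1) + list(rbp_expr_cond2)


# Toy example: one exon (fixed cis features) compared across two pairs of conditions.
cis = [0.12, 0.80, 0.05]    # hypothetical sequence-derived features
rbp_control = [5.1, 0.3]    # hypothetical expression of two RBPs
rbp_treated = [1.2, 0.3]    # one RBP is down-regulated after treatment

same_conditions = build_input_vector(cis, rbp_control, rbp_control)
diff_conditions = build_input_vector(cis, rbp_control, rbp_treated)
```

The two vectors share identical cis features yet differ in the trans part, so a model trained on such inputs can in principle give different predictions for the same exon under different conditions.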
The major innovation of DARTS lies in two aspects. (1) DARTS combines a deep learning model with a Bayesian hierarchical framework: the former provides the latter with a prior based on learned knowledge about each AS event in a specific sample, while the latter further integrates the information from RNA-seq data. (2) The deep learning model within the DARTS framework for the first time takes both cis-elements and trans-factors into consideration, which improves differential AS detection between conditions.
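The division of labor in point (1) can be illustrated with Bayes' rule in odds form. The sketch below is not the DARTS BHT model, which is hierarchical and operates on read counts. It assumes, for illustration only, that the RNA-seq evidence has already been summarized as a single Bayes factor; the function name is hypothetical.

```python
def posterior_probability(prior, bayes_factor):
    """Posterior probability of differential splicing from a prior and a Bayes factor.

    prior:        probability of differential splicing before seeing the reads
                  (0.5 stands in for an uninformative prior; a model prediction
                  would be an informative one).
    bayes_factor: evidence from the reads, P(data | differential) / P(data | unchanged).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if bayes_factor < 0.0:
        raise ValueError("bayes_factor must be non-negative")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Weak evidence from a low-coverage event (Bayes factor close to 1).
weak_evidence = 1.5
flat = posterior_probability(0.5, weak_evidence)      # uninformative prior
informed = posterior_probability(0.9, weak_evidence)  # confident prior
```

When the reads say little, the posterior stays close to the prior, so an informative prior contributes most for low-coverage events. With strong evidence (a very large or very small Bayes factor), the data dominate.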
There are still some directions for further development of DARTS. First, although DARTS can theoretically capture cis-trans interactions, learning such associations requires a prohibitively large number of input combinations. Second, DARTS is trained on invariant genomic sequences from different samples, and thus cannot capture the splicing landscape of sequence variants. Third, the performance of DARTS may be further improved by incorporating longer flanking regions or more cis-features. However, obtaining a better model in this way requires more data and sophisticated feature engineering.
Other than AS, alternative polyadenylation (APA) is also a key, but less well studied, step in RNA processing. Conceptually, similar to AS, the regulation of APA is also mediated by cis-trans interactions. Therefore, APA regulation could be treated as a similar problem and accordingly investigated with a similar strategy. Xia et al. recently developed a robust, poly(A) signal (PAS) motif-agnostic, and transferable deep learning model to differentiate true PASs from false ones [7]. The ideas of DARTS could potentially be applied to combine the power of novel deep learning-based computational algorithms and RNA-seq-based experimental data for APA analysis.
The authors have declared no competing interests.
This work was supported by the Basic Research Grant (Grant No. JCYJ20170307105752508) from the Science and Technology Innovation Commission of Shenzhen Municipal Government, China and the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR), Saudi Arabia (Grant Nos. FCC/1/1976-04, URF/1/2602-01, URF/1/3007-01, URF/1/3412-01, URF/1/3450-01, and URF/1/3454-01).
Genomics, Proteomics & Bioinformatics, 2019, Issue 2