On a novel statistical method for isoform quantification using RNA-seq data
Indranil Mukhopadhyay (Indian Statistical Institute)
Friday 1st November 15:00-16:00 Maths 311B
Technological advances have triggered the generation of massively parallel genome-wide transcriptome data, known as RNA-seq data. A major problem in analysing such data is the correct quantification of transcripts. Difficulty in characterising the distribution of reads, ambiguity in mapping reads to the proper isoforms, and similar issues complicate the modeling and estimation of transcript abundance.
We develop a novel statistical method for estimating isoform-level abundance using a maximum likelihood approach under general conditions on the nature and distribution of reads. Our likelihood function is of multinomial type with indicators as latent variables. We adopt an EM algorithm to obtain exact estimates and, unlike existing methods, avoid approximations or plug-in estimates in maximizing the likelihood function. We have studied our method extensively using simulated and real datasets.
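To make the idea of a multinomial-type likelihood with latent indicators concrete, the following is a minimal sketch of a generic textbook EM iteration for a read-to-isoform mixture. It is not the speaker's method, whose details are not given in this abstract. The function name `em_isoform_abundance`, the compatibility matrix `compat`, and the toy data are all assumptions made for illustration. Here `theta` is the fraction of reads originating from each isoform, with no length normalisation.

```python
import numpy as np


def em_isoform_abundance(compat, n_iter=500, tol=1e-12):
    """Generic EM for mixture proportions (illustrative sketch only).

    compat[i, k] is an assumed, known probability of observing read i
    given that it came from isoform k (0 if the read is incompatible).
    Every read must be compatible with at least one isoform.
    """
    compat = np.asarray(compat, dtype=float)
    n_reads, n_iso = compat.shape
    theta = np.full(n_iso, 1.0 / n_iso)  # start from uniform proportions

    for _ in range(n_iter):
        # E-step: posterior probability that read i came from isoform k,
        # i.e. the expected value of the latent indicator.
        weighted = compat * theta
        resp = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: multinomial MLE given the expected indicators.
        new_theta = resp.sum(axis=0) / n_reads

        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Toy data (hypothetical): 3 reads unique to isoform 0, 1 read unique to
# isoform 1, and 2 reads that map equally well to both isoforms.
reads = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 1], [1, 1]]
theta = em_isoform_abundance(reads)
print(theta)
```

In this toy case the ambiguous reads carry no information about the split, so the estimate is driven by the uniquely mapped reads and settles near a 3:1 ratio between the two isoforms.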
We ran simulations under various models assuming different distributions of reads. Our method shows promising results and outperforms other methods significantly, especially when (1) the number of alternatively spliced isoforms is large, and (2) some isoforms are extremely low in abundance. Our method is also robust to the probability distribution of reads, more accurate and applicable even with a mixture of paired- and single-end reads, scalable with respect to memory allocation, and computationally very fast. It shows high correlation with qRT-PCR estimates when applied to a real dataset. Confidence intervals calculated using our method are narrower than those obtained from Cufflinks. Based on its performance on simulated and real datasets, we believe that it will be an extremely useful and feasible approach for practical use with real data.