Comparing Segmentation Methods for Genome Annotation Based on RNA-Seq Data

Cleynen, Alice; Dudoit, Sandrine; Robin, Stéphane
March 2014
Journal of Agricultural, Biological & Environmental Statistics;Mar2014, Vol. 19 Issue 1, p101
Academic Journal
Transcriptome sequencing (RNA-Seq) yields massive data sets, containing a wealth of information on the expression of a genome. While numerous methods have been developed for the analysis of differential gene expression, little has been attempted for the localization of transcribed regions, that is, segments of DNA that are transcribed and processed to result in a mature messenger RNA. Our understanding of genomes, mostly annotated from biological experiments or computational gene prediction methods, could benefit greatly from re-annotation using the high precision of RNA-Seq. We consider five classes of genome segmentation methods to delineate transcribed regions, including intron/exon boundaries, based on RNA-Seq data. The methods provide different functionality and include both exact and heuristic approaches, using diverse models, such as hidden Markov or Bayesian models, and diverse algorithms, such as dynamic programming or the forward-backward algorithm. We evaluate the methods in a simulation study where RNA-Seq read counts are generated from parametric models as well as by resampling of actual yeast RNA-Seq data. The methods are compared in terms of criteria that include global and local fit to a reference segmentation, Receiver Operator Characteristic (ROC) curves, and coverage of credibility intervals based on posterior change-point distributions. All compared algorithms are implemented in packages available on the Comprehensive R Archive Network (CRAN). The data set used in the simulation study is publicly available from the Sequence Read Archive (SRA). While the different methods each have pros and cons, our results suggest that the EBS Bayesian approach of Rigaill, Lebarbier, and Robin performs well in a re-annotation context, as illustrated in the simulation study and in the application to actual yeast RNA-Seq data. This article has supplementary material online.
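To make the idea of exact segmentation by dynamic programming concrete, here is a minimal Python sketch. It is not one of the five methods compared in the article and not the code of any CRAN package. It assumes a piecewise-constant Poisson model for the read counts and a known number of segments `k`; the function names `segment_cost` and `segment` and the toy counts are hypothetical, chosen for illustration only.

```python
import math


def segment_cost(prefix, i, j):
    """Poisson negative log-likelihood of counts[i:j] under one common mean.

    The sum of log(y!) terms is dropped because it is identical for every
    segmentation and does not affect the optimum.
    """
    total = prefix[j] - prefix[i]
    n = j - i
    if total == 0:
        return 0.0
    return total - total * math.log(total / n)


def segment(counts, k):
    """Return the k-1 change points of the best split of counts into k segments.

    A change point is the index at which a new segment starts.
    Runs in O(k * n^2) time (assumed acceptable for a toy example).
    """
    n = len(counts)
    prefix = [0]
    for y in counts:
        prefix.append(prefix[-1] + y)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting counts[:j] into m segments
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                cost = best[m - 1][i] + segment_cost(prefix, i, j)
                if cost < best[m][j]:
                    best[m][j] = cost
                    back[m][j] = i

    # Trace back the start index of each segment, last segment first.
    starts = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    return sorted(starts)[1:]  # drop the start of the first segment (index 0)


# Toy read counts: low, high ("transcribed"), low.
counts = [1, 0, 2, 1, 1, 20, 22, 19, 21, 2, 1, 0, 1]
print(segment(counts, 3))  # [5, 9]
```

In this toy case the two change points bracket the high-count stretch, which plays the role of a transcribed region. Choosing `k` and quantifying the uncertainty of each change point are separate problems that this sketch does not address.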


Related Articles

  • Compressed suffix tree—a basis for genome-scale sequence analysis. Niko Välimäki; Wolfgang Gerlach; Kashyap Dixit; Veli Mäkinen // Bioinformatics;Mar2007, Vol. 23 Issue 5, p629 

    Summary: Suffix tree is one of the most fundamental data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes the bottleneck. This is easily...

  • A method for enhancement of short read sequencing alignment with Bayesian inference. Weixing Feng; Fengfei Song; Yansheng Dong; Bo He // Journal of Chemical & Pharmaceutical Research;2013, Vol. 5 Issue 11, p200 

    Next-generation short read sequencing is widely utilized in genome wide association study. However, as an indirect measurement technique, short read sequencing requires alignment step to map all sequencing reads to reference genome before acquiring interested genomic information. Facing to huge...

  • Discriminative motif discovery in DNA and protein sequences using the DEME algorithm. Redhead, Emma; Bailey, Timothy L. // BMC Bioinformatics;2007 Supplement 2, Vol. 8, p385 

    Background: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only...

  • Comparative genomics: Lining up is hard to do. Casci, Tanita // Nature Reviews Genetics;Aug2008, Vol. 9 Issue 8, p573 

    The article reports on the differences in DNA alignments. This paper highlights the differences in DNA line up stretches as the methods used vary between different phylogenetic history. Aligning multiple sequences often take on a challenge as this may cause algorithmic flaws in DNA sequencing...

  • Classification of DNA sequences using Bloom filters. Stranneheim, Henrik; Käller, Max; Allander, Tobias; Andersson, Björn; Arvestad, Lars; Lundeberg, Joakim // Bioinformatics;Jul2010, Vol. 26 Issue 13, p1595 

    Motivation: New generation sequencing technologies producing increasingly complex datasets demand new efficient and specialized sequence analysis algorithms. Often, it is only the ‘novel’ sequences in a complex dataset that are of interest and the superfluous sequences need to be...

  • A Novel DNAZIP Tool for Zipping of Genome Sequences by Linear Bounded Data Structure. Prasad, V. Hari; Kumar, P. V. // International Journal of Computer Applications;May2012, Vol. 46, p9 

    Of late due to excessive accumulation of genetic sequences need of vacuuming plays predominant role in database when it reaches to its threshold. Vacuuming refers to data should not be deleted physically but it is superseded. The achieved data can be stored in secondary storage, we can access...

  • Efficient alignment of pyrosequencing reads for re-sequencing applications. Fernandes, Francisco; da Fonseca, Paulo G. S.; Russo, Luis M. S.; Oliveira, Arlindo L.; Freitas, Ana T. // BMC Bioinformatics;2011, Vol. 12 Issue 1, p163 

    Background: Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the...

  • Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test. Abnizova, Irina; te Boekhorst, Rene; Walter, Klaudia; Gilks, Walter R. // BMC Bioinformatics;2005, Vol. 6, p1 

    Background: This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information. Results: We present a novel statistical...

  • MSuPDA: A Memory Efficient Algorithm for Sequence Alignment. Khan, Mohammad; Kamal, Md.; Chowdhury, Linkon // Interdisciplinary Sciences: Computational Life Sciences;Mar2016, Vol. 8 Issue 1, p84 

    Space complexity is a million dollar question in DNA sequence alignments. In this regard, memory saving under pushdown automata can help to reduce the occupied spaces in computer memory. Our proposed process is that anchor seed (AS) will be selected from given data set of nucleotide base pairs...

