Next-Generation Sequence Assembly: Four Stages of Data Processing and Computational Challenges

El-Metwally, Sara; Hamza, Taher; Zakaria, Magdi; Helmy, Mohamed
December 2013
PLoS Computational Biology;Dec2013, Vol. 9 Issue 12, p1
Academic Journal
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. We discuss these as a framework of four stages for data analysis and processing, and survey a variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that current assemblers face in the next-generation environment to determine the current state of the art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
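As a rough illustration of the four stages named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes a de Bruijn graph as the graph model and a simple k-mer count threshold as the preprocessing filter; both are choices made here for illustration, since the abstract does not name specific techniques. All function names and parameters (`preprocess`, `build_graph`, `simplify`, `postprocess`, `assemble`, `min_count`, `min_len`) are hypothetical.

```python
from collections import Counter, defaultdict


def preprocess(reads, k, min_count=2):
    """Stage 1 (preprocessing filtering): keep k-mers seen at least min_count times.

    Assumption: rare k-mers are treated as sequencing errors and dropped.
    """
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}


def build_graph(kmers):
    """Stage 2 (graph construction): de Bruijn graph whose nodes are (k-1)-mers."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def simplify(graph):
    """Stage 3 (graph simplification): merge unbranched paths into contigs.

    Isolated cycles with no branching node are ignored in this sketch.
    """
    indeg = Counter(v for targets in graph.values() for v in targets)
    nodes = set(graph) | set(indeg)

    def is_inner(n):
        # An inner node has exactly one predecessor and one successor.
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    contigs = []
    for start in nodes:
        if is_inner(start):
            continue
        for nxt in graph.get(start, ()):
            path = start + nxt[-1]
            while is_inner(nxt):
                nxt = next(iter(graph[nxt]))
                path += nxt[-1]
            contigs.append(path)
    return contigs


def postprocess(contigs, min_len):
    """Stage 4 (postprocessing filtering): drop short contigs, longest first."""
    return sorted((c for c in contigs if len(c) >= min_len), key=len, reverse=True)


def assemble(reads, k, min_count=2, min_len=None):
    """Run the four stages in order on a list of read strings."""
    kmers = preprocess(reads, k, min_count)
    graph = build_graph(kmers)
    contigs = simplify(graph)
    return postprocess(contigs, k if min_len is None else min_len)


# Toy input: two overlapping reads (each duplicated) plus one read with an error.
reads = ["ATGGCGT", "ATGGCGT", "GCGTGCA", "GCGTGCA", "ATGGAGT"]
contigs = assemble(reads, k=4)
```

In this toy input, the k-mers unique to the erroneous read occur only once, so the stage 1 filter removes them before the graph is built. Real assemblers use far more elaborate methods at every stage.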


Related Articles

  • Evaluation of Different Reference Based Annotation Strategies Using RNA-Seq - A Case Study in Drososphila pseudoobscura. Palmieri, Nicola; Nolte, Viola; Suvorov, Anton; Kosiol, Carolin; Schlötterer, Christian // PLoS ONE;Oct2012, Vol. 7 Issue 10, Special section p1 

    RNA-Seq is a powerful tool for the annotation of genomes, in particular for the identification of isoforms and UTRs. Nevertheless, several software tools exist and no standard strategy to obtain a reliable annotation is yet established. We tested different combinations of the most commonly used...

  • Compressed suffix tree—a basis for genome-scale sequence analysis. Niko Välimäki; Wolfgang Gerlach; Kashyap Dixit; Veli Mäkinen // Bioinformatics;Mar2007, Vol. 23 Issue 5, p629 

    Summary: Suffix tree is one of the most fundamental data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes the bottleneck. This is easily...

  • FunFrame: functional gene ecological analysis pipeline. Weisman, David; Yasuda, Michie; Bowen, Jennifer L. // Bioinformatics;May2013, Vol. 29 Issue 9, p1212 

    Summary: Pyrosequencing of 16S rDNA is widely used to study microbial communities, and a rich set of software tools support this analysis. Pyrosequencing of protein-coding genes, which can help elucidate functional differences among microbial communities, significantly lags behind 16S rDNA in...

  • Fast and accurate read alignment for resequencing. Mu, John C.; Jiang, Hui; Kiani, Amirhossein; Mohiyuddin, Marghoob; Bani Asadi, Narges; Wong, Wing H. // Bioinformatics;Sep2012, Vol. 28 Issue 18, p2366 

    Motivation: Next-generation sequence analysis has become an important task both in laboratory and clinical settings. A key stage in the majority of sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large...

  • Comparative genomics: Lining up is hard to do. Casci, Tanita // Nature Reviews Genetics;Aug2008, Vol. 9 Issue 8, p573 

    The article reports on the differences in DNA alignments. This paper highlights the differences in DNA line up stretches as the methods used vary between different phylogenetic history. Aligning multiple sequences is often a challenge as this may cause algorithmic flaws in DNA sequencing...

  • Classification of DNA sequences using Bloom filters. Stranneheim, Henrik; Käller, Max; Allander, Tobias; Andersson, Björn; Arvestad, Lars; Lundeberg, Joakim // Bioinformatics;Jul2010, Vol. 26 Issue 13, p1595 

    Motivation: New generation sequencing technologies producing increasingly complex datasets demand new efficient and specialized sequence analysis algorithms. Often, it is only the ‘novel’ sequences in a complex dataset that are of interest and the superfluous sequences need to be...

  • Comparing Segmentation Methods for Genome Annotation Based on RNA-Seq Data. Cleynen, Alice; Dudoit, Sandrine; Robin, Stéphane // Journal of Agricultural, Biological & Environmental Statistics;Mar2014, Vol. 19 Issue 1, p101 

    Transcriptome sequencing (RNA-Seq) yields massive data sets, containing a wealth of information on the expression of a genome. While numerous methods have been developed for the analysis of differential gene expression, little has been attempted for the localization of transcribed regions, that...

  • A Novel DNAZIP Tool for Zipping of Genome Sequences by Linear Bounded Data Structure. Prasad, V. Hari; Kumar, P. V. // International Journal of Computer Applications;May2012, Vol. 46, p9 

    Of late, due to excessive accumulation of genetic sequences, the need for vacuuming plays a predominant role in a database when it reaches its threshold. Vacuuming means that data should not be deleted physically but is superseded. The achieved data can be stored in secondary storage, we can access...

  • Efficient alignment of pyrosequencing reads for re-sequencing applications. Fernandes, Francisco; da Fonseca, Paulo G. S.; Russo, Luis M. S.; Oliveira, Arlindo L.; Freitas, Ana T. // BMC Bioinformatics;2011, Vol. 12 Issue 1, p163 

    Background: Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the...

