HIA: a genome mapper using hybrid index-based sequence alignment

Jongpill Choi; Kiejung Park; Seong Beom Cho; Myungguen Chung
December 2015
Algorithms for Molecular Biology;12/23/2015, Vol. 10, p1
Academic Journal
Background: A number of alignment tools have been developed to align sequencing reads to the human reference genome. The scale of information from next-generation sequencing (NGS) experiments, however, is increasing rapidly. Recent NGS-based studies have routinely produced exome or whole-genome sequences from several hundred to thousands of samples. To meet the growing need to analyze very large NGS data sets, faster, more sensitive and more accurate mapping tools are needed. Results: HIA uses two indices, a hash table index and a suffix array index. The hash table performs direct lookup of a q-gram, and the suffix array performs very fast lookup of variable-length strings by exploiting binary search. We observed that combining the hash table and the suffix array (a hybrid index) is much faster than the suffix array alone for finding a substring in the reference sequence. We define a matching region (MR) as a longest common substring between a reference and a read, and a candidate alignment region (CAR) as a list of MRs that are close to each other. The hybrid index is used to find CARs between a reference and a read. We found that aligning only the unmatched regions in a CAR is much faster than aligning the whole CAR. In benchmark analysis, HIA outperformed the other aligners in mapping speed, without significant loss of mapping accuracy. Conclusions: Our experiments show that the hybrid of hash table and suffix array is useful in terms of speed for mapping NGS reads to the human reference genome sequence. Our tool is therefore appropriate for aligning the massive data sets generated by NGS.
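
The hybrid lookup described in the abstract can be illustrated with a minimal Python sketch. It is not HIA's implementation: the abstract does not say how the two indices are connected, so the sketch assumes the hash table maps each q-gram to the interval of suffix array rows that start with it, and binary search then runs only inside that interval. The function names (`build_suffix_array`, `build_qgram_buckets`, `find`) and the toy reference string are hypothetical, and the naive suffix array construction is suitable only for small examples.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; fine for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_qgram_buckets(text, sa, q):
    # ASSUMPTION: the hash table stores, for each q-gram, the half-open range
    # [lo, hi) of suffix array rows whose suffix starts with that q-gram.
    # Such rows are contiguous because the suffixes are sorted.
    buckets = {}
    for row, pos in enumerate(sa):
        gram = text[pos:pos + q]
        if len(gram) < q:
            continue  # suffix shorter than q has no q-gram
        lo, _ = buckets.get(gram, (row, row))
        buckets[gram] = (lo, row + 1)
    return buckets


def _bound(text, sa, pattern, lo, hi, strict):
    # Binary search in sa[lo:hi] for the first row whose length-m prefix is
    # >= pattern (strict=False) or > pattern (strict=True).
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (strict and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find(text, sa, buckets, q, pattern):
    # Return the sorted start positions of every occurrence of pattern.
    if not pattern:
        return []
    if len(pattern) >= q:
        # Direct hash lookup narrows the search to one q-gram interval.
        lo, hi = buckets.get(pattern[:q], (0, 0))
    else:
        # Pattern shorter than q: fall back to the whole suffix array.
        lo, hi = 0, len(sa)
    start = _bound(text, sa, pattern, lo, hi, strict=False)
    end = _bound(text, sa, pattern, lo, hi, strict=True)
    return sorted(sa[start:end])


reference = "ACGTACGTTACG"  # hypothetical toy reference
q = 3
sa = build_suffix_array(reference)
buckets = build_qgram_buckets(reference, sa, q)
hits = find(reference, sa, buckets, q, "ACGT")
```

In this sketch the hash lookup replaces the first steps of the binary search, so the remaining search covers only the suffixes that share the pattern's first q characters. Whether this matches the speedup mechanism in HIA itself is not stated in the abstract.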


Related Articles

  • Development of the clinical next-generation sequencing industry in a shifting policy climate. Curnutte, Margaret A; Frumovitz, Karen L; Bollinger, Juli M; McGuire, Amy L; Kaufman, David J // Nature Biotechnology;Oct2014, Vol. 32 Issue 10, p980 

    The article discusses the development of next-generation sequencing (NGS) technologies, which are increasingly being integrated into clinical practice. An agreement with the NGS industry over overly keen regulation would hinder the development of clinical NGS in the short term. However,...

  • Faster sequence homology searches by clustering subsequences. Shuji Suzuki; Masanori Kakuta; Takashi Ishida; Yutaka Akiyama // Bioinformatics;4/1/2015, Vol. 31 Issue 7, p1183 

    Motivation: Sequence homology searches are used in various fields. New sequencing technologies produce huge amounts of sequence data, which continuously increase the size of sequence databases. As a result, homology searches require large amounts of computational time, especially for metagenomic...

  • Halvade: scalable sequence analysis with MapReduce. Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan // Bioinformatics;8/1/2015, Vol. 31 Issue 15, p2482 

    Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. Results: We present Halvade, a framework that...

  • MSA-PAD: DNA multiple sequence alignment framework based on PFAM accessed domain information. Balech, Bachir; Vicario, Saverio; Donvito, Giacinto; Monaco, Alfonso; Notarangelo, Pasquale; Pesole, Graziano // Bioinformatics;8/1/2015, Vol. 31 Issue 15, p2571 

    Here we present the MSA-PAD application, a DNA multiple sequence alignment framework that uses PFAM protein domain information to align DNA sequences encoding either single or multiple protein domains. MSA-PAD has two alignment options: gene and genome mode.

  • A method for enhancement of short read sequencing alignment with Bayesian inference. Weixing Feng; Fengfei Song; Yansheng Dong; Bo He // Journal of Chemical & Pharmaceutical Research;2013, Vol. 5 Issue 11, p200 

    Next-generation short-read sequencing is widely used in genome-wide association studies. However, as an indirect measurement technique, short-read sequencing requires an alignment step to map all sequencing reads to a reference genome before the genomic information of interest can be acquired. Facing huge...

  • Lambda: the local aligner for massive biological data. Hauswedell, Hannes; Singer, Jochen; Reinert, Knut // Bioinformatics;Sep2014, Vol. 30 Issue 17, pi349 

    Motivation: Next-generation sequencing technologies produce unprecedented amounts of data, leading to completely new research fields. One of these is metagenomics, the study of large-size DNA samples containing a multitude of diverse organisms. A key problem in metagenomics is to functionally...

  • BitPAl: a bit-parallel, general integer-scoring sequence alignment algorithm. Loving, Joshua; Hernandez, Yozen; Benson, Gary // Bioinformatics;Nov2014, Vol. 30 Issue 22, p3166 

    Motivation: Mapping of high-throughput sequencing data and other bulk sequence comparison applications have motivated a search for high-efficiency sequence alignment algorithms. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and...

  • Fast construction of FM-index for long sequence reads. Li, Heng // Bioinformatics;Nov2014, Vol. 30 Issue 22, p3274 

    Summary: We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a...

  • GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Shunichi Kosugi; Hideki Hirakawa; Satoshi Tabata // Bioinformatics;12/1/2015, Vol. 31 Issue 23, p3733 

    Motivation: Genome assemblies generated with next-generation sequencing (NGS) reads usually contain a number of gaps. Several tools have recently been developed to close the gaps in these assemblies with NGS reads. Although these gap-closing tools efficiently close the gaps, they entail a high...

