Aligning Sequences by Minimum Description Length

Conery, John S.
January 2007
EURASIP Journal on Bioinformatics & Systems Biology;2007, p1
Academic Journal
This paper presents a new information-theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
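
As a rough illustration of the code-length idea described in the abstract, the Python sketch below compares the cost in bits of sending two substrings independently with the cost of sending the second one conditioned on the first. The DNA alphabet, the probability values, and the function names are illustrative assumptions, not taken from the paper, and the paper's encoding of regular expressions is not reproduced here.

```python
import math

# Hypothetical uniform background frequencies for a DNA alphabet (assumption).
BACKGROUND = {letter: 0.25 for letter in "ACGT"}

# Hypothetical conditional probabilities P(y | x): 0.85 for a match and
# 0.05 for each of the three mismatches (assumption, stands in for the
# probabilities a real substitution matrix would define).
CONDITIONAL = {
    x: {y: (0.85 if x == y else 0.05) for y in "ACGT"} for x in "ACGT"
}


def independent_bits(s):
    """Bits needed to send s using background frequencies only."""
    return sum(-math.log2(BACKGROUND[c]) for c in s)


def aligned_bits(x, y):
    """Bits needed to send x, then y letter by letter conditioned on x."""
    if len(x) != len(y):
        raise ValueError("this sketch only handles ungapped, equal-length substrings")
    conditional_cost = sum(-math.log2(CONDITIONAL[a][b]) for a, b in zip(x, y))
    return independent_bits(x) + conditional_cost


def prefer_alignment(x, y):
    """True when the aligned encoding is shorter than two independent encodings."""
    return aligned_bits(x, y) < independent_bits(x) + independent_bits(y)


if __name__ == "__main__":
    print(prefer_alignment("ACGT", "ACGT"))  # similar substrings
    print(prefer_alignment("ACGT", "CATG"))  # dissimilar substrings
```

Under these assumed probabilities, similar substrings cost fewer bits when encoded together, while dissimilar substrings are cheaper to send separately, which is the selection criterion a minimum description length approach relies on.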


Related Articles

  • ZifBASE: a database of zinc finger proteins and associated resources. Jayakanthan, Mannu; Muthukumaran, Jayaraman; Chandrasekar, Sanniyasi; Chawla, Konika; Punetha, Ankita; Sundar, Durai // BMC Genomics;2009, Vol. 10, p421 

    Background: Information on the occurrence of zinc finger protein motifs in genomes is crucial to the developing field of molecular genome engineering. The knowledge of their target DNA-binding sequences is vital to develop chimeric proteins for targeted genome engineering and site-specific gene...

  • Prediction of disease-related mutations affecting protein localization. Laurila, Kirsti; Vihinen, Mauno // BMC Genomics;2009, Vol. 10, Special section p1 

    Background: Eukaryotic cells contain numerous compartments, which have different protein constituents. Proteins are typically directed to compartments by short peptide sequences that act as targeting signals. Translocation to the proper compartment allows a protein to form the necessary...

  • Alternative translation start sites and their significance for eukaryotic proteomes. Kochetov, A. V. // Molecular Biology;Sep2006, Vol. 40 Issue 5, p705 

    The review is dedicated to the current notions of translation initiation and the contextual organization of eukaryotic mRNA leader regions. A hypothesis on the frequent usage of several alternative start codons is discussed. A potential contribution of alternative translation start sites to the...

  • Differential expression of two β-amylase genes (Bmy1 and Bmy2) in developing and mature barley grain. Vinje, Marcus A.; Willis, David K.; Duke, Stanley H.; Henson, Cynthia A. // Planta;Apr2011, Vol. 233 Issue 5, p1001

    Two barley (Hordeum vulgare L.) β-amylase genes (Bmy1 and Bmy2) were studied during the late maturation phase of grain development in four genotypes. The Bmy1 and Bmy2 DNA and amino acid sequences are extremely similar. The largest sequence differences are in the introns, seventh exon, and...

  • Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. Raghava, Gajendra P. S.; Han, Joon H. // BMC Bioinformatics;2005, Vol. 6, p59 

    Background: A large number of papers have been published on the analysis of microarray data, with particular emphasis on normalization of data, detection of differentially expressed genes, clustering of genes, and regulatory networks. On the other hand, there are only a few studies on the relation between...

  • Influence of intron length on interaction characters between post-spliced intron and its CDS in ribosomal protein genes. Zhao, Xiaoqing; Li, Hong; Bao, Tonglaga; Ying, Zhiqiang // AIP Conference Proceedings;Sep2012, Vol. 1479 Issue 1, p1564 

    Much experimental evidence has shown that the sequence structures of introns and intron loss/gain can influence gene expression, but current mechanisms do not refer directly to the functions of post-spliced introns. We propose that post-spliced introns play their functions in gene expression by...

  • Convergence pattern studies of folate receptor. Ramamoorthy, Kalidoss; Verma, Rama Shanker // Bioinformation;2008, Vol. 3 Issue 4, p168 

    Gene patterns and sequences of folic acid synthesizing genes that converged as meaningful patterns during evolution in the higher eukaryotes have been identified using sequence alignment and pattern analysis. Based on this finding, we postulate that some of the genes that are involved in...

  • SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. Falgueras, Juan; Lara, Antonio J.; Fernández-Pozo, Noé; Cantón, Francisco R.; Pérez-Trabado, Guillermo; Claros, M. Gonzalo // BMC Bioinformatics;2010, Vol. 11, p38 

    Background: High-throughput automated sequencing has enabled an exponential growth rate of sequencing data. This requires increasing sequence quality and reliability in order to avoid database contamination with artefactual sequences. The arrival of pyro-sequencing enhances this problem and...

  • Predicting protein-protein interactions in unbalanced data using the primary structure of proteins. Chi-Yuan Yu; Lih-Ching Chou; Chang, Darby Tien-Hao // BMC Bioinformatics;2010, Vol. 11, p167 

    Background: Elucidating protein-protein interactions (PPIs) is essential to constructing protein interaction networks and facilitating our understanding of the general principles of biological systems. Previous studies have revealed that interacting protein pairs can be predicted by their...

