Partitioning clustering algorithms for protein sequence data sets

Fayech, Sondes; Essoussi, Nadia; Limam, Mohamed
January 2009
BioData Mining;2009, Vol. 2, p1
Academic Journal
Background: Genome-sequencing projects are producing an enormous number of new sequences, causing protein sequence databases to grow rapidly. The unsupervised classification of these data into functional groups or families (clustering) has become one of the principal research objectives in structural and functional genomics, and computer programs that classify sequences into families automatically and accurately have become a necessity. Many methods have addressed the clustering of protein sequences, and most fall into three major groups: hierarchical, graph-based, and partitioning methods. Of these, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are used extensively in other fields, few applications have been reported for protein sequence clustering. It has not been fully demonstrated whether partitioning methods can be applied to protein sequence data, or whether they are efficient compared with published clustering methods.

Methods: We developed four partitioning clustering approaches that use the Smith-Waterman local alignment algorithm to determine pairwise similarities between sequences. Four different sets of protein sequences were used as evaluation data sets for the proposed methods.

Results: We show that these methods outperform several other published clustering methods in terms of correctly predicting a classifier and especially in terms of the correctness of the provided prediction. The software is available to academic users from the authors upon request.
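The approach described under Methods, pairwise Smith-Waterman similarities fed into a partitioning algorithm, can be illustrated with a minimal sketch. The abstract does not name the four partitioning approaches, so the k-medoids-style loop below is only a generic stand-in and not the authors' method. The match, mismatch, and gap scores are likewise assumptions chosen for brevity; a real protein comparison would normally use a substitution matrix and affine gap penalties. All function names are hypothetical.

```python
from typing import List, Sequence


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (assumed simple scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_matrix(seqs: Sequence[str]) -> List[List[int]]:
    """Symmetric matrix of pairwise local alignment scores."""
    n = len(seqs)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = smith_waterman_score(seqs[i], seqs[j])
    return sim


def k_medoids(sim: List[List[int]], medoids: Sequence[int],
              max_iter: int = 20) -> List[int]:
    """Generic partitioning loop: assign each sequence to its most similar
    medoid, then pick as new medoid the member with the highest total
    similarity to its cluster. Stops when the medoids no longer change."""
    n = len(sim)
    medoids = list(medoids)
    labels: List[int] = []
    for _ in range(max_iter):
        labels = [max(range(len(medoids)), key=lambda c: sim[i][medoids[c]])
                  for i in range(n)]
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                max(members, key=lambda m: sum(sim[m][x] for x in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


# Toy input (made-up sequences, not data from the study).
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSSLLPPWW", "GGSSLLPPWA"]
toy_labels = k_medoids(similarity_matrix(toy), medoids=[0, 2])
```

Because a partitioning method of this kind only needs a similarity matrix and a choice of representatives, it avoids building a full hierarchy or graph, which is one reason such methods are attractive for large sequence sets.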


Related Articles

  • Light-speed genomics. Eisenstein, Michael // Nature Methods;Sep2005, Vol. 2 Issue 9, p646 

    A pico-scale reaction system serves as the basis for a new generation of high-throughput sequencing machines that promise to bring genome-center power down to the laboratory level.

  • Reference-Free Validation of Short Read Data. Schröder, Jan; Bailey, James; Conway, Thomas; Zobel, Justin // PLoS ONE;2010, Vol. 5 Issue 9, p1 

    Background: High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties...

  • Genomics: The personal side of genomics. Blow, Nathan // Nature;10/4/2007, Vol. 449 Issue 7162, p627 

    The article presents information on personal genomics, which is driven by DNA sequencing. It reports that different kinds of DNA-sequencing systems have been launched into the field of genetic-analysis. It informs that five companies are preparing to offer or are offering sequencers that are...

  • Dog Genome Assembled.  // Science Teacher;Sep2004, Vol. 71 Issue 7, p12 

    Reports on research in the U.S. which revealed that the first draft of the dog genome sequence has been deposited into free public databases for use by biomedical and veterinary researchers. Analyses of dog breeds; Identification of the boxer as one of the breeds with the least amount of...

  • The First Steps of Transposable Elements Invasion: Parasitic Strategy vs. Genetic Drift. Le Rouzic, Arnaud; Capy, Pierre // Genetics;Feb2005, Vol. 169 Issue 2, p1033 

    Transposable elements are often considered as selfish DNA sequences able to invade the genome of their host species. Their evolutive dynamics are complex, due to the interaction between their intrinsic amplification capacity, selection at the host level, transposition regulation, and genetic...

  • A novel approach for identifying candidate imprinted genes through sequence analysis of imprinted and control genes. Xiayi Ke; Thomas, N. Simon; Robinson, David O.; Collins, Andrew // Human Genetics;Dec2002, Vol. 111 Issue 6, p511 

    Through the sequence analysis of 27 imprinted human genes and a set of 100 control genes we have developed a novel approach for identifying candidate imprinted genes based on the differences in sequence composition observed. The imprinted genes were found to be associated with significantly...

  • upstream:. Schlegel, Rolf H. J. // Encyclopedic Dictionary of Plant Breeding & Related Subjects;2003, p426 

    A definition of the term "upstream" is presented. The term is used for description of the position of a DNA sequence within a DNA or protein molecule. It means that the position of the sequence lies away from the direction of the synthesis of a DNA

  • Genomics: Truth and accuracy.  // Nature;10/4/2007, Vol. 449 Issue 7162, p628 

    The article discusses the truth and accuracy of the various methods of environmental sampling of nucleic-acid sequences in the study of genomics, with a particular focus on the research conducted by Mitch Sogin, director of the Josephine Bay Paul Center for Comparative Molecular Biology and...

  • From Genome Sequence to Genome Understanding. Lombardi, Steve // PharmaGenomics;Sep2004 Supplement, Vol. 4, p25 

    This article describes the manufacturing methods and probe set strategies required to develop microarrays that can be used successfully in these experiments. Inexpensive, reliable and automated DNA sequencing methods allowed scientists to sequence the complete genomes of organisms, ranging from...

