Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Yaniv Loewenstein; Elon Portugaly; Menachem Fromer; Michal Linial
July 2008
Bioinformatics;Jul2008, Vol. 24 Issue 13, pi41
Academic Journal
Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il


Related Articles

  • Computational structural and functional characterization of protein family: Key for the hidden mystery. Umang, Malhotra; Astha, Jaiswal; Aastha, Chhabra; Neha, Atale; Vibha, Rani // Journal of Pharmacy Research;2012, Vol. 5 Issue 7, p3643 

    With the accretion of raw biological data in the form of unannotated protein sequences, biologists are seeking solutions to unravel the functions of proteins which have not yet been characterized experimentally. The numerous bioinformatics tools available can be used to gain an insight into...

  • HangOut: generating clean PSI-BLAST profiles for domains with long insertions. Kim, Bong-Hyun; Cong, Qian; Grishin, Nick V. // Bioinformatics;Jun2010, Vol. 26 Issue 12, p1564 

    Summary: Profile-based similarity search is an essential step in structure-function studies of proteins. However, inclusion of non-homologous sequence segments into a profile causes its corruption and results in false positives. Profile corruption is common in multidomain proteins, and single...

  • ProtSA: a web application for calculating sequence specific protein solvent accessibilities in the unfolded ensemble. Estrada, Jorge; Bernadó, Pau; Blackledge, Martin; Sancho, Javier // BMC Bioinformatics;2009, Vol. 10, Special section p1 

    Background: The stability of proteins is governed by the heat capacity, enthalpy and entropy changes of folding, which are strongly correlated to the change in solvent accessible surface area experienced by the polypeptide. While the surface exposed in the folded state can be easily determined,...

  • PSP_MCSVM: brainstorming consensus prediction of protein secondary structures using two-stage multiclass support vector machines. Chatterjee, Piyali; Basu, Subhadip; Kundu, Mahantapas; Nasipuri, Mita; Plewczynski, Dariusz // Journal of Molecular Modeling;Sep2011, Vol. 17 Issue 9, p2191 

    Secondary structure prediction is a crucial task for understanding the variety of protein structures and performed biological functions. Prediction of secondary structures for new proteins using their amino acid sequences is of fundamental importance in bioinformatics. We propose a novel...

  • A Functional Proteomic Approach to the Identification and Characterization of Protein Composition in Wheat Leaf. Jung-Feng Hsieh; Shui-Tein Chen // Current Proteomics;Dec2008, Vol. 5 Issue 4, p253 

    Proteomics and bioinformatics approach were applied for the analyzing of wheat leaf proteins' composition and function. Wheat proteins were precipitated by ammonium sulfate and analyzed by two-dimensional gel electrophoresis and mass spectrometry. A total of 200 wheat proteins were selected to...

  • Low-homology protein threading. Jian Peng; Jinbo Xu // Bioinformatics;Jun2010, Vol. 26 Issue 12, pi294 

    Motivation: The challenge of template-based modeling lies in the recognition of correct templates and generation of accurate sequence-template alignments. Homologous information has proved to be very powerful in detecting remote homologs, as demonstrated by the state-of-the-art profile-based...

  • Accessibility and partner number of protein residues, their relationship and a webserver, ContPlot for their display. Pal, Arumay; Bahadur, Ranjit Prasad; Ray, Partha Sarathi; Chakrabarti, Pinak // BMC Bioinformatics;2009, Vol. 10, Special section p1 

    Background: Depending on chemical features residues have preferred locations -- interior or exterior -- in protein structures, which also determine how many other residues are found around them. The close packing of residues is the hallmark of protein interior and protein-protein interaction...

  • Motivated Proteins: A web application for studying small three-dimensional protein motifs. Leader, David P.; Milner-White, E. James // BMC Bioinformatics;2009, Vol. 10, Special section p1 

    Background: Small loop-shaped motifs are common constituents of the three-dimensional structure of proteins. Typically they comprise between three and seven amino acid residues, and are defined by a combination of dihedral angles and hydrogen bonding partners. The most abundant of these are...

  • High quality protein sequence alignment by combining structural profile prediction and profile alignment using SABER-TOOTH. Teichert, Florian; Minning, Jonas; Bastolla, Ugo; Porto, Markus // BMC Bioinformatics;2010, Vol. 11, p251 

    Background: Protein alignments are an essential tool for many bioinformatics analyses. While sequence alignments are accurate for proteins of high sequence similarity, they become unreliable as they approach the so-called 'twilight zone' where sequence similarity gets indistinguishable from...


Read the Article


Sorry, but this item is not currently available from your library.

Try another library?
Sign out of this library

Other Topics