Searching Remote Homology with Spectral Clustering with Symmetry in Neighborhood Cluster Kernels

Maulik, Ujjwal; Sarkar, Anasua
February 2013
PLoS ONE;Feb2013, Vol. 8 Issue 2, p1
Academic Journal
Remote homology detection among proteins using only unlabelled sequences is a central problem in comparative genomics. The existing cluster kernel methods based on neighborhoods and profiles, and the Markov clustering algorithms, are currently the most popular methods for protein family recognition. Their deviation from random walks with inflation, or their dependency on a hard threshold in the similarity measure, calls for an enhancement for homology detection among multi-domain proteins. We propose to combine spectral clustering with neighborhood kernels in Markov similarity to enhance sensitivity in detecting homology independent of “recent” paralogs. The spectral clustering approach with new combined local alignment kernels exploits the unsupervised protein sequences more effectively, globally reducing inter-cluster walks. When combined with corrections based on a modified symmetry-based proximity norm that de-emphasizes outliers, the technique proposed in this article outperforms other state-of-the-art cluster kernels among all twelve implemented kernels. The comparison with the state-of-the-art string and mismatch kernels also shows the superior performance scores provided by the proposed kernels. A similar performance improvement is also found on an existing large dataset. Therefore, the proposed spectral clustering framework over combined local alignment kernels with modified symmetry-based correction achieves superior performance for unsupervised remote homolog detection, even in multi-domain and promiscuous domain proteins from Genolevures database families, with better biological relevance. Source code is available upon request. Contact: sarkar@labri.fr.
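As a rough illustration of the general idea of spectral clustering over a pairwise similarity (kernel) matrix, the sketch below performs a generic two-way spectral cut. It is not the authors' method: it includes no neighborhood kernel, no local alignment scores, and no symmetry-based correction. The function name `spectral_bipartition` and the toy similarity matrix `W` are assumptions made up for this example.

```python
import math

def spectral_bipartition(W, iters=500):
    """Generic two-way spectral cut (illustrative only, not the paper's algorithm).

    W is a symmetric, non-negative similarity matrix given as a list of lists.
    Items are split by the sign of the second eigenvector of the normalized
    affinity matrix S = D^-1/2 W D^-1/2.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    # Symmetric normalization: S[i][j] = W[i][j] / sqrt(d_i * d_j)
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of S (eigenvalue 1) is proportional to sqrt(deg).
    total = math.sqrt(sum(deg))
    top = [math.sqrt(d) / total for d in deg]
    # Deterministic start vector; power iteration on (S + I) / 2 with the
    # leading eigenvector projected out converges to the second eigenvector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        v = [0.5 * (sum(S[i][j] * v[j] for j in range(n)) + v[i])
             for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    # Fix the sign so that item 0 always receives label 0.
    sign = 1.0 if v[0] >= 0 else -1.0
    return [0 if sign * x >= 0 else 1 for x in v]

# Toy similarity matrix (made-up values): two groups of three "sequences",
# strongly similar within a group and weakly similar across groups.
W = [[1.0 if (i < 3) == (j < 3) else 0.05 for j in range(6)] for i in range(6)]
labels = spectral_bipartition(W)
print(labels)
```

On this toy matrix the two blocks should fall into separate groups. A real application would use more than two clusters and a kernel matrix derived from sequence comparisons, neither of which is shown here.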


Related Articles

  • Comprehensive Repertoire of Foldable Regions within Whole Genomes. Faure, Guilhem; Callebaut, Isabelle // PLoS Computational Biology;Oct2013, Vol. 9 Issue 10, p1 

    In order to get a comprehensive repertoire of foldable domains within whole proteomes, including orphan domains, we developed a novel procedure, called SEG-HCA. From only the information of a single amino acid sequence, SEG-HCA automatically delineates segments possessing high densities in...

  • Entropy-driven partitioning of the hierarchical protein space. Rappoport, Nadav; Stern, Amos; Linial, Nathan; Linial, Michal // Bioinformatics;Sep2014, Vol. 30 Issue 17, pi624 

    Motivation: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all...

  • Fast String Kernels using Inexact Matching for Protein Sequences. Leslie, Christina; Rui Kuang; Bennett, Kristin // Journal of Machine Learning Research;11/1/2004, Vol. 5 Issue 9, p1435 

    We describe several families of k-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels -- restricted gappy kernels, substitution kernels, and wildcard...

  • DendroBLAST: Approximate Phylogenetic Trees in the Absence of Multiple Sequence Alignments. Kelly, Steven; Maini, Philip K. // PLoS ONE;Mar2013, Vol. 8 Issue 3, p1 

    The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method...

  • The Relationship between Gene Isoform Multiplicity, Number of Exons and Protein Divergence. Morata, Jordi; Béjar, Santi; Talavera, David; Riera, Casandra; Lois, Sergio; de Xaxars, Gemma Mas; de la Cruz, Xavier // PLoS ONE;Aug2013, Vol. 8 Issue 8, p1 

    At present we know that phenotypic differences between organisms arise from a variety of sources, like protein sequence divergence, regulatory sequence divergence, alternative splicing, etc. However, we do not have yet a complete view of how these sources are related. Here we address this...

  • Combining heterogeneous data sources for accurate functional annotation of proteins. Sokolov, Artem; Funk, Christopher; Graim, Kiley; Verspoor, Karin; Ben-Hur, Asa // BMC Bioinformatics;2013, Vol. 14 Issue S3, p1 

    Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension...

  • A functional hierarchical organization of the protein sequence space. Kaplan, Noam; Friedlich, Moriah; Fromer, Menachem; Linial, Michal // BMC Bioinformatics;2004, Vol. 5, p196 

    Background: It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually-validated alignments of known protein families. Such methods can achieve high...

  • Incremental window-based protein sequence alignment algorithms. Huzefa Rangwala // Bioinformatics;Jan2007, Vol. 23 Issue 2, pe17 

    Motivation: Protein sequence alignment plays a critical role in computational biology as it is an integral part in many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling. Methods: We have developed novel sequence alignment...

  • Closing the side-chain gap in protein loop modeling. Rossi, Karen A.; Nayeem, Akbar; Weigelt, Carolyn A.; Krystek Jr., Stanley R. // Journal of Computer-Aided Molecular Design;Jul2009, Vol. 23 Issue 7, p411 

    The success of structure-based drug design relies on accurate protein modeling where one of the key issues is the modeling and refinement of loops. This study takes a critical look at modeled loops, determining the effect of re-sampling side-chains after the loop conformation has been generated....

