Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory

Bull, Simon C.; Muldoon, Mark R.; Doig, Andrew J.
February 2013
PLoS ONE;Feb2013, Vol. 8 Issue 2, p1
Academic Journal
Analysis of protein data sets often requires prior removal of redundancy, so that data is not biased by containing similar proteins. This is usually achieved by pairwise comparison of sequences, followed by purging so that no two pairs have similarities above a chosen threshold. From a starting set, such as the PDB or a genome, one should remove as few sequences as possible, to give the largest possible non-redundant set for subsequent analysis. Protein redundancy can be represented as a graph, with proteins as nodes connected by undirected edges, if they have a pairwise similarity above the chosen threshold. The problem is then equivalent to finding the maximum independent set (MIS), where as few nodes are removed as possible to remove all edges. We tested seven MIS algorithms, three of which are new. We applied the methods to the PDB, subsets of the PDB, various genomes and the BHOLSIB benchmark datasets. For PDB subsets of up to 1000 proteins, we could compare to the exact MIS, found by the Cliquer algorithm. The best algorithm was the new method, Leaf. This works by adding clique members that have no edges to nodes outside the clique to the MIS, starting with the smallest cliques. For PDB subsets of up to 1000 members, it usually finds the MIS and is fast enough to apply to data sets of tens of thousands of proteins. Leaf gives sets that are around 10% larger than the commonly used PISCES algorithm, that are of identical quality. We therefore suggest that Leaf should be the method of choice for generating non-redundant protein data sets, though it is ineffective on dense graphs, such as the BHOLSIB benchmarks. The Leaf algorithm is available at: https://github.com/SimonCB765/Leaf, and sets from genomes and the PDB are available at: http://www.bioinf.manchester.ac.uk/leaf/.


Related Articles

  • Are graph databases ready for bioinformatics? Have, Christian Theil; Jensen, Lars Juhl // Bioinformatics;Dec2013, Vol. 29 Issue 24, p3107 

    Contact: Lars.Juhl.Jensen@gmail.com

  • Polytomy refinement for the correction of dubious duplications in gene trees. Lafond, Manuel; Chauve, Cedric; Dondi, Riccardo; El-Mabrouk, Nadia // Bioinformatics;Sep2014, Vol. 30 Issue 17, pi519 

    Motivation: Large-scale methods for inferring gene trees are error-prone. Correcting gene trees for weakly supported features often results in non-binary trees, i.e. trees with polytomies, thus raising the natural question of refining such polytomies into binary trees. A feature pointing toward...

  • Accelerated protein structure comparison using TM-score-GPU. Hung, Ling-Hong; Samudrala, Ram // Bioinformatics;Aug2012, Vol. 28 Issue 16, p2191 

    Motivation: Accurate comparisons of different protein structures play important roles in structural biology, structure prediction and functional annotation. The root-mean-square-deviation (RMSD) after optimal superposition is the predominant measure of similarity due to the ease and speed of...

  • Communication Routes in ARID Domains between Distal Residues in Helix 5 and the DNA-Binding Loops. Invernizzi, Gaetano; Tiberti, Matteo; Lambrughi, Matteo; Lindorff-Larsen, Kresten; Papaleo, Elena // PLoS Computational Biology;Sep2014, Vol. 10 Issue 9, p1 

    ARID is a DNA-binding domain involved in several transcriptional regulatory processes, including cell-cycle regulation and embryonic development. ARID domains are also targets of the Human Cancer Protein Interaction Network. Little is known about the molecular mechanisms related to...

  • Amply regular graphs whose local subgraphs are pseudogeometric graphs for pG( s, t). Gutnova, A.; Makhnev, A. // Doklady Mathematics;Jan2014, Vol. 89 Issue 1, p38 

    The article focuses on the amply regular graphs whose local subgraphs are pseudogeometric graphs.

  • On graphs in which all neighborhoods of vertices are locally pseudocyclic graphs. Kabanov, V.; Makhnev, A. // Doklady Mathematics;Jan2014, Vol. 89 Issue 1, p76 

    The article focuses on the graphs in which all neighborhoods of vertices are locally pseudocyclic graphs.

  • BioAssemblyModeler (BAM): User-Friendly Homology Modeling of Protein Homo- and Heterooligomers. Shapovalov, Maxim V.; Wang, Qiang; Xu, Qifang; Andrake, Mark; Dunbrack Jr, Roland L. // PLoS ONE;Jun2014, Vol. 9 Issue 6, p1 

    Many if not most proteins function in oligomeric assemblies of one or more protein sequences. The Protein Data Bank provides coordinates for biological assemblies for each entry, at least 60% of which are dimers or larger assemblies. BioAssemblyModeler (BAM) is a graphical user interface to the...

  • Protein structure networks. Greene, Lesley H. // Briefings in Functional Genomics;Nov2012, Vol. 11 Issue 6, p469 

    The application of the field of network science to the scientific disciplines of structural biology and biochemistry, have yielded important new insights into the nature and determinants of protein structures, function, dynamics and the folding process. Advancements in further understanding...

  • CLCAs - A Family of Metalloproteases of Intriguing Phylogenetic Distribution and with Cases of Substituted Catalytic Sites Lenart, Anna; Dudkiewicz, Małgorzata; Grynberg, Marcin; Pawłowski, Krzysztof // PLoS ONE;May2013, Vol. 8 Issue 5, p1 

    The zinc-dependent metalloproteases with His-Glu-x-x-His (HExxH) active site motif, zincins, are a broad group of proteins involved in many metabolic and regulatory functions, and found in all forms of life. Human genome contains more than 100 genes encoding proteins with known zincin-like...


Read the Article


Sorry, but this item is not currently available from your library.

Try another library?
Sign out of this library

Other Topics