Has Large-Scale Named-Entity Network Analysis Been Resting on a Flawed Assumption?

Fegley, Brent D.; Torvik, Vetle I.
July 2013
PLoS ONE; Jul 2013, Vol. 8, Issue 7, p. 1
Academic Journal
The assumption that a name uniquely identifies an entity introduces two types of errors: splitting treats one entity as two or more (because of name variants); lumping treats multiple entities as if they were one (because of shared names). Here we investigate the extent to which splitting and lumping affect commonly-used measures of large-scale named-entity networks within two disambiguated bibliographic datasets: one for co-author names in biomedicine (PubMed, 2003–2007); the other for co-inventor names in U.S. patents (USPTO, 2003–2007). In both cases, we find that splitting has relatively little effect, whereas lumping has a dramatic effect on network measures. For example, in the biomedical co-authorship network, lumping (based on last name and both initials) drives several measures down: the global clustering coefficient by a factor of 4 (from 0.265 to 0.066); degree assortativity by a factor of ∼13 (from 0.763 to 0.06); and average shortest path by a factor of 1.3 (from 5.9 to 4.5). These results can be explained in part by the fact that lumping artificially creates many intransitive relationships and high-degree vertices. This effect of lumping is much less dramatic but persists with measures that give less weight to high-degree vertices, such as the mean local clustering coefficient and log-based degree assortativity. Furthermore, the log-log distribution of collaborator counts follows a much straighter line (power law) with splitting and lumping errors than without, particularly at the low and the high counts. This suggests that part of the power law often observed for collaborator counts in science and technology reflects an artifact: name ambiguity.
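The lumping mechanism described in the abstract can be illustrated with a minimal sketch. The toy papers, author identifiers, and function names below are hypothetical and are not taken from the PubMed or USPTO datasets; the sketch only shows how merging two distinct people who share a name creates a high-degree vertex with intransitive relationships, which lowers the global clustering coefficient (closed triples divided by all connected triples).

```python
from itertools import combinations

def build_graph(papers, key=lambda author: author):
    """Build co-authorship adjacency sets; `key` maps an author record to a vertex label."""
    adj = {}
    for authors in papers:
        labels = {key(a) for a in authors}
        for v in labels:
            adj.setdefault(v, set())
        for u, v in combinations(sorted(labels), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def global_clustering(adj):
    """Global clustering coefficient: closed triples / all connected triples."""
    closed = 0
    triples = 0
    for v, nbrs in adj.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2          # triples centered on v
        for u, w in combinations(sorted(nbrs), 2):
            if w in adj[u]:                  # the two neighbors are also linked
                closed += 1
    return closed / triples if triples else 0.0

# Hypothetical toy data: (disambiguated id, name as printed on the paper).
# "a1" and "a2" are different people who share the name "Smith J".
papers = [
    [("a1", "Smith J"), ("b1", "Lee K"), ("c1", "Chen M")],
    [("a2", "Smith J"), ("d1", "Garcia R"), ("e1", "Patel S")],
]

# Disambiguated network: one vertex per person.
true_graph = build_graph(papers)
# Lumped network: one vertex per name, so both "Smith J" records collapse.
lumped_graph = build_graph(papers, key=lambda author: author[1])

true_cc = global_clustering(true_graph)
lumped_cc = global_clustering(lumped_graph)
lumped_max_degree = max(len(n) for n in lumped_graph.values())

print(f"disambiguated: {true_cc:.2f}")   # 1.00
print(f"lumped:        {lumped_cc:.2f}") # 0.60
```

In this toy case the two separate triangles become a single component joined at the lumped "Smith J" vertex, whose four neighbors are only partly connected to each other. The size of the drop here is an artifact of the toy data and says nothing about the magnitudes reported in the abstract.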


