An effective web document clustering algorithm based on bisection and merge

Ingyu Lee; Byung-Won On
June 2011
Artificial Intelligence Review;Jun2011, Vol. 36 Issue 1, p69
Academic Journal
To cluster web documents that all contain the same named entities, we first tried existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms turned out to be ineffective for this task. Our investigation found that clustering such web pages is harder because (1) the number of clusters (as given by the ground truth) is larger than the two or three typical of general clustering problems and (2) the cluster sizes in the data set are extremely skewed. To overcome these problems, in this paper we propose an effective clustering algorithm that boosts the accuracy of K-means and spectral clustering. In particular, to deal with skewed cluster-size distributions, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
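
As a rough illustration of the building blocks named in the abstract, the sketch below computes the normalized cut of a two-way partition and performs a single spectral bisection of a similarity graph. It is not the authors' algorithm: the function names (normalized_cut, spectral_bisect), the dense NumPy matrix W, and the choice to split at the sign of the second eigenvector are all assumptions made for illustration, and the merge step is not shown because the abstract does not specify its criterion.

```python
import numpy as np

def normalized_cut(W, mask):
    """Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V).

    W is a symmetric similarity matrix; mask is True for nodes in A.
    """
    mask = np.asarray(mask, dtype=bool)
    cut = W[mask][:, ~mask].sum()   # total weight crossing the partition
    assoc_a = W[mask].sum()         # total degree of nodes in A
    assoc_b = W[~mask].sum()        # total degree of nodes in B
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")         # degenerate split with an empty side
    return cut / assoc_a + cut / assoc_b

def spectral_bisect(W):
    """Split a graph in two using the second eigenvector of the
    symmetric normalized Laplacian (assumes every node has degree > 0)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)     # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]   # map back to the generalized problem
    return fiedler > 0                  # assumed threshold: sign of the entry

# Hypothetical toy graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                W[i, j] = 1.0
W[2, 3] = W[3, 2] = 0.01

mask = spectral_bisect(W)
print(mask, normalized_cut(W, mask))
```

On this toy graph the bisection separates the two triangles, and the resulting normalized cut is far smaller than that of an arbitrary split. Repeating such a bisection on the resulting parts, and comparing normalized cuts before accepting or undoing a split, is one plausible way to read "bisection and merge", but that reading is an assumption.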


Related Articles

  • QUALITY EVALUATION OF CLUSTERING ALGORITHMS. Sambyalov, Z. G.; Bilgaeva, L. P. // Bulletin of the East Siberian State University of Technology / V;Nov/Dec2013, Vol. 45 Issue 6, p54 

    This article describes the most well-known and widely used clustering algorithms designed to handle numerical and categorical data. Algorithms were tested on artificial and real data. According to test results clustering quality assessment and scenario of clustering algorithms with a view of...

  • HYBRID ANT-BASED CLUSTERING ALGORITHM WITH CLUSTER ANALYSIS TECHNIQUES. Omar, Wafa'a; Badr, Amr; El-Fattah Hegazy, Abd // Journal of Computer Science;Jun2013, Vol. 9 Issue 6, p780 

    Cluster analysis is a data mining technology designed to derive a good understanding of data to solve clustering problems by extracting useful information from a large volume of mixed data elements. Recently, researchers have aimed to derive clustering algorithms from nature's swarm behaviors....

  • MULTI-DENSITY DBSCAN USING REPRESENTATIVES: MDBSCAN-UR. Ahmed, Rwand; El-Zaza, Eman; Ashour, Wesam // Computing & Information Systems;Oct2011, Vol. 15 Issue 2, p1 

    DBSCAN is one of the most popular algorithms for cluster analysis. It can discover clusters with arbitrary shape and separate noise. But this algorithm cannot choose its parameter according to the distribution of the dataset. It simply uses the global minimum number of points (MinPts) parameter,...

  • AVOIDING NOISE AND OUTLIERS IN K-MEANS. Jnena, Rami; Timraz, Mohammed; Ashour, Wesam // Computing & Information Systems;Oct2011, Vol. 15 Issue 2, p1 

    Applying k-means algorithm on the datasets that include large number of noise and outlier objects, gives unclear clusters results. In this paper we proposed a new technique for avoiding these noise and outliers by applying some preprocessing and post processing steps for the dataset that have to...

  • K-Means for Spherical Clusters with Large Variance in Sizes. Fahim, A. M.; Saake, G.; Salem, A. M.; Torkey, F. A.; Ramadan, M. A. // International Journal of Computer Science;2009, Vol. 4 Issue 3, p145 

    Data clustering is an important data exploration technique with many applications in data mining. The k-means algorithm is well known for its efficiency in clustering large data sets. However, this algorithm is suitable for spherical shaped clusters of similar sizes and densities. The quality of...

  • DERIVING CLUSTER KNOWLEDGE USING ROUGH SET THEORY. Upadhyaya, Shuchita; Arora, Alka; Jain, Rajni // Journal of Theoretical & Applied Information Technology;2008, Vol. 4 Issue 8, p688 

    Clustering algorithms gives general description of the clusters listing number of clusters and member entities in those clusters. It lacks in generating cluster description in the form of pattern. Deriving pattern from clusters along with grouping of data into clusters is important from data...

  • Avoiding Objects with few Neighbors in the K-Means Process and Adding ROCK Links to Its Distance. Alnabriss, Hadi A.; Ashour, Wesam // International Journal of Computer Applications;Aug2011, Vol. 28, p12 

    K-means is considered as one of the most common and powerful algorithms in data clustering, in this paper we're going to present new techniques to solve two problems in the K-means traditional clustering algorithm, the 1st problem is its sensitivity for outliers, in this part we are going to...

  • Semantic based Document Clustering: A Detailed Review. Shah, Neepa; Mahajan, Sunita // International Journal of Computer Applications;8/15/2012, Vol. 52, p42 

    Document clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm where clustering methods try to identify inherent groupings of the text documents, so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low...

  • Impact of Datamining Techniques in Forecasting Plant Disease. Rajalakshmi, R.; Uma, M.; Thangadurai, K.; Punithavalli, M. // International Journal of Advanced Research in Computer Science;Nov/Dec2012, Vol. 3 Issue 6, p187 

    In this article a challenge has been made to analysis the explore studies on significance of data mining techniques in the field of agriculture. Couple of the techniques, such as decision algorithms ID3, the CHAID algorithm, C4.5, and Cluster analysis applied in the field of agriculture was...

