Variable Selection for Clustering and Classification

Andrews, Jeffrey; McNicholas, Paul
July 2014
Journal of Classification;Jul2014, Vol. 31 Issue 2, p136
Academic Journal
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.


Related Articles

  • On L1 Bounds for Asymptotic Normality of Some Weakly Dependent Random Variables. J. Sunklodas // Acta Applicandae Mathematica;May2008, Vol. 102 Issue 1, p87

    In the present paper, we consider L1 bounds for asymptotic normality for the sequence of r.v.’s X1, X2, … (not necessarily stationary) satisfying the ψ-mixing condition. The L1 bounds have been obtained in terms of Lyapunov fractions...

  • On Cluster Algebras Arising from Unpunctured Surfaces. Schiffler, Ralf; Thomas, Hugh // IMRN: International Mathematics Research Notices;Sep2009, Vol. 2009 Issue 17, p3160 

    We study cluster algebras that are associated to unpunctured surfaces, with coefficients arising from boundary arcs. We give a direct formula for the Laurent polynomial expansion of cluster variables in these cluster algebras in terms of certain paths on a triangulation of the surface. As an...

  • The negative association property for the absolute values of random variables equidistributed on a generalized Orlicz ball. Marcin Pilipczuk; Jakub Wojtaszczyk // Positivity;Jul2008, Vol. 12 Issue 3, p421 

    Random variables equidistributed on convex bodies have received quite a lot of attention in the last few years. In this paper we prove the negative association property (which generalizes the subindependence of coordinate slabs) for generalized Orlicz balls. This allows us to...

  • Classification of Multivariate Objects Using Interval Quantile Classes. Młodak, Andrzej // Journal of Classification;2011, Vol. 28 Issue 3, p327 

    The paper contains a proposal of interval data clustering related to given social and economic objects characterized by many interval variables. This multivariate approach is based on an original conception of interval quantiles constructed using a special definition derived from the notion of...

  • On expansions of numbers in alternating s-adic series and Ostrogradskii series of the first and second kind. Prats'ovyta, I. M. // Ukrainian Mathematical Journal;Jul2009, Vol. 61 Issue 7, p1137 

    We present expansions of real numbers in alternating s-adic series (1 < s ∈ N), in particular, s-adic Ostrogradskii series of the first and second kind. We study the “geometry” of this representation of numbers and solve metric and probability problems, including the problem of...

  • SOME GOODNESS OF FIT TESTS FOR RANDOM SEQUENCES. Kozachenko, Yuriy; Ianevych, Tetiana // Lithuanian Journal of Statistics;2013, Vol. 52 Issue 1, p5 

    In this paper we attempt to incorporate results from the theory of square Gaussian random variables in order to construct goodness-of-fit tests for random sequences (time series). We considered two versions of such tests. The first one was designed for testing the adequacy of...

  • Reliability of Fatigue Damaged Structure Using FORM, SORM and Fatigue Model. Ouk Sub Lee; Dong Hyeok Kim // World Congress on Engineering 2007 (Volume 1);2007, p1322 

    The methodologies to calculate failure probability and to estimate the reliability of fatigue loaded structures are developed. The applicability of the methodologies is evaluated with the help of the fatigue crack growth models suggested by Paris and Walker. The probability theories such as the...

  • Generation of Four-Mode Continuous-Variable Cluster States. Ukai, R.; Yukawa, M.; Armstrong, S. C.; Yoshikawa, J.; van Loock, P.; Furusawa, A. // AIP Conference Proceedings;4/13/2009, Vol. 1110 Issue 1, p137 

    Cluster states are sufficient resources for realizing quantum computation. Their implementations can be achieved via either discrete-variable systems (especially qubit systems) or continuous-variable systems. Here we report on the experimental generation of an important example of a...

  • Reuse of imputed data in microarray analysis increases imputation efficiency. Ki-Yeol Kim; Byoung-Jin Kim; Gwan-Su Yi // BMC Bioinformatics;2004, Vol. 5, p160 

    Background: The imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analysis require a complete data set. A few imputation methods for DNA microarray data have been introduced, but the efficiency of the...

