Clustering Method Selection in High-Dimension?

If the data to cluster are literal points in 2D (x, y) or 3D (x, y, z), choosing a clustering method is fairly intuitive: we can plot and inspect them, so we have a much better sense of which method is suitable.

e.g. 1: If my 2D data set looks like the one shown below, I would know that k-means is probably not a wise choice here, whereas DBSCAN seems like a better idea.

[image: example 2D data set]

However, just as the scikit-learn website states:

While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.

AFAIK, most practical problems don't involve such simple data. More likely, the data are high-dimensional tuples that cannot be visualized this way.

e.g. 2: I wish to cluster a data set where each data point is a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT plot it in a coordinate system and observe its distribution as before, so I will NOT be able to say that DBSCAN is superior to k-means in this case.

So my question:

How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?

Sibbs Gambling asked Sep 16 '13



2 Answers

"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).

4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
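As a minimal sketch of that scatter-plot-matrix idea (assuming pandas, matplotlib, and scikit-learn are available), you can render all pairwise 2d projections of the 4-dimensional iris data in a few lines:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data  # 4 columns: sepal/petal length and width

# One 4x4 grid of pairwise 2d projections, colored by species,
# with histograms of each feature on the diagonal.
axes = scatter_matrix(df, c=iris.target, figsize=(8, 8), diagonal="hist")
plt.savefig("iris_scatter_matrix.png")
```

Each off-diagonal panel is one of the 2d projections mentioned above, which is often enough to judge whether the clusters look spherical (k-means-friendly) or not.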

However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
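To illustrate why preprocessing and the distance function matter (a sketch, with scikit-learn assumed; the eps values are arbitrary, not tuned recommendations): standardize the features first, and note that DBSCAN accepts a non-Euclidean metric directly.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data
# Standardize so no single feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

# Euclidean distance on raw data vs. cosine distance on standardized data
# can produce very different clusterings of the same points.
labels_raw = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
labels_cos = DBSCAN(eps=0.1, min_samples=5, metric="cosine").fit_predict(X_scaled)
```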

If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.

  • Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.

The authors of that survey also publish a software framework that includes many of these advanced clustering methods (not just k-means, but e.g. CASH, 4C, ERiC): ELKI

Has QUIT--Anony-Mousse answered Oct 23 '22


There are at least two common, generic approaches:

  1. One can use a dimensionality reduction technique to actually visualize the high-dimensional data; there are dozens of popular options, including (but not limited to):

    • PCA - principal component analysis
    • SOM - self-organizing maps
    • Sammon's mapping
    • Autoencoder Neural Networks
    • KPCA - kernel principal component analysis
    • Isomap

    After this, one either goes back to the original space and applies whatever technique seems reasonable based on observations in the reduced space, or performs the clustering in the reduced space itself. The first approach uses all available information, but can be misled by distortions introduced by the reduction. The second ensures that your observations and choice are valid (as you reduce your problem to the nice 2d/3d one), but it loses a lot of information in the transformation.

  2. One tries many different algorithms and chooses the one with the best metric (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (since reducing the dimensionality changes the information according to the transformation used).
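The second approach can be sketched with scikit-learn in a few lines (the candidate algorithms, their parameters, and the silhouette score are just one illustrative choice, not a definitive recipe), with a 2d PCA variant of the first approach thrown in for comparison:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = StandardScaler().fit_transform(load_iris().data)  # 4-d data, standardized

candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.9, min_samples=5),
}

scores = {}
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:  # silhouette needs at least 2 distinct labels
        scores[name] = silhouette_score(X, labels)

# Variant of approach 1: reduce to 2d with PCA, then cluster there.
X2 = PCA(n_components=2).fit_transform(X)
labels2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
scores["k-means on 2-d PCA"] = silhouette_score(X2, labels2)

best = max(scores, key=scores.get)  # highest silhouette wins
```

Note that internal metrics like the silhouette score have their own biases (e.g. a preference for convex clusters), so the "best" score is evidence, not proof.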

lejlot answered Oct 23 '22