Clustering Method Selection in High-Dimension?

If the data to cluster are literal points in 2D (x, y) or 3D (x, y, z), choosing a clustering method is fairly intuitive: we can plot and inspect them, so we have a much better sense of which method is suitable.

e.g. 1: If my 2D data set looks like the one shown below, I would know that k-means is probably not a wise choice here, whereas DBSCAN seems like a better idea.

[image: example 2D data set]

However, just as the scikit-learn website states:

While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.

AFAIK, most practical problems don't involve such simple data. More likely, the data are high-dimensional tuples that cannot be visualized this way.

e.g. 2: I wish to cluster a data set where each data point is a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT plot it in a coordinate system and observe its distribution as before, so I will NOT be able to say that DBSCAN is superior to k-means in this case.

So my question:

How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?

Sibbs Gambling asked Sep 16 '13



2 Answers

"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).

4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
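As a minimal sketch of that scatter-plot-matrix idea (assuming pandas, matplotlib, and scikit-learn are available), you can render all pairwise 2d projections of the 4-dimensional iris data in a few lines:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data  # 4 columns: sepal/petal length and width

# One 4x4 grid of pairwise 2d projections, colored by species,
# with histograms of each feature on the diagonal.
axes = scatter_matrix(df, c=iris.target, figsize=(8, 8), diagonal="hist")
plt.savefig("iris_scatter_matrix.png")
```

Each off-diagonal panel is one of the 2d projections mentioned above, which is often enough to judge whether the clusters look spherical (k-means-friendly) or not.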

However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
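To illustrate why preprocessing and the distance function matter (a sketch, with scikit-learn assumed; the eps values are arbitrary, not tuned recommendations): standardize the features first, and note that DBSCAN accepts a non-Euclidean metric directly.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data
# Standardize so no single feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

# Euclidean distance on raw data vs. cosine distance on standardized data
# can produce very different clusterings of the same points.
labels_raw = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
labels_cos = DBSCAN(eps=0.1, min_samples=5, metric="cosine").fit_predict(X_scaled)
```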

If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.

  • Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.

The authors of that survey also publish a software framework that includes many of these advanced clustering methods (not just k-means, but e.g. CASH, 4C, ERiC): ELKI

Has QUIT--Anony-Mousse answered Oct 23 '22


There are at least two common, generic approaches:

  1. One can use a dimensionality reduction technique to actually visualize the high-dimensional data; there are dozens of popular options, including (but not limited to):

    • PCA - principal component analysis
    • SOM - self-organizing maps
    • Sammon's mapping
    • Autoencoder Neural Networks
    • KPCA - kernel principal component analysis
    • Isomap

    After this, one either goes back to the original space and applies whatever technique seems reasonable based on observations in the reduced space, or performs the clustering in the reduced space itself. The first approach uses all available information, but can be misled by distortions introduced by the reduction. The second ensures that your observations and choice are valid (as you reduce your problem to the nice 2d/3d one), but it loses a lot of information in the transformation.

  2. One tries many different algorithms and chooses the one with the best metric (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (since reducing the dimensionality changes the information according to the transformation used).
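The second approach can be sketched with scikit-learn in a few lines (the candidate algorithms, their parameters, and the silhouette score are just one illustrative choice, not a definitive recipe), with a 2d PCA variant of the first approach thrown in for comparison:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = StandardScaler().fit_transform(load_iris().data)  # 4-d data, standardized

candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.9, min_samples=5),
}

scores = {}
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:  # silhouette needs at least 2 distinct labels
        scores[name] = silhouette_score(X, labels)

# Variant of approach 1: reduce to 2d with PCA, then cluster there.
X2 = PCA(n_components=2).fit_transform(X)
labels2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
scores["k-means on 2-d PCA"] = silhouette_score(X2, labels2)

best = max(scores, key=scores.get)  # highest silhouette wins
```

Note that internal metrics like the silhouette score have their own biases (e.g. a preference for convex clusters), so the "best" score is evidence, not proof.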

lejlot answered Oct 23 '22