
Which algorithm and what combination of hyper-parameters will be the best to cluster this data?

I was learning about non-linear clustering algorithms and came across this 2-D graph. I was wondering which clustering algorithm and combination of hyper-parameters would cluster this data well.

Plot

Just as a human would pick out those 5 spikes as clusters, I want my algorithm to do the same. I tried KMeans, but it only split the data horizontally or vertically. I then started using GMM but couldn't get the hyper-parameters right for the desired clustering.

rrm_2016 asked May 31 '19 12:05



2 Answers

If clustering doesn't work, always try to improve the preprocessing first. Algorithms such as k-means are very sensitive to scaling, so the scaling needs to be chosen carefully.
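To illustrate the scaling point, here is a minimal sketch that standardizes the features before running k-means with scikit-learn. The data is made up for illustration (two features on very different scales), and `n_clusters=3` is an arbitrary choice, not something taken from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature has a much larger scale than the first,
# so unscaled Euclidean distances would be dominated by it.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])

# Standardize each feature to zero mean and unit variance, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# The scaled data that k-means actually saw.
X_scaled = pipeline.named_steps["standardscaler"].transform(X)
print(X_scaled.std(axis=0))
```

Whether standardization is the right preprocessing depends on the data; for spike-shaped clusters the relative scale of the two axes can change which points look "close", so it is worth comparing results with and without it.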

GMM is clearly your first choice here. It may be worth trying out different tools. R's Mclust is very slow. Sklearn's GMM is sometimes unstable. ELKI is a bit harder to get started with, but its EM usually gave me the best results.
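As a starting point with scikit-learn, a sketch along these lines may help. The `make_spikes` helper is hypothetical and only stands in for the plotted data (thin, line-shaped clusters radiating from the origin), since the original points are not available. The settings are assumptions too: `covariance_type="full"` lets each component stretch along its own direction, and `n_init=10` reruns EM from several initializations to reduce the instability mentioned above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def make_spikes(n_spikes=5, n_per_spike=200):
    """Hypothetical stand-in for the plotted data: thin spikes from the origin."""
    points = []
    for k in range(n_spikes):
        angle = np.pi * k / n_spikes                      # direction of this spike
        r = rng.uniform(1.0, 5.0, n_per_spike)            # position along the spike
        along = np.outer(r, [np.cos(angle), np.sin(angle)])
        noise = rng.normal(scale=0.05, size=(n_per_spike, 2))
        points.append(along + noise)
    return np.vstack(points)

X = make_spikes()

# One elongated Gaussian per spike; several restarts, keeping the best fit.
gmm = GaussianMixture(
    n_components=5,
    covariance_type="full",
    n_init=10,
    random_state=0,
)
labels = gmm.fit_predict(X)
print(np.bincount(labels, minlength=5))  # points assigned to each component
```

If the components still cut across spikes on the real data, increasing `n_init` or trying a different `init_params` setting are the first knobs I would look at.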

Apart from GMM, it likely is worth trying out correlation clustering. These algorithms assume there is some manifold (e.g., a line) on which a cluster exists. Examples include ORCLUS, LMCLUS, CASH, 4C, ... But in my opinion these mostly work for synthetic toy data.

Has QUIT--Anony-Mousse answered Sep 22 '22 16:09


I suggest trying hierarchical clustering. In the agglomerative approach, each point starts out as its own cluster, and the closest clusters are then merged step by step based on the distance between them.
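A minimal sketch of the agglomerative approach with scikit-learn follows. The data is a hypothetical stand-in (five thin, well-separated vertical spikes), and `linkage="single"` is my assumption rather than part of the suggestion above: single linkage merges clusters by their closest pair of points, which tends to follow elongated shapes but is sensitive to noise points bridging two clusters.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical data: five thin vertical spikes at x = 0, 1, 2, 3, 4.
rng = np.random.default_rng(0)
spikes = []
for k in range(5):
    x = k + rng.normal(scale=0.03, size=100)   # small horizontal jitter
    y = rng.uniform(0.0, 5.0, size=100)        # spread along the spike
    spikes.append(np.column_stack([x, y]))
X = np.vstack(spikes)

# Start from one cluster per point and merge until 5 clusters remain.
model = AgglomerativeClustering(n_clusters=5, linkage="single")
labels = model.fit_predict(X)
print(np.bincount(labels))  # cluster sizes
```

On real data it is worth comparing `linkage="single"` with `"ward"`, `"complete"`, and `"average"`, since they can give very different partitions for the same points.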

Abhineet Gupta answered Sep 19 '22 16:09