
Mini-batch k-means returns less than k clusters

I've been working with the scikit-learn implementation of mini-batch k-means to cluster datasets of about 45000 observations with about 170 features each. I noticed that the algorithm has trouble returning the specified number of clusters as k grows: once k exceeds roughly 30% of the number of observations in the dataset (30% of 45000), the number of clusters actually returned stops increasing.

I was wondering if this has to do with the way the algorithm was implemented in scikit-learn or if it has to do with its definition. I've been studying the paper where it was proposed but I can't figure out why this would happen.

Has anyone experienced this? Does anyone know how to explain this behavior?
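For reference, here is a scaled-down sketch of what I'm running (uniform random data standing in for my real dataset, and smaller sizes so it finishes quickly):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Stand-in for the real data (45000 x 170 takes a while at large k)
    rng = np.random.RandomState(0)
    X = rng.rand(4500, 17)

    # Ask for increasingly large k and count the distinct labels returned
    for k in (100, 500, 1500, 2000):
        mbk = MiniBatchKMeans(n_clusters=k, batch_size=1000, random_state=0).fit(X)
        print(k, len(np.unique(mbk.labels_)))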

asked Jul 23 '14 by c_david

People also ask

What is the difference between k-means and mini batch k-means clustering?

Mini-batch K-means is faster but gives slightly different results from normal batch K-means. Here we cluster a set of data, first with K-means and then with mini-batch K-means, and plot the results. We will also plot the points that are labeled differently between the two algorithms.
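A minimal sketch of that comparison (synthetic blob data, plotting omitted; pairwise_distances_argmin is one way to align the two sets of centers before counting disagreements):

    import numpy as np
    from sklearn.cluster import KMeans, MiniBatchKMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import pairwise_distances_argmin

    # Three well-separated synthetic blobs
    X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)

    km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
    mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, random_state=0).fit(X)

    # Align the two sets of centers, then count points labeled differently
    order = pairwise_distances_argmin(km.cluster_centers_, mbk.cluster_centers_)
    mbk_labels = pairwise_distances_argmin(X, mbk.cluster_centers_[order])
    print(np.sum(km.labels_ != mbk_labels), "of", len(X), "points labeled differently")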

What is mini batch k-means clustering?

The Mini-batch K-means clustering algorithm is a version of the standard K-means algorithm in machine learning. Instead of using the full dataset, it keeps small, random, fixed-size batches of data in memory; on each iteration, a new random sample is drawn and used to update the cluster centers.
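The core per-center update from Sculley's web-scale k-means paper can be sketched in a few lines of NumPy (an illustration only, not the scikit-learn implementation, which adds k-means++ initialization, reassignment of tiny clusters, and convergence checks):

    import numpy as np

    def minibatch_kmeans(X, k, batch_size=100, n_iter=100, seed=0):
        rng = np.random.RandomState(seed)
        centers = X[rng.choice(len(X), k, replace=False)].astype(float)
        counts = np.zeros(k)  # samples absorbed by each center so far
        for _ in range(n_iter):
            batch = X[rng.choice(len(X), batch_size)]  # random mini-batch
            # nearest center for each point in the batch
            labels = ((batch[:, None, :] - centers) ** 2).sum(-1).argmin(1)
            for x, j in zip(batch, labels):
                counts[j] += 1
                eta = 1.0 / counts[j]  # per-center learning rate decays over time
                centers[j] = (1.0 - eta) * centers[j] + eta * x
        return centers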

Why k-means clustering fail to give good results?

The K-means clustering algorithm fails to give good results when the data contains outliers, when the density of the data points varies across the data space, or when the data points follow non-convex shapes.
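The non-convex case is easy to demonstrate with scikit-learn's make_moons; a quick sketch:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons
    from sklearn.metrics import adjusted_rand_score

    # Two interleaving half-moons: non-convex clusters
    X, y = make_moons(n_samples=500, noise=0.05, random_state=0)
    labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
    print(adjusted_rand_score(y, labels))  # well below 1.0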

What is the main disadvantages of k-means clustering method?

k-means has trouble clustering data where clusters are of varying sizes and density; to cluster such data, you need to generalize k-means. It also struggles with outliers: centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored.
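A toy illustration of the outlier effect (synthetic data; with k-means++ initialization the extreme point will often be picked as a center and claim its own cluster):

    import numpy as np
    from sklearn.cluster import KMeans

    # Two tight groups plus one extreme outlier
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 0.1, (50, 2)),
                   rng.normal(5, 0.1, (50, 2)),
                   [[100.0, 100.0]]])
    km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
    print(km.cluster_centers_)  # the outlier typically gets its own centroid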


1 Answer

k-means can fail in the sense that clusters can disappear.

This is most evident when you have a lot of duplicates.

If all your data points are identical, why should there be more than one (non-empty) cluster, ever?

It's not specific to mini-batch k-means as far as I can tell. Some implementations let you specify what to do when a cluster degenerates, e.g. use the farthest point as the new cluster center, discard the cluster, or leave it unchanged (maybe it will pick up a point again).
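A toy illustration of the duplicate case with scikit-learn's MiniBatchKMeans (synthetic data: with only three distinct points there is no way to keep ten clusters populated):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Only 3 distinct points, each duplicated 100 times
    X = np.repeat(np.array([[0.0], [5.0], [10.0]]), 100, axis=0)

    mbk = MiniBatchKMeans(n_clusters=10, random_state=0, n_init=3).fit(X)
    print(len(np.unique(mbk.labels_)))  # far fewer than the 10 requested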

answered Nov 08 '22 by Has QUIT--Anony-Mousse