 

How to set a minimum number of observations per cluster in k-means clustering?

I am trying to cluster some products based on users' behavior. What I end up with are clusters with very different numbers of observations.

I have checked k-means clustering parameters and was not able to find a parameter that controls the minimum (or maximum) number of observations per cluster.

For example, here is how the observations are distributed across the clusters:

cluster_id   num_observations
0            6
1            4
2            1
3            3
4            29
5            5

How can I deal with this issue?

asked May 01 '19 00:05 by aghd


2 Answers

For those still looking for an answer: I found a module that deals with this kind of problem.

Install it with pip install size-constrained-clustering or pip install git+https://github.com/jingw2/size_constrained_clustering.git, then use MinMaxKMeansMinCostFlow, which lets you set size_min and size_max:

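import numpy as np
from size_constrained_clustering import minmax

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)  # random 2-D points in [0, 1)
# Each cluster must contain between 400 and 800 points.
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_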

answered Nov 12 '22 01:11 by I_Al-thamary


This can be solved with the k-means-constrained library, which is installable with pip.

Example:

>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...                [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
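
Whichever library you use, it is worth confirming that the returned labels really respect the bounds you asked for. The following is a minimal sketch using only the standard library; the helper names cluster_sizes and size_violations are hypothetical and not part of k-means-constrained or any other package.

```python
from collections import Counter


def cluster_sizes(labels, n_clusters):
    """Return {cluster_id: count}, including clusters that received no points."""
    counts = Counter(int(label) for label in labels)
    return {k: counts.get(k, 0) for k in range(n_clusters)}


def size_violations(labels, n_clusters, size_min, size_max):
    """Return {cluster_id: size} for clusters outside [size_min, size_max]."""
    sizes = cluster_sizes(labels, n_clusters)
    return {k: n for k, n in sizes.items() if not size_min <= n <= size_max}


# Labels from the example above: both clusters hold 3 points.
labels = [0, 0, 0, 1, 1, 1]
print(cluster_sizes(labels, n_clusters=2))                            # {0: 3, 1: 3}
print(size_violations(labels, n_clusters=2, size_min=2, size_max=5))  # {}
```

Passing clf.labels_ (or model.labels_ from the other answer) instead of the hard-coded list should work the same way, since the helpers only iterate over the labels and convert each one to int.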

answered Nov 11 '22 23:11 by Gihan Gamage