Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for clustering with minimum size constraints

I have a set of data clustering into k groups, each cluster has a minimum size constraint of m

I've done some reclustering of the data. So now I got this set of points that each one has one or more better clusters to be in, but cannot be switched individually because it'll violate the size constraint.

Goal: minimize the sum of distance from each point to its cluster center.

Subject to: Minimum cluster size m

I want to find an algorithm to reassign all points without violating the constraint, while guaranteed to decrease the objective.

I thought of using Graph to represent pair wise relationship between points. But I'm not sure how to do the reassignment since it exists the possibility of a big dense cycle, and I got lost in swapping multiple points between multiple clusters.

I also created a list of cluster pairs with possible swapping candidates, but still couldn't find the way to optimal the objective.

I hope I explained my situation. I'm new to algorithm, and not familiar with the jargon and rules. If any other information needed, please let me know.

I've done a lot of research, I've tried out the algorithm in this paper, but without success, since the sum of membership degree doesn't necessarily correlate to cluster size. Clustering with Size Constraints

I've also read other similar posts on SO, but didn't find a detail algorithm I could implement.

I've tried to construct a weighted directed graph, with vertex representing clusters and edges from A to B represents points in cluster A willing to relocate to cluster B. and weight to be the sum of points

But with my data, it turns out all the nodes to be in a huge cycle with very dense edges. Because of my limited experience, I still could not figure out how to reassign among so many clusters. Any suggestions are appreciated!

Something like this.
enter image description here

like image 789
qshng Avatar asked May 07 '15 22:05

qshng


People also ask

What is minimum cluster size?

Basically, the minimum cluster size is estimated using a heuristic that takes in consideration the functional relation between the number of clusters and the clusters with at least a given number of points.

What is the minimum number of variables required for clustering?

At least a single variable is required to perform clustering analysis. Clustering analysis with a single variable can be visualized with the help of a histogram.

Which clustering algorithm is best for large datasets?

CLARA (clustering large applications.) It is a sample-based method that randomly selects a small subset of data points instead of considering the whole observations, which means that it works well on a large dataset.


2 Answers

To get a minimal (unfortunately not minimum) solution:

  1. First, greedily recluster any points that you can without violating the minimum size constraint.
  2. Next, build a directed multigraph as follows:
    1. Every cluster becomes a node.
    2. An edge (A,B) is added for every point in A that is closer to the center of B (so that there are potentially multiple edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.
  3. Looking for cycles in this graph will allow you to find moves (where a move consists of moving every vertex in the cycle).
  4. Pick the cycle with the highest total weight and recluster the nodes corresponding to the edges.
  5. Repeat steps 1-4 until there are no more cycles.

Creating the graph will have a complexity of O(kn), where you have k and n total points, and can create the same number of multiedges. Tarjan's algorithm will have complexity of O(k2), assuming that you skip multiedges to the same destination in the DFS. Every time you eliminate a cycle, you decrease the global distance by some amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O(k4m2). That is quite extravagant; I'm sure there could be heuristics (and probably more detailed analysis) to improve the lower bound.

like image 98
Kittsil Avatar answered Oct 14 '22 22:10

Kittsil


Try this: pip install k-means-constrained and then

from k_means_constrained import KMeansConstrained
KMeansConstrained(n_clusters=8, size_min=None, size_max=None, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=False, random_state=None, copy_x=True, n_jobs=1)

sources:

https://pypi.org/project/k-means-constrained/

https://joshlk.github.io/k-means-constrained/

like image 22
farzaneh najafi Avatar answered Oct 14 '22 23:10

farzaneh najafi