Unsupervised clustering with unknown number of clusters

I have a large set of vectors in 3 dimensions. I need to cluster them by Euclidean distance such that any two vectors in the same cluster are less than a threshold "T" apart.

I do not know how many clusters exist. In the end, there may be individual vectors that are not part of any cluster, because their Euclidean distance to every other vector in the space is not less than "T".

Which existing algorithms or approaches should be used here?

400

asked Apr 13 '12 06:04

London guy

1 Answer

You can use hierarchical clustering. It is a rather basic approach, so lots of implementations are available. For example, it is included in Python's SciPy (scipy.cluster.hierarchy).

See for example the following script:

    import matplotlib.pyplot as plt
    import numpy
    import scipy.cluster.hierarchy as hcluster

    # generate 3 clusters of each around 100 points and one orphan point
    N = 100
    data = numpy.random.randn(3*N, 2)
    data[:N] += 5
    data[-N:] += 10
    data[-1:] -= 20

    # clustering
    thresh = 1.5
    clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

    # plotting
    plt.scatter(*numpy.transpose(data), c=clusters)
    plt.axis("equal")
    title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
    plt.title(title)
    plt.show()

This produces a result similar to the following image: [image: clusters]

The threshold passed as a parameter is the distance value that decides whether points or clusters are merged into one cluster. The distance metric can also be specified.

Note that there are various methods for computing the distance between clusters, e.g. the distance between their closest points, the distance between their furthest points, the distance between their cluster centers, and so on. Several of these are supported by SciPy's hierarchical clustering module (single/complete/average... linkage). Based on your post, I think you want complete linkage, since it bounds the distance between the furthest points of a cluster. Note that fclusterdata defaults to single linkage, so the script above would need method="complete" for that.
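To make the complete-linkage suggestion concrete, here is a minimal sketch that clusters 3-dimensional points and then checks the largest distance found inside any cluster. The helper names (cluster_within_threshold, max_cluster_diameter) and the random test data are made up for illustration. Also note that the cut guarantees distances of at most the threshold (<=), not strictly less than it.

```python
import numpy as np
import scipy.cluster.hierarchy as hcluster
from scipy.spatial.distance import pdist


def cluster_within_threshold(points, thresh):
    """Flat cluster labels from complete-linkage clustering cut at `thresh`."""
    return hcluster.fclusterdata(
        points, thresh, criterion="distance", metric="euclidean", method="complete"
    )


def max_cluster_diameter(points, labels):
    """Largest pairwise distance found inside any single cluster."""
    diameters = [
        pdist(points[labels == label]).max()
        for label in np.unique(labels)
        if np.count_nonzero(labels == label) > 1  # singletons have no pairs
    ]
    return max(diameters, default=0.0)


# Illustrative data only: 300 random points in 3 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(300, 3))
thresh = 1.5

labels = cluster_within_threshold(points, thresh)
print("number of clusters:", len(set(labels)))
print("largest within-cluster distance:", max_cluster_diameter(points, labels))
```

With complete linkage the printed within-cluster distance should never exceed the threshold; with the default single linkage it can, because clusters are then joined through chains of nearby points.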

Note that this approach also allows small (single point) clusters if they don't meet the similarity criterion of the other clusters, i.e. the distance threshold.
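As a small sketch of how such single-point clusters could be picked out afterwards, the following counts the size of each cluster and keeps the points whose cluster has size 1. The data and the variable names (sizes, is_orphan, orphans) are made up for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as hcluster

# Illustrative data: two tight groups plus one far-away point.
points = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0],
    [50.0, 50.0, 50.0],
])

labels = hcluster.fclusterdata(
    points, 1.0, criterion="distance", method="complete"
)

# Cluster labels start at 1, so index 0 of the bincount stays unused.
sizes = np.bincount(labels)

# A point is an orphan if it is the only member of its cluster.
is_orphan = sizes[labels] == 1
orphans = points[is_orphan]
print(orphans)
```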

There are other algorithms that scale better, which becomes relevant with lots of data points, since hierarchical clustering works from all pairwise distances. As other answers/comments suggest, you might also want to have a look at the DBSCAN algorithm:

https://en.wikipedia.org/wiki/DBSCAN
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
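A minimal DBSCAN sketch with scikit-learn, assuming made-up data and parameter values (eps=1.5, min_samples=2) chosen only for illustration. DBSCAN marks unclustered points with the label -1. Keep in mind that it links points by density, so two points in the same cluster can be further apart than eps; it does not enforce the pairwise threshold from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data: two dense blobs in 3 dimensions and one isolated point.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.2, size=(100, 3))
blob_b = rng.normal(loc=10.0, scale=0.2, size=(100, 3))
outlier = np.array([[50.0, 50.0, 50.0]])
points = np.vstack([blob_a, blob_b, outlier])

# eps is the neighborhood radius; min_samples counts the point itself.
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(points)

n_clusters = len(set(labels) - {-1})   # -1 is the noise label
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```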

For a nice overview of these and other clustering algorithms, also have a look at this demo page (of Python's scikit-learn library):

http://scikit-learn.org/stable/modules/clustering.html

Image copied from that page:

As you can see, each algorithm makes assumptions about the number and shape of the clusters that need to be taken into account, whether implicit assumptions imposed by the algorithm or explicit assumptions specified through its parameters.

143

answered Sep 22 '22 16:09

moooeeeep


Tags: algorithm, math, artificial-intelligence, machine-learning, cluster-analysis
