How do I recognize an interior node of a ball tree whose points all belong to one cluster when running the k-means algorithm?

I am reading the book Data Mining: Practical Machine Learning Tools and Techniques, Third Edition. Section 4.8 on clustering discusses how to use k-d trees or ball trees to improve the performance of the k-means algorithm.

After building the ball tree over all the data points, the algorithm examines the leaf nodes to determine which of the pre-chosen cluster centres the points in each leaf are closest to. The book says that sometimes the region represented by a higher interior node falls entirely within the domain of a single cluster centre. In that case we needn't traverse its child nodes, and all the data points in it can be processed in one go.

My question is: when implementing the data structure and the algorithm, how can we decide whether the region represented by an interior node falls entirely within the domain of a single cluster centre?

In a two- or three-dimensional space this is not difficult: we can check whether any of the perpendicular bisectors between pairs of cluster centres crosses the region represented by the interior node.

But how can we recognize this in higher-dimensional spaces? Is there a general method?

asked Nov 04 '12 by keepItSimple



1 Answer

You need to consider maximum and minimum distances.

If the maximum distance from a spatial container (say, a sphere of radius r) to one mean is smaller than the minimum distance from the container to every other mean, then every object inside the container must belong to that mean. Because if

maxdist(mean_i, container) < min over all j != i of mindist(mean_j, container)

then in particular for any object in the container

dist(mean_i, obj_in_container) < min over all j != i of dist(mean_j, obj_in_container)

That is, the object must belong to mean_i.
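
As a minimal sketch (my own illustration, not from the original answer), the pruning test could look like this, assuming mindist and maxdist functions are supplied for the container type (for example, the ones sketched below):

    def owning_mean(container, means, mindist, maxdist):
        """Return the index of the mean that provably owns every point in
        `container`, or None if no single mean can be proven to own it."""
        # Candidate: the mean with the smallest worst-case distance.
        best = min(range(len(means)), key=lambda i: maxdist(means[i], container))
        worst_case = maxdist(means[best], container)
        for j in range(len(means)):
            # If some other mean might be at least as close to a point in
            # the container, we cannot prune here.
            if j != best and mindist(means[j], container) <= worst_case:
                return None
        return best

If this returns an index, every point under the node can be assigned to that mean in one go (e.g. by adding the node's precomputed vector sum and point count to the mean's accumulators); otherwise the search descends into the node's children.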

Minimum and maximum distances for spheres and rectangles can be computed trivially in arbitrary dimensions. However, in higher dimensions mindist and maxdist become quite similar, so the condition will rarely hold. It also makes a huge difference whether your tree is well structured (i.e. small, tight containers) or badly structured (overlapping containers).
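
For concreteness, here is a sketch of these distances in arbitrary dimension (again my own illustration, assuming NumPy arrays, with a sphere given as a (centre, radius) pair and a rectangle as a (lower, upper) pair of corners):

    import numpy as np

    def mindist_ball(p, ball):
        c, r = ball
        # Distance from p to the nearest point of the ball (zero if inside).
        return max(np.linalg.norm(p - c) - r, 0.0)

    def maxdist_ball(p, ball):
        c, r = ball
        # Distance from p to the farthest point of the ball.
        return np.linalg.norm(p - c) + r

    def mindist_box(p, box):
        lo, hi = box
        # Per coordinate: how far p lies outside the box (zero if inside).
        gap = np.maximum(lo - p, 0.0) + np.maximum(p - hi, 0.0)
        return np.linalg.norm(gap)

    def maxdist_box(p, box):
        lo, hi = box
        # Per coordinate: the farther of the two opposite faces.
        far = np.maximum(np.abs(p - lo), np.abs(p - hi))
        return np.linalg.norm(far)

These plug directly into the owning_mean sketch above, e.g. owning_mean(ball, means, mindist_ball, maxdist_ball).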

k-d-trees are nice for in-memory, read-only operations, but they perform quite badly under insertions; R*-trees are a lot better there. The improved split strategy of R*-trees also pays off, because it generates more compact, square-like boxes than the other split strategies.

answered Jan 02 '23 by Has QUIT--Anony-Mousse