Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Online k-means clustering

Is there a online version of the k-Means clustering algorithm?

By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when used in real time.

I have wrote one my self with good results, but I would really prefer to have something "standardized" to refer to, since it is to be used in my master thesis.

Also, does anyone have advice for other online clustering algorithms? (lmgtfy failed ;))

like image 931
Theodor Avatar asked Sep 13 '10 07:09

Theodor


People also ask

Do k-means clustering online?

Yes there is. Google failed to find it because it's more commonly known as "sequential k-means". The beautiful thing about it is that you only need to remember the mean of each cluster and the count of the number of data points assigned to the cluster.

What is online clustering?

In computer science, data stream clustering is defined as the clustering of data that arrive continuously such as telephone records, multimedia data, financial transactions etc.

How do you do k-means clustering in Excel?

Step 1: Choose the number of clusters k. Step 2: Make an initial assignment of the data elements to the k clusters. Step 3: For each cluster select its centroid. Step 4: Based on centroids make a new assignment of data elements to the k clusters.

How do you visualize K mean?

The k-means algorithm captures the insight that each point in a cluster should be near to the center of that cluster. It works like this: first we choose k, the number of clusters we want to find in the data. Then, the centers of those k clusters, called centroids, are initialized in some fashion, (discussed later).


1 Answers

Yes there is. Google failed to find it because it's more commonly known as "sequential k-means".

You can find two pseudo-code implementations of sequential K-means in this section of some Princeton CS class notes by Richard Duda. I've reproduced one of the two implementations below:

Make initial guesses for the means m1, m2, ..., mk Set the counts n1, n2, ..., nk to zero Until interrupted     Acquire the next example, x     If mi is closest to x         Increment ni         Replace mi by mi + (1/ni)*( x - mi)     end_if end_until 

The beautiful thing about it is that you only need to remember the mean of each cluster and the count of the number of data points assigned to the cluster. Once you update those two variables, you can throw away the data point.

I'm not sure where you would be able to find a citation for it. I would start looking in Duda's classic text Pattern Classification and Scene Analysis or the newer edition Pattern Classification. If it's not there, you could try Chris Bishop's newest book or Daphne Koller and Nir Friedman's recent text.

like image 111
qdjm Avatar answered Sep 29 '22 16:09

qdjm