
Predicting Values with k-Means Clustering Algorithm

I'm messing around with machine learning, and I've written a k-means implementation in Python. It takes a set of two-dimensional data points and organises them into clusters. Each data point also has a class value of either 0 or 1.

What confuses me about the algorithm is how I can then use it to predict the class of points in another two-dimensional data set where the class is unknown rather than 0 or 1. For each cluster, should I average the class values of the points within it and round to either 0 or 1, so that an unknown point closest to that cluster takes on the averaged value? Or is there a smarter method?

Cheers!

DizzyDoo asked Nov 19 '11 10:11



1 Answer

To assign a new data point to one of a set of clusters created by k-means, you just find the centroid nearest to that point.

In other words, these are the same steps you used for the iterative assignment of each point in your original data set to one of k clusters. The only difference here is that the centroids you are using for this computation are the final set, i.e., the values for the centroids at the last iteration.

Here's one implementation in Python (w/ NumPy):

>>> import numpy as NP
>>> # just made up values--based on your spec (2D data + 2 clusters)
>>> centroids = NP.array([[54, 85], [99, 78]])
>>> centroids
      array([[54, 85],
             [99, 78]])

>>> # a new data point within the problem domain:
>>> new_data = NP.array([67, 78])

>>> # to assign a new data point to a cluster ID,
>>> # find its closest centroid:
>>> diff = centroids - new_data  # NumPy broadcasting, shape (2, 2) - (2,)
>>> diff
      array([[-13,   7],
             [ 32,   0]])

>>> dist = NP.sqrt(NP.sum(diff**2, axis=-1))  # Euclidean distance per centroid
>>> dist
      array([14.76482306, 32.        ])

>>> closest_centroid = centroids[NP.argmin(dist)]
>>> closest_centroid
      array([54, 85])
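
To carry this through to the 0/1 class values from the question, one option is the approach the question itself proposes: give each cluster the majority class of its training points, then let a new point inherit the label of its nearest centroid. The following is only a rough sketch of that idea, not a definitive method. The helper names `nearest_centroid` and `cluster_labels` and the training data are made up for illustration, and the tie-breaking and empty-cluster rules are arbitrary assumptions.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Return, for each row of `points`, the index of its closest centroid."""
    # Broadcasting gives an array of shape (n_points, n_centroids, n_dims).
    diff = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distances
    return dist.argmin(axis=1)


def cluster_labels(train_points, train_classes, centroids):
    """Return the majority class (0 or 1) of the training points in each cluster."""
    assigned = nearest_centroid(train_points, centroids)
    labels = np.zeros(len(centroids), dtype=int)
    for k in range(len(centroids)):
        members = train_classes[assigned == k]
        # Assumption: an empty cluster stays at class 0, and an exact
        # 50/50 split is rounded up to class 1.
        if members.size:
            labels[k] = int(members.mean() >= 0.5)
    return labels


# Made-up example data: final centroids plus a few labelled training points.
centroids = np.array([[54, 85], [99, 78]])
train_points = np.array([[50, 80], [58, 90], [55, 83], [100, 75], [95, 80]])
train_classes = np.array([0, 0, 1, 1, 1])

labels = cluster_labels(train_points, train_classes, centroids)

# Unlabelled points take the label of their nearest cluster.
new_points = np.array([[67, 78], [97, 70]])
predicted = labels[nearest_centroid(new_points, centroids)]
print(labels, predicted)
```

If a cluster turns out to be a near-even mix of 0s and 1s, its majority label carries little information, so it may be worth checking the class balance per cluster before trusting the predictions.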
doug answered Sep 20 '22 01:09