
Can the computed centroid values be plotted over the existing plot of the data?

EDIT: Ok, if the data are two dimensional as follows:

x = [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5]
y = [8,7,5,4,3,7,8,3,2,1,9,11,16,18,19]

Then how do I calculate the k-means centroids (3 values) and plot them?

Can the computed centroid values be plotted over the existing plot of the data here? I want to make a plot similar to the one in the following link


However, I could not work out how to do it. Any help would be highly appreciated.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq

data = np.random.rand(100)  # 100 one-dimensional samples

plt.plot(data, 'ob')

# note: the second value returned by kmeans is the mean distortion, a single float
centroids, variances = kmeans(data, 3, 10)
indices, distances = vq(data, centroids)

print(centroids)
# [ 0.82847854  0.49085422  0.18256191]

2964502 asked Nov 24 '13 17:11


1 Answer

A minor edit to answer your question about 2d:

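The original answer below still applies; just build the data as:

data = np.column_stack([x, y]).astype(float)  # cast to float; scipy's kmeans may reject integer input

Plotting the centroids works the same way as in the original answer below. If you also want to color each point by the cluster it was assigned to, you can use kmeans2, which also returns a cluster label for every observation: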

from scipy.cluster.vq import kmeans2

centroids, ks = kmeans2(data, 3, 10)

To plot, pick one color per cluster, then use the ks array of labels returned by kmeans2 to select each point's color from the three colors:

colors = ['r', 'g', 'b']
plt.scatter(*data.T, c=np.choose(ks, colors))
plt.scatter(*centroids.T, c=colors, marker='v')
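Putting the 2d pieces together, a minimal end-to-end sketch might look like the following. It uses the x and y lists from the question. The Agg backend, the name `labels` (instead of `ks`), and the list comprehension used in place of `np.choose` are choices made for this sketch, not part of the original answer. Since k-means starts from random initial centroids, the grouping can differ from run to run.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; remove to use an interactive backend
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.vq import kmeans2

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
y = [8, 7, 5, 4, 3, 7, 8, 3, 2, 1, 9, 11, 16, 18, 19]

# One row per observation, one column per coordinate; cast to float for scipy.
data = np.column_stack([x, y]).astype(float)

# Ask for 3 clusters and 10 iterations; labels[i] is the cluster index of data[i].
centroids, labels = kmeans2(data, 3, 10)

colors = ['r', 'g', 'b']
point_colors = [colors[k] for k in labels]  # one color string per data point

fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1], c=point_colors)  # points, colored by cluster
ax.scatter(centroids[:, 0], centroids[:, 1], c=colors, marker='v', s=120)  # centroids
# plt.show()  # uncomment when running with an interactive backend
```

Each centroid is drawn as a triangle in the same color as the points assigned to it, so the two scatter calls share one figure.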


original answer:

As @David points out, your data is one dimensional, so the centroid of each cluster is also one dimensional. Your plot looks 2d because when you run

plt.plot(data)

with 1d data, what the function actually does is plot:

plt.plot(range(len(data)), data)

To make this clear, see this example:

data = np.array([3, 2, 3, 4, 3], dtype=float)  # floats, since scipy's kmeans may reject integer input
centroids, variances = kmeans(data, 3, 10)


The centroids are then one dimensional as well, so they have no x location in that plot. You could draw them as horizontal lines, for example:

for c in centroids:
    plt.axhline(c)  # horizontal line at the centroid's value
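A fuller sketch of the one-dimensional case, under a few assumptions: the data is 100 random values as in the question, it is passed to kmeans as a single-column array so the feature dimension is explicit, and the centroids are drawn with `axhline`. The seed, the Agg backend, and the variable names are choices made for this sketch only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; remove to use an interactive backend
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(0)  # fixed seed for the data only
data = rng.random(100)          # 100 one-dimensional samples in [0, 1)

# kmeans returns the centroids and the mean distortion (a single float).
centroids, distortion = kmeans(data.reshape(-1, 1), 3, 10)
centroids = centroids.ravel()   # back to a flat array of centroid values

fig, ax = plt.subplots()
(points,) = ax.plot(data, 'ob')  # x defaults to 0, 1, ..., len(data) - 1
hlines = [ax.axhline(c, color='r') for c in centroids]  # one line per centroid
# plt.show()  # uncomment when running with an interactive backend
```

The x axis here is only the sample index, which is why a centroid can be shown as a level on the y axis but not as a point.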


If you want to find the centroids of the x-y pairs where x = range(len(data)) and y = data, then you must pass those pairs to the clustering algorithm, like so:

xydata = np.column_stack([range(len(data)), data]).astype(float)  # (index, value) pairs as floats
centroids, variances = kmeans(xydata, 3, 10)


But I doubt this is what you want. Probably, you want random x and y values, so try something like:

data = np.random.rand(100,2)
centroids, variances = kmeans(data, 3, 10)
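To draw those centroids over the data, one possible sketch combines kmeans with vq, as the question's code already does. The names `labels`, `distortion` and `palette` and the Agg backend are assumptions of this sketch. kmeans can return fewer than the requested number of centroids if a cluster ends up empty, so the palette is sliced to match.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; remove to use an interactive backend
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.vq import kmeans, vq

data = np.random.rand(100, 2)                # 100 random (x, y) points
centroids, distortion = kmeans(data, 3, 10)  # one row per centroid
labels, distances = vq(data, centroids)      # nearest-centroid index for every point

palette = ['r', 'g', 'b']
point_colors = [palette[k] for k in labels]  # one color string per data point

fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1], c=point_colors)
ax.scatter(centroids[:, 0], centroids[:, 1],
           c=palette[:len(centroids)], marker='v', s=120)
# plt.show()  # uncomment when running with an interactive backend
```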
askewchan answered Sep 21 '22 14:09
