EDIT: OK, what if the data are two-dimensional, as follows:
x = [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5]
y = [8,7,5,4,3,7,8,3,2,1,9,11,16,18,19]
Then how do I calculate the k-means centroids (3 values) and plot them?
Can't the computed centroid values be plotted over the existing plot of the data? I want to make a plot similar to the one in the following link:
http://glowingpython.blogspot.jp/2012/04/k-means-clustering-with-scipy.html
However, I could not understand it. Any help would be highly appreciated.
import numpy as np, matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq
data = np.array(np.random.rand(100))
plt.plot(data, 'ob')
centroids, variances= kmeans(data,3,10)
indices, distances= vq(data,centroids)
print(centroids)
# example output: [ 0.82847854 0.49085422 0.18256191]
plt.show()
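You can still use the original answer below; just build the data like this (the cast to float is a precaution, since the SciPy k-means routines expect floating-point observations):
data = np.column_stack([x, y]).astype(float)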
Plotting the centroids works the same way as in the original answer below. If you want to color each point by the cluster it was assigned to, you can use kmeans2:
from scipy.cluster.vq import kmeans2
centroids, ks = kmeans2(data, 3, 10)
To plot, pick k colors, then use the ks array returned by kmeans2 to select each point's color from the three colors:
colors = ['r', 'g', 'b']
plt.scatter(*data.T, c=np.choose(ks, colors))
plt.scatter(*centroids.T, c=colors, marker='v')
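Putting the pieces of this edit together, a minimal end-to-end sketch might look like the following. It uses the x and y lists from the question. The cast to float, the choice of minit='points' (initial centroids drawn from the data), and the helper name plot_clusters are assumptions made for illustration, and the colors are picked by indexing a color array with the labels instead of np.choose.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2

x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
y = [8, 7, 5, 4, 3, 7, 8, 3, 2, 1, 9, 11, 16, 18, 19]

# One row per (x, y) observation; cast to float as a precaution (assumption).
data = np.column_stack([x, y]).astype(float)

# minit='points' starts from k randomly chosen data points, so the result
# can differ from run to run.
centroids, labels = kmeans2(data, 3, iter=10, minit='points')


def plot_clusters(data, labels, centroids, colors=('r', 'g', 'b')):
    """Scatter the points colored by cluster and overlay the centroids.

    Hypothetical helper, not part of SciPy or matplotlib.
    """
    colors = np.array(colors)
    fig, ax = plt.subplots()
    # labels holds a cluster index per point, so it can index the color array.
    ax.scatter(data[:, 0], data[:, 1], c=colors[labels])
    ax.scatter(centroids[:, 0], centroids[:, 1], c=colors, marker='v',
               s=120, edgecolors='k')
    return ax


ax = plot_clusters(data, labels, centroids)
```

Call plt.show() afterwards to display the figure.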
As @David points out, your data is one-dimensional, so the centroid for each cluster will also be one-dimensional. Your plot only looks 2D because when you run
plt.plot(data)
with 1D data, what the function actually plots is:
plt.plot(range(len(data)), data)
To make this clear, see this example:
data = np.array([3., 2., 3., 4., 3.])  # floats, since kmeans expects floating-point data
centroids, variances= kmeans(data, 3, 10)
plt.plot(data)
The centroids are then one-dimensional, so they have no x location in that plot. You could plot them as horizontal lines, for example:
for c in centroids:
    plt.axhline(c)
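As a rough check of the two points above, the sketch below inspects the x-coordinates that matplotlib fills in for 1D data and then draws each centroid as a horizontal line. Reshaping the data to a single column before calling kmeans is a defensive assumption on my part, not something the answer requires, and the names codebook, distortion and implicit_x are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans

data = np.array([3.0, 2.0, 3.0, 4.0, 3.0])

fig, ax = plt.subplots()
(line,) = ax.plot(data, 'ob')  # no x given, so matplotlib supplies one

# The implicit x-coordinates are the indices 0 .. len(data) - 1.
implicit_x = line.get_xdata()

# Reshape to one column so each value is a single-feature observation
# (defensive assumption), then flatten the centroids back to 1D.
codebook, distortion = kmeans(data.reshape(-1, 1), 3, 10)
centroids = codebook.ravel()

# Each centroid is only a y-value, so draw it as a horizontal line.
for c in centroids:
    ax.axhline(c, color='r', linestyle='--')
```

Note that kmeans can return fewer than 3 centroids if a cluster ends up empty, which is quite possible with only five points.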
If you want to find the centroids of the x-y pairs where x = range(len(data)) and y = data, then you must pass those pairs to the clustering algorithm, like so:
xydata = np.column_stack([range(len(data)), data]).astype(float)
centroids, variances= kmeans(xydata, 3, 10)
But I doubt this is what you want. You probably want random x and y values, so try something like:
data = np.random.rand(100,2)
centroids, variances = kmeans(data, 3, 10)
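To get a plot like the one in the linked post, the random 2D case can be extended roughly as follows. This is only a sketch: the color list, marker style and the variable names labels and distortion are my own choices, and vq is used to assign each point to its nearest centroid.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq

data = np.random.rand(100, 2)  # 100 random (x, y) points

# kmeans returns the centroids (the codebook) and a single distortion value.
centroids, distortion = kmeans(data, 3, 10)

# vq assigns each point to its nearest centroid and reports the distance.
labels, distances = vq(data, centroids)

fig, ax = plt.subplots()
colors = np.array(['r', 'g', 'b'])
# Color each point by its cluster index, then overlay the centroids.
ax.scatter(data[:, 0], data[:, 1], c=colors[labels])
ax.scatter(centroids[:, 0], centroids[:, 1], c='k', marker='v', s=120)
```

Call plt.show() afterwards to display the figure. Because the data and the initial centroids are random, the clusters will differ on every run.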