
Using SciPy's kmeans2 function in Python

I found this example of using the kmeans2 algorithm in Python. I can't understand the following part:

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

# whiten them
z = whiten(z)

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

The points are zip(xy[:,0],xy[:,1]), so what is the third value z doing here?

Also, what is whitening?

Any explanation is appreciated. Thanks.

kamalbanga asked Nov 28 '13 17:11


1 Answer

First:

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

The weirdest thing about this is that it's equivalent to:

z = numpy.sin(0.8*xy[:, 1])

So I don't know why it's written that way. Maybe there's a typo?
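A quick numerical check of that equivalence, as a sketch. The xy array here is random stand-in data (an assumption on my part; the original example's xy is not shown in the question):

```python
import numpy as np

# Stand-in data: 30 random 2-D points (assumed, not the original xy).
rng = np.random.default_rng(0)
xy = rng.random((30, 2))

z_original = np.sin(xy[:, 1] - 0.2 * xy[:, 1])   # as written in the example
z_simplified = np.sin(0.8 * xy[:, 1])            # algebraically the same

# The two expressions agree up to floating-point rounding.
same = np.allclose(z_original, z_simplified)
print(same)
```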

Next,

# whiten them
z = whiten(z)

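Whitening simply rescales the data to unit variance, i.e. it divides the values by their standard deviation (for a 2-D array, each column is divided by its own standard deviation). Here is a demo: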

>>> import numpy as np
>>> from scipy.cluster import vq
>>> z = np.sin(.8*xy[:, 1])      # the original z
>>> zw = vq.whiten(z)            # save it under a different name
>>> zn = z / z.std()             # make another 'normalized' array
>>> [round(float(a.std()), 5) for a in (z, zw, zn)]  # standard deviations of the three arrays
[0.42645, 1.0, 1.0]
>>> np.allclose(zw, zn)          # whitened is the same as normalized
True
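For a 2-D array, whiten works column by column. A small sketch with made-up data (the array name features and its two scales are my own illustration, not part of the original example):

```python
import numpy as np
from scipy.cluster.vq import whiten

rng = np.random.default_rng(0)

# Two features on very different scales (made-up data for illustration).
features = np.column_stack([
    rng.normal(0, 1, 100),      # spread of about 1
    rng.normal(0, 1000, 100),   # spread of about 1000
])

whitened = whiten(features)
manual = features / features.std(axis=0)   # divide each column by its own std

print(features.std(axis=0))   # very different spreads
print(whitened.std(axis=0))   # each column now has unit standard deviation
matches = np.allclose(whitened, manual)
print(matches)
```

Rescaling every feature like this is the typical use before k-means, so that no single feature dominates the Euclidean distances. In the snippet from the question, only z gets this treatment.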

It's not obvious to me why it is whitened. Anyway, moving along:

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

Let's break that into two parts:

data = np.array(zip(xy[:, 0], xy[:, 1], z))

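which is a weird (and slow) way of writing the line below. (On Python 3 it would also need list(zip(...)), since zip returns an iterator there.)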

data = np.column_stack([xy, z])

In any case, you start with two arrays and merge them into one:

>>> xy.shape
(30, 2)
>>> z.shape
(30,)
>>> data.shape
(30, 3)
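To confirm that the two ways of building data give the same array, here is a sketch using random stand-in values for xy (assumed; the real xy comes from the example you found):

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.random((30, 2))          # stand-in for the original xy
z = np.sin(0.8 * xy[:, 1])

# zip-based construction; list() is needed on Python 3
data_zip = np.array(list(zip(xy[:, 0], xy[:, 1], z)))

# column_stack appends z as a third column
data_stack = np.column_stack([xy, z])

print(data_zip.shape, data_stack.shape)
identical = np.array_equal(data_zip, data_stack)
print(identical)
```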

Then it's data that is passed to the kmeans algorithm:

res, idx = vq.kmeans2(data, 3)

So now you can see that 30 points in 3-D space are passed to the algorithm, with z as the third coordinate of each point. The confusing part is only how that set of points was created.
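Putting the pieces together, a minimal end-to-end sketch. The xy values are random stand-ins and minit="++" is my choice of initialization (both assumptions; the original call uses kmeans2's default):

```python
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

rng = np.random.default_rng(0)
xy = rng.random((30, 2))               # stand-in for the original 2-D points
z = whiten(np.sin(0.8 * xy[:, 1]))     # third coordinate, scaled to unit variance
data = np.column_stack([xy, z])        # 30 observations, 3 features each

# res holds the 3 centroids (one row each, in 3-D);
# idx holds the cluster index assigned to each of the 30 points.
res, idx = kmeans2(data, 3, minit="++")

print(res.shape)
print(idx.shape)
```

Each row of res is a centroid in the same 3-D space as data, and idx[i] says which centroid point i was assigned to. The exact labels vary from run to run because the initialization is random.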

askewchan answered Sep 21 '22 16:09