Output the 50 samples closest to each cluster center using scikit-learn's k-means

I have fitted a k-means model on 5000+ samples using the Python scikit-learn library. For each cluster center, I want the 50 samples closest to it as output. How do I do this?

Nipun Alahakoon asked Nov 07 '14 06:11


2 Answers

One correction to @snarly's answer, concerning ind = np.argsort(d)[::-1][:50].

After d = km.transform(X)[:, j], the elements of d are distances to centroid j, not similarities, so smaller values mean closer samples.

To get the indices of the 50 closest samples, drop the '[::-1]' reversal, i.e.,

ind = np.argsort(d)[:50]

(np.argsort returns indices in ascending order of distance, so the first 50 are the closest.)
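A quick way to check the distance-versus-similarity point is a toy array. This is only a sketch: the values in d are made up for illustration rather than taken from a fitted model, and k = 2 stands in for the 50 in the question.

```python
import numpy as np

# Hypothetical distances from five samples to one centroid (made-up values).
d = np.array([4.0, 0.5, 2.0, 3.0, 1.0])

k = 2  # stand-in for the 50 in the question

# argsort orders indices by ascending distance,
# so the first k indices belong to the k closest samples.
closest = np.argsort(d)[:k]

# Reversing the order first picks the k farthest samples instead.
farthest = np.argsort(d)[::-1][:k]

print(closest, farthest)
```

Here closest points at the samples with distances 0.5 and 1.0, while farthest points at the ones with distances 4.0 and 3.0.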

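Also, where the reversed order really is wanted (for example with similarity scores, or to get the 50 farthest samples), a shorter way of writing

ind = np.argsort(d)[::-1][:50]

is

ind = np.argsort(d)[:-51:-1]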
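To see that the two slicing forms agree, and that both select the farthest samples when d holds distances, here is a small sketch. The random array d is an assumption standing in for real k-means distances.

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.random(200)  # hypothetical distances, not real k-means output

order = np.argsort(d)  # indices in ascending order of distance

long_form = order[::-1][:50]  # reverse, then take the first 50
short_form = order[:-51:-1]   # walk backwards over the last 50 entries

# Both forms yield the same indices in the same order.
same = np.array_equal(long_form, short_form)

# For distances, the nearest 50 are simply the head of the ascending order.
nearest = order[:50]

print(same, d[nearest].max() <= d[short_form].min())
```

Both slices walk the sorted order from the far end, so with distances they pick out the 50 largest values, not the 50 smallest.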

JUNPA answered Oct 14 '22 13:10


If km is the k-means model, the distance to the j'th centroid for each point in an array X is

d = km.transform(X)[:, j]

This gives an array of len(X) distances. Since np.argsort orders indices from smallest to largest distance, the indices of the 50 points closest to centroid j are

ind = np.argsort(d)[:50]

so the 50 points closest to centroid j are

X[ind]

(or use np.argpartition, available since NumPy 1.8, which is faster because it avoids fully sorting d).
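Putting the pieces together for every cluster, a minimal end-to-end sketch might look like the following. The synthetic X, the choice of 3 clusters, and the names n_closest and closest are assumptions for illustration; substitute your own data and fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the 5000+ samples in the question (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))

# Hypothetical model settings; use your own fitted model instead.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

n_closest = 50
dist = km.transform(X)  # shape (n_samples, n_clusters): distance to each centroid

closest = {}
for j in range(km.n_clusters):
    d = dist[:, j]
    # argpartition places the n_closest smallest distances in the first
    # n_closest slots (in no particular order) without a full sort.
    idx = np.argpartition(d, n_closest)[:n_closest]
    # Sort only those candidates so the result runs from nearest to farthest.
    idx = idx[np.argsort(d[idx])]
    closest[j] = X[idx]

print({j: pts.shape for j, pts in closest.items()})
```

Note that this ranks all samples by distance to centroid j, whichever cluster they were assigned to. If you only want members of cluster j, one option is to filter on km.labels_ == j before selecting.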

Fred Foo answered Oct 14 '22 12:10