How to Find Documents That Are in the Same Cluster with KMeans

I have clustered various articles with scikit-learn. Below are the top 15 words in each cluster:

Cluster 0: whales islands seaworld hurricane whale odile storm tropical kph mph pacific mexico orca coast cabos
Cluster 1: ebola outbreak vaccine africa usaid foundation virus cdc gates disease health vaccines experimental centers obama
Cluster 2: jones bobo sanford children carolina mississippi alabama lexington bodies crumpton mccarty county hyder tennessee sheriff
Cluster 3: isis obama iraq syria president isil airstrikes islamic li strategy terror military war threat al
Cluster 4: yosemite wildfire park evacuation dome firefighters blaze hikers cobb helicopter backcountry trails homes california evacuate

I create the "bag of words" matrix like so:

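from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

hasher = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english',
                         use_idf=True)
vectorizer = make_pipeline(hasher, TfidfTransformer())
# document_text_list is a list holding the full text of each article
X_train_tfidf = vectorizer.fit_transform(document_text_list)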

And then run KMeans like so:

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                verbose=0, n_clusters=25)
km.fit(X_train_tfidf)

I am printing out the clusters like so:

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = hasher.get_feature_names()
for i in range(25):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind], end='')
    print()

However, I would like to know how to figure out which documents belong to the same cluster and, ideally, each document's distance to its cluster's centroid.

I know that each row of the generated matrix (X_train_tfidf) corresponds to a document, but there is no obvious way to get back this information after performing the KMeans algorithm. How would I go about doing this with scikit-learn?

X_train_tfidf looks like:

X_train_tfidf:   (0, 4661)  0.0405014425985
  (0, 19271)    0.0914545222775
  (0, 20393)    0.287636818634
  (0, 56027)    0.116893929188
  (0, 30872)    0.137815327338
  (0, 35256)    0.0343461345507
  (0, 31291)    0.209804679792
  (0, 66008)    0.0643776635222
  (0, 3806) 0.0967713285061
  (0, 66338)    0.0532881852791
  (0, 65023)    0.0702918299573
  (0, 41785)    0.197672720592
  (0, 29774)    0.120772893833
  (0, 61409)    0.0268609667042
  (0, 55527)    0.134102682463
  (0, 40011)    0.0582437010271
  (0, 19667)    0.0234843097048
  (0, 51667)    0.128270976476
  (0, 52791)    0.57198926651
  (0, 15014)    0.149195054799
  (0, 18805)    0.0277497826525
  (0, 35939)    0.170775938672
  (0, 5808) 0.0473913910636
  (0, 24922)    0.0126531527875
  (0, 10346)    0.0200098997901
  : :
  (23945, 56927)    0.0595132327966
  (23945, 23259)    0.0100977769025
  (23945, 12515)    0.0482102583442
  (23945, 49709)    0.210139450446
  (23945, 28742)    0.0190221880312
  (23945, 16628)    0.137692798005
  (23945, 53424)    0.157029848335
  (23945, 30647)    0.104485375827
  (23945, 57512)    0.0569754813269
  (23945, 39389)    0.0158180459761
  (23945, 26093)    0.0153713768922
  (23945, 9787) 0.0963777149738
  (23945, 23260)    0.158336452835
  (23945, 50595)    0.0527243936945
  (23945, 42447)    0.0527515904547
  (23945, 2829) 0.0351677269698
  (23945, 2832) 0.0175929392039
  (23945, 52079)    0.0849796887889
  (23945, 13523)    0.0878730969786
  (23945, 57849)    0.133869666381
  (23945, 25064)    0.128424780903
  (23945, 31129)    0.0919760384953
  (23945, 65601)    0.0388718258746
  (23945, 1428) 0.391477289626
  (23945, 2152) 0.655211469073
  X_train_tfidf shape: (23946, 67816)

In Response to tttthomasssss's Answer:

When I try to run the following:

X_cluster_0 = X_train_tfidf[cluster_0]

I get the error:

File "cluster.py", line 52, in main
    X_cluster_0 = X_train_tfidf[cluster_0]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/csr.py", line 226, in __getitem__
    col = key[1]
IndexError: tuple index out of range

Looking at the structure of cluster_0:

(array([  858,  2012,  2256,  2762,  2920,  3770,  6052,  6174,  8296,
9494,  9966, 10085, 11914, 12117, 12633, 12727, 12993, 13527,
13754, 14186, 14669, 14713, 14973, 15071, 15157, 15208, 15926,
16300, 16301, 17138, 17556, 17775, 18236, 19057, 20106, 21014, 21080]),)

It's a tuple structure that has content in the 0th position so I changed the line to the following:

X_cluster_0 = X_train_tfidf[cluster_0[0]]

I pull the "documents" from a database, so I can easily recover a document from its index (by iterating over the array above until I find the corresponding document, assuming of course that scikit-learn doesn't alter the ordering of documents in the matrix). What I don't understand is what exactly X_cluster_0 represents. It has the following structure:

  X_cluster_0:   (0, 42726) 0.741747456202
  (0, 13535)    0.115880661286
  (0, 17447)    0.117608794277
  (0, 44849)    0.414829246262
  (0, 14574)    0.10214258736
  (0, 17317)    0.0634383214735
  (0, 17935)    0.0591234431875
  : :
  (17, 33867)   0.0174155914371
  (17, 48916)   0.0227046046275
  (17, 59132)   0.0168864861723
  (17, 40860)   0.0485813219503
  (17, 63725)   0.0271415763987
  (18, 45019)   0.490135684209
  (18, 36168)   0.14595160766
  (18, 52304)   0.139590524213
  (18, 63586)   0.16501953796
  (18, 28709)   0.15075416279
  (18, 11495)   0.0926490431993
  (18, 40860)   0.124236878928

Calculating Distance to Centroid

Currently running the suggested code (distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])) results in the following error:

File "cluster.py", line 68, in main
    distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/spatial/distance.py", line 211, in euclidean
    dist = norm(u - v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 197, in __sub__
    raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported

Here is what km.cluster_centers looks like:

km.cluster_centers: [  9.47080802e-05   2.53907413e-03   0.00000000e+00 ...,   0.00000000e+00
   0.00000000e+00   0.00000000e+00]

I guess the problem I am having now is how to extract the ith item of a matrix (assuming traversal of the matrix from left to right). Any level of index nesting I specify makes no difference (i.e. X_cluster_0[0], X_cluster_0[0][0], and X_cluster_0[0][0][0] all give me the same printed out matrix structure depicted above).

Stunner asked Sep 14 '14 01:09


1 Answer

You can use the fit_predict() function to perform the clustering and obtain the cluster index of every document (after a plain fit(), the same labels are available as km.labels_).

Obtaining the cluster index of every document

You can try the following:

import numpy as np
import sklearn.cluster

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                            verbose=0, n_clusters=25)
clusters = km.fit_predict(X_train_tfidf)

# The input has shape (m, n); clusters has shape (m,) and holds the
# cluster index of every document
print(X_train_tfidf.shape)
print(clusters.shape)

# Example: row indices of all documents in cluster 0.
# np.where returns a tuple of arrays, so take its first element.
cluster_0 = np.where(clusters == 0)[0]

# cluster_0 now holds the indices of the documents in this cluster;
# to get their rows of the tf-idf matrix:
X_cluster_0 = X_train_tfidf[cluster_0]
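
If you want the members of every cluster rather than just cluster 0, the same idea can be applied per label. The following is only a sketch under assumptions: the small clusters array and document_text_list below are made-up stand-ins for the output of km.fit_predict(X_train_tfidf) and for your real corpus, and docs_by_cluster is a hypothetical name.

```python
import numpy as np

# Hypothetical stand-ins: `clusters` plays the role of
# km.fit_predict(X_train_tfidf), `document_text_list` the role of the corpus.
clusters = np.array([0, 2, 1, 0, 2, 2])
document_text_list = ["doc a", "doc b", "doc c", "doc d", "doc e", "doc f"]

# Map each cluster label to the row indices (= document indices) assigned to it.
docs_by_cluster = {
    int(label): np.flatnonzero(clusters == label)
    for label in np.unique(clusters)
}

# Row i of the tf-idf matrix was built from document_text_list[i],
# so the same indices select the original texts.
cluster_0_docs = [document_text_list[i] for i in docs_by_cluster[0]]
print(cluster_0_docs)
```

This relies on the row order of the matrix matching the order of the corpus, which holds as long as the list is not reordered between vectorizing and clustering.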

Finding the distance of each document to each centroid

You can get the centroids with centroids = km.cluster_centers_, which in your case should have shape 25 (number of clusters) x n (number of features). To calculate, e.g., the Euclidean distance of a document to a centroid you can use SciPy (see the scipy.spatial.distance documentation for the available distance metrics):

# Example: distance from one document to one cluster centroid
# (scipy expects dense 1-D vectors here)
from scipy.spatial.distance import euclidean

distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
print(distance)

Update: Distances with Sparse & Dense matrices

The distance metrics in scipy.spatial.distance require dense input, so if X_cluster_0 is a sparse matrix you could either convert the row to a dense array:

d = euclidean(X_cluster_0[0].toarray().ravel(), km.cluster_centers_[0])  # .toarray() densifies the row
print(d)

Alternatively you could use scikit-learn's euclidean_distances() function, which also works with sparse matrices:

from sklearn.metrics.pairwise import euclidean_distances

# Equivalent to the scipy example above; note that euclidean_distances
# expects 2-D inputs and returns a matrix, not a scalar
D = euclidean_distances(X_cluster_0.getrow(0), km.cluster_centers_[0].reshape(1, -1))
print(D)
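
As an aside on why a sparse row does not have to be densified: the squared distance can be expanded as ||x||^2 - 2 x.c + ||c||^2, which only needs products that sparse matrices support (scikit-learn's documentation describes euclidean_distances in terms of this expansion). The snippet below is a minimal sketch with made-up toy values for x and c, not your actual data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins (assumptions): one sparse document row and one dense centroid.
x = csr_matrix(np.array([[0.0, 3.0, 0.0, 4.0]]))
c = np.array([1.0, 1.0, 0.0, 0.0])

# ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; x is never converted to dense.
x_sq = x.multiply(x).sum()   # squared norm of the sparse row
cross = x.dot(c)[0]          # sparse-dense dot product
dist = np.sqrt(x_sq - 2.0 * cross + c.dot(c))

# Cross-check against the dense computation.
dense_dist = np.linalg.norm(x.toarray().ravel() - c)
print(dist, dense_dist)
```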

Note that with the scikit-learn function you can also calculate the whole distance matrix at once:

D = euclidean_distances(X_cluster_0, km.cluster_centers_)
print(D)
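
Since the question asks for each document's distance to its own centroid, one way to get that from a full distance matrix is to pick, for every row, the column of its assigned cluster. This is a sketch on toy dense data: X, centroids and labels are made-up stand-ins for X_train_tfidf, km.cluster_centers_ and the clusters array, and D is computed with plain NumPy here only to keep the example self-contained. On the real data D would come from euclidean_distances(X_train_tfidf, km.cluster_centers_); km.transform(X_train_tfidf) should give the same matrix.

```python
import numpy as np

# Toy dense stand-ins (assumptions): 4 documents, 2 features, 2 centroids.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 7.0]])
centroids = np.array([[0.0, 0.5], [4.0, 5.5]])
labels = np.array([0, 0, 1, 1])  # plays the role of `clusters`

# Distance matrix of shape (n_documents, n_clusters).
D = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# For each document, select the distance to the centroid it was assigned to.
own_distance = D[np.arange(len(labels)), labels]
print(own_distance)

# Documents of cluster 1, ordered from closest to farthest from its centroid.
members = np.flatnonzero(labels == 1)
ranked = members[np.argsort(own_distance[members], kind="stable")]
print(ranked)
```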

Update: Structure and Type of X_cluster_0:

X_cluster_0 and X_train_tfidf are both sparse matrices (see the docs for scipy.sparse.csr_matrix).

The interpretation of a dump such as

(0, 13535)    0.115880661286
(0, 17447)    0.117608794277
(0, 44849)    0.414829246262
(0, 14574)    0.10214258736
.             .
.             .

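would be as follows: (0, 13535) refers to row number 0 and column number 13535 of the matrix, i.e. document 0 and feature 13535 of your bag of words matrix. The following floating point number 0.115880661286 is the tf-idf score of that feature in the given document. Note that the rows of X_cluster_0 are numbered from 0 within the cluster, so row i of X_cluster_0 corresponds to the i-th index returned by np.where(clusters == 0), not to row i of X_train_tfidf.

To find out the exact word you could try hasher.get_feature_names()[13535] (get_feature_names_out() in scikit-learn 1.0 and later; check the length of the returned list first to see how many features you have).

If your corpus variable document_text_list is a list with one entry per document, then the document behind row 0 of X_train_tfidf is simply document_text_list[0].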
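
To make the dump format concrete, and to show how to pull a single row out of a sparse matrix, here is a small sketch. The 3 x 5 matrix and the terms list are made-up stand-ins for X_train_tfidf and the vectorizer's vocabulary; the attributes indices and data are standard on scipy CSR matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins (assumptions): a 3-document, 5-term tf-idf matrix and its vocabulary.
terms = ["whale", "storm", "ebola", "virus", "park"]
X = csr_matrix(np.array([
    [0.8, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.7, 0.0],
    [0.0, 0.3, 0.0, 0.0, 0.9],
]))

# Indexing a row of a sparse matrix yields another (1 x n_features) sparse
# matrix, which is why repeated [0] indexing keeps printing the same structure.
row = X.getrow(2)

# row.indices holds the column (feature) ids, row.data the matching scores.
weights = {terms[j]: float(v) for j, v in zip(row.indices, row.data)}
print(weights)

# A plain 1-D dense vector, if one is needed.
dense_row = row.toarray().ravel()
print(dense_row.shape)
```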

tttthomasssss answered Sep 19 '22 11:09