I have clustered various articles together with the Scikit-learn framework. Below are the top 15 words in each cluster:
Cluster 0: whales islands seaworld hurricane whale odile storm tropical kph mph pacific mexico orca coast cabos
Cluster 1: ebola outbreak vaccine africa usaid foundation virus cdc gates disease health vaccines experimental centers obama
Cluster 2: jones bobo sanford children carolina mississippi alabama lexington bodies crumpton mccarty county hyder tennessee sheriff
Cluster 3: isis obama iraq syria president isil airstrikes islamic li strategy terror military war threat al
Cluster 4: yosemite wildfire park evacuation dome firefighters blaze hikers cobb helicopter backcountry trails homes california evacuate
I create the "bag of words" matrix like so:
hasher = TfidfVectorizer(max_df=0.5,
                         min_df=2, stop_words='english',
                         use_idf=1)
vectorizer = make_pipeline(hasher, TfidfTransformer())
# document_text_list is a list of all text in a given article
X_train_tfidf = vectorizer.fit_transform(document_text_list)
And then run KMeans like so:
km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                            verbose=0, n_clusters=25)
km.fit(X_train_tfidf)
I am printing out the clusters like so:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = hasher.get_feature_names()
for i in range(25):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind], end='')
    print()
However, I would like to know how to figure out which documents belong to the same cluster and, ideally, their respective distances to the centroid of that cluster.
I know that each row of the generated matrix (X_train_tfidf) corresponds to a document, but there is no obvious way to get this information back after running the KMeans algorithm. How would I go about doing this with scikit-learn?
X_train_tfidf looks like:
X_train_tfidf: (0, 4661) 0.0405014425985
(0, 19271) 0.0914545222775
(0, 20393) 0.287636818634
(0, 56027) 0.116893929188
(0, 30872) 0.137815327338
(0, 35256) 0.0343461345507
(0, 31291) 0.209804679792
(0, 66008) 0.0643776635222
(0, 3806) 0.0967713285061
(0, 66338) 0.0532881852791
(0, 65023) 0.0702918299573
(0, 41785) 0.197672720592
(0, 29774) 0.120772893833
(0, 61409) 0.0268609667042
(0, 55527) 0.134102682463
(0, 40011) 0.0582437010271
(0, 19667) 0.0234843097048
(0, 51667) 0.128270976476
(0, 52791) 0.57198926651
(0, 15014) 0.149195054799
(0, 18805) 0.0277497826525
(0, 35939) 0.170775938672
(0, 5808) 0.0473913910636
(0, 24922) 0.0126531527875
(0, 10346) 0.0200098997901
: :
(23945, 56927) 0.0595132327966
(23945, 23259) 0.0100977769025
(23945, 12515) 0.0482102583442
(23945, 49709) 0.210139450446
(23945, 28742) 0.0190221880312
(23945, 16628) 0.137692798005
(23945, 53424) 0.157029848335
(23945, 30647) 0.104485375827
(23945, 57512) 0.0569754813269
(23945, 39389) 0.0158180459761
(23945, 26093) 0.0153713768922
(23945, 9787) 0.0963777149738
(23945, 23260) 0.158336452835
(23945, 50595) 0.0527243936945
(23945, 42447) 0.0527515904547
(23945, 2829) 0.0351677269698
(23945, 2832) 0.0175929392039
(23945, 52079) 0.0849796887889
(23945, 13523) 0.0878730969786
(23945, 57849) 0.133869666381
(23945, 25064) 0.128424780903
(23945, 31129) 0.0919760384953
(23945, 65601) 0.0388718258746
(23945, 1428) 0.391477289626
(23945, 2152) 0.655211469073
X_train_tfidf shape: (23946, 67816)
In Response to ttttthomasssss's Answer:
When I try to run the following:
X_cluster_0 = X_train_tfidf[cluster_0]
I get the error:
File "cluster.py", line 52, in main
X_cluster_0 = X_train_tfidf[cluster_0]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/csr.py", line 226, in __getitem__
col = key[1]
IndexError: tuple index out of range
Looking at the structure of cluster_0:
(array([ 858, 2012, 2256, 2762, 2920, 3770, 6052, 6174, 8296,
9494, 9966, 10085, 11914, 12117, 12633, 12727, 12993, 13527,
13754, 14186, 14669, 14713, 14973, 15071, 15157, 15208, 15926,
16300, 16301, 17138, 17556, 17775, 18236, 19057, 20106, 21014, 21080]),)
It's a tuple structure that has content in the 0th position so I changed the line to the following:
X_cluster_0 = X_train_tfidf[cluster_0[0]]
I am pulling "documents" from a database that I can easily obtain the index from (iterate the provided array until I find the respective document [assuming of course that scikit doesn't alter orderings of documents in the matrix]). So I don't understand exactly what X_cluster_0 represents. X_cluster_0 has the following structure:
X_cluster_0: (0, 42726) 0.741747456202
(0, 13535) 0.115880661286
(0, 17447) 0.117608794277
(0, 44849) 0.414829246262
(0, 14574) 0.10214258736
(0, 17317) 0.0634383214735
(0, 17935) 0.0591234431875
: :
(17, 33867) 0.0174155914371
(17, 48916) 0.0227046046275
(17, 59132) 0.0168864861723
(17, 40860) 0.0485813219503
(17, 63725) 0.0271415763987
(18, 45019) 0.490135684209
(18, 36168) 0.14595160766
(18, 52304) 0.139590524213
(18, 63586) 0.16501953796
(18, 28709) 0.15075416279
(18, 11495) 0.0926490431993
(18, 40860) 0.124236878928
Calculating Distance to Centroid
Currently running the suggested code (distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])) results in the following error:
File "cluster.py", line 68, in main
distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/spatial/distance.py", line 211, in euclidean
dist = norm(u - v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 197, in __sub__
raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
Here is what km.cluster_centers looks like:
km.cluster_centers: [ 9.47080802e-05 2.53907413e-03 0.00000000e+00 ..., 0.00000000e+00
0.00000000e+00 0.00000000e+00]
I guess the problem I am having now is how to extract the ith item of a matrix (assuming traversal of the matrix from left to right). Any level of index nesting I specify makes no difference (i.e. X_cluster_0[0], X_cluster_0[0][0], and X_cluster_0[0][0][0] all give me the same printed out matrix structure depicted above).
You can use the fit_predict() function to perform the clustering and obtain the cluster index assigned to each document.
You can try the following:
import numpy as np
import sklearn.cluster

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                            verbose=0, n_clusters=25)
clusters = km.fit_predict(X_train_tfidf)
# X_train_tfidf has shape (m, n); clusters is a 1-D array of length m
# holding the cluster index assigned to each document (row).
print(X_train_tfidf.shape)
print(clusters.shape)
# Example: get all documents in cluster 0.
# np.where returns a tuple of arrays, so take element [0] to get the row indices.
cluster_0 = np.where(clusters == 0)[0]
# cluster_0 now holds the row indices of the documents in this cluster;
# select the corresponding rows of the tf-idf matrix:
X_cluster_0 = X_train_tfidf[cluster_0]
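To make the mapping from cluster labels back to documents concrete, here is a minimal sketch on a tiny made-up corpus. The `docs` list, the `members` dictionary and the choice of two clusters are assumptions for illustration only; in your case `docs` would be `document_text_list`. It relies on scikit-learn keeping the rows of the matrix in the same order as the input list.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for document_text_list.
docs = [
    "hurricane storm coast tropical storm",
    "tropical storm hits the pacific coast",
    "ebola outbreak vaccine virus",
    "experimental ebola vaccine for the virus outbreak",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)  # labels[i] is the cluster assigned to docs[i]

# Map each cluster id to the row indices (= positions in docs) it contains.
members = {c: np.flatnonzero(labels == c) for c in range(km.n_clusters)}

for c, idx in members.items():
    print("Cluster %d:" % c)
    for i in idx:
        print("  [%d] %s" % (i, docs[i]))
```

The same labels are also stored on the fitted estimator as `km.labels_`, so a separate `predict` call is not needed for the training documents.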
You can get the centroids with centroids = km.cluster_centers_, which in your case should have dimensionality 25 (number of clusters) x n (number of features). To calculate, for example, the Euclidean distance of a document to a centroid you can use SciPy (see the scipy.spatial.distance documentation for the available distance metrics):
# Example: distance from one document to one cluster centroid (dense inputs only)
from scipy.spatial.distance import euclidean
distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
print(distance)
The distance metrics in scipy.spatial.distance require dense inputs, so if X_cluster_0 is a sparse matrix you could either convert the row you need to a dense array:
# toarray() densifies only this one row; ravel() flattens it to 1-D
d = euclidean(X_cluster_0[0].toarray().ravel(), km.cluster_centers_[0])
print(d)
Alternatively you could use scikit-learn's euclidean_distances() function, which also works with sparse matrices:
from sklearn.metrics.pairwise import euclidean_distances
# Equivalent to the SciPy example above. euclidean_distances expects 2-D inputs,
# hence the reshape, and it returns a 1 x 1 matrix rather than a scalar.
D = euclidean_distances(X_cluster_0.getrow(0), km.cluster_centers_[0].reshape(1, -1))
print(D)
Note that with the scikit-learn function you can also calculate the whole distance matrix at once:
# D[i, j] is the distance from row i of X_cluster_0 to centroid j
D = euclidean_distances(X_cluster_0, km.cluster_centers_)
print(D)
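If what you want is each document's distance to the centroid of its own cluster, one option is `KMeans.transform()`, which returns the distance from every row to every cluster centre. The following is a sketch on a made-up corpus; `docs`, `all_dist` and `own_dist` are illustrative names, not part of your code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for document_text_list.
docs = [
    "hurricane storm coast tropical storm",
    "tropical storm hits the pacific coast",
    "ebola outbreak vaccine virus",
    "experimental ebola vaccine for the virus outbreak",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() returns an (n_documents, n_clusters) array of Euclidean
# distances from every document to every cluster centre.
all_dist = km.transform(X)

# For each row, pick the column of the cluster that row was assigned to.
own_dist = all_dist[np.arange(X.shape[0]), km.labels_]

for i, (label, d) in enumerate(zip(km.labels_, own_dist)):
    print("doc %d -> cluster %d, distance %.4f" % (i, label, d))
```

This works directly on the sparse matrix, so nothing has to be converted to a dense array first.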
Regarding X_cluster_0: X_cluster_0 as well as X_train_tfidf are both sparse matrices (see the docs: scipy.sparse.csr.csr_matrix).
The interpretation of a dump such as
(0, 13535) 0.115880661286
(0, 17447) 0.117608794277
(0, 44849) 0.414829246262
(0, 14574) 0.10214258736
. .
. .
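would be as follows: (0, 13535) refers to row number 0 and column number 13535 of the matrix being printed, that is, document 0 and feature 13535 in your bag of words matrix. The following floating point number 0.115880661286 represents the tf-idf score for that feature in the given document. Note that the row numbers are relative to the matrix being printed: row i of X_cluster_0 is the i-th document selected by the index array, not document i of the full corpus.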
To find out the exact word you could try hasher.get_feature_names()[13535] (check len(hasher.get_feature_names()) first to see how many features you have). In recent scikit-learn releases the method is called get_feature_names_out().
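The following sketch shows one way to read a single row of such a matrix and translate its column numbers into words. It assumes `fit_transform` returns a SciPy `csr_matrix`, which is what the dump above suggests; `docs`, `vec` and `weights` are made-up names for illustration. It also shows why repeated `[0]` indexing appeared to change nothing: selecting a row of a `csr_matrix` gives back another (1 x n) sparse matrix, so you have to go through `.indices`/`.data` or `.toarray()` to reach the values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus.
docs = ["tropical storm hits the coast", "ebola vaccine trial"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix, one row per document

# get_feature_names_out() exists in scikit-learn >= 1.0; older releases
# only provide get_feature_names().
if hasattr(vec, "get_feature_names_out"):
    terms = vec.get_feature_names_out()
else:
    terms = vec.get_feature_names()

row = X[0]  # with csr_matrix this is still sparse, not a plain vector
# row.indices holds the column numbers of the non-zero entries,
# row.data the matching tf-idf weights.
weights = {terms[j]: w for j, w in zip(row.indices, row.data)}

for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print("%-10s %.4f" % (term, w))

# A plain 1-D NumPy array of length n_features, usable with SciPy distances.
dense_row = row.toarray().ravel()
```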
If your corpus variable document_text_list is a list of lists, then the corresponding document would simply be document_text_list[0].