I can't seem to find any sufficiently simple tutorials or descriptions of clustering in SciPy, so I'll try to explain my problem:
I am trying to cluster documents (hierarchical agglomerative clustering). I have created a vector for each document and produced a symmetric distance matrix. vector_list contains (really long) vectors representing each document. The order of this list of vectors is the same as my list of input documents, so that I'll (hopefully) be able to match the results of the clustering with the corresponding documents.
distances = distance.cdist(vector_list, vector_list, 'euclidean')
This gives a matrix like this, where the diagonal is each document's distance to itself (always 0):
[0 5 4]
[5 0 4]
[4 4 0]
I feed this distance matrix to SciPy's linkage() function:
clusters = hier.linkage(distances, method='centroid', metric='euclidean')
This returns something I don't quite understand, but its datatype is numpy.ndarray. According to the docs, I can feed this into fcluster to get 'flat clusters'. I use half of the maximum distance in the distance matrix as the threshold:
idx = hier.fcluster(clusters, 0.5*distances.max(), 'distance')
This returns a numpy.ndarray that again does not make much sense to me. An example is [6 3 1 7 1 8 9 4 5 2]
So my question: what do I get from the linkage and fcluster functions, and how can I go from there back to the documents that I created the distance matrix for in the first place, to see if the clusters make any sense? Am I doing this right?
Clustering starts by computing a distance between every pair of units that you want to cluster. A distance matrix will be symmetric (because the distance between x and y is the same as the distance between y and x) and will have zeroes on the diagonal (because every item is distance zero from itself).
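The two properties described above can be checked directly. This is only a small sketch: the three 2-D vectors are made up for illustration and stand in for the real document vectors.

```python
import numpy as np
from scipy.spatial import distance

# Three made-up 2-D "document vectors" (illustrative only).
vectors = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Pairwise Euclidean distances between every pair of vectors.
distances = distance.cdist(vectors, vectors, 'euclidean')

# d(x, y) == d(y, x), so the matrix equals its transpose.
is_symmetric = np.allclose(distances, distances.T)
# d(x, x) == 0, so the diagonal is all zeros.
has_zero_diagonal = np.allclose(np.diag(distances), 0.0)

print(distances)
print(is_symmetric, has_zero_diagonal)
```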
Performs hierarchical/agglomerative clustering on the condensed distance matrix y. y must be an n(n-1)/2-sized vector, where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB linkage function.
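To make the "condensed" form concrete, here is a hedged sketch using made-up points. It assumes the 'average' method rather than the 'centroid' method from the question. Note that linkage interprets a 2-D array as raw observation vectors, so a square distance matrix should be converted with squareform (or pdist used from the start) before it is passed in.

```python
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial.distance import pdist, squareform

# Four made-up observations (illustrative only).
vectors = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
n = len(vectors)

# pdist returns the condensed form: one entry per unordered pair.
condensed = pdist(vectors, metric='euclidean')

# squareform converts between the condensed vector and the square matrix.
square = squareform(condensed)

# linkage accepts the condensed vector; the result has one row per merge.
# Each row holds: cluster index, cluster index, merge distance, new cluster size.
Z = hier.linkage(condensed, method='average')

print(condensed.shape, square.shape, Z.shape)
```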
cluster.hierarchy: These functions cut hierarchical clusterings into flat clusterings or find the roots of the forest formed by a cut by providing the flat cluster ids of each observation. fcluster(Z, t[, criterion, depth, R, monocrit])
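As a rough illustration of cutting a hierarchy into flat clusters, the sketch below builds a linkage from made-up points and calls fcluster with the 'distance' criterion. The threshold of 2.0 is an arbitrary choice that happens to suit this toy data.

```python
import numpy as np
from scipy.cluster import hierarchy as hier

# Two obvious groups of made-up points (illustrative only).
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Here linkage receives raw observations and computes the distances itself.
Z = hier.linkage(points, method='average', metric='euclidean')

# Cut the tree so that merges above distance 2.0 are not applied.
labels = hier.fcluster(Z, t=2.0, criterion='distance')

# labels[i] is the flat cluster id of points[i].
print(labels)
```

The label values themselves are arbitrary identifiers; what matters is which observations share a label.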
In average-linkage clustering, the distance between two clusters is defined as the average of the distances between all pairs of objects, where each pair is made up of one object from each cluster.
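The definition above can be computed by hand and compared with what linkage reports. This is a sketch on two made-up groups of two points each; the group names are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial.distance import cdist

# Two made-up groups (illustrative only).
group_a = np.array([[0.0, 0.0], [0.0, 2.0]])
group_b = np.array([[3.0, 0.0], [3.0, 2.0]])

# All pairwise distances with one object taken from each group.
pairwise = cdist(group_a, group_b, 'euclidean')

# Average linkage: the mean over all of those cross-group pairs.
average_linkage_distance = pairwise.mean()

# The last merge in the hierarchy joins the two groups, so its recorded
# distance should match the hand-computed value.
Z = hier.linkage(np.vstack([group_a, group_b]), method='average')
final_merge_distance = Z[-1, 2]

print(average_linkage_distance, final_merge_distance)
```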
First off, you don't need to go through the entire process with cdist and linkage if you use fclusterdata instead of fcluster; you can feed that function an (n_documents, n_features) array of term counts, tf-idf values, or whatever your features are.
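A minimal sketch of that shortcut, assuming a tiny made-up feature array in place of real term counts or tf-idf values. The threshold and the 'average' method are illustrative choices, not recommendations.

```python
import numpy as np
from scipy.cluster import hierarchy as hier

# Made-up (n_documents, n_features) array; real features would be much wider.
features = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# fclusterdata computes the distances, builds the linkage and cuts it in one call.
labels = hier.fclusterdata(
    features,
    t=2.0,
    criterion='distance',
    metric='euclidean',
    method='average',
)

print(labels)
```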
The output from fclusterdata is the same as that of fcluster: an array T such that "T[i] is the flat cluster number to which original observation i belongs." I.e., the cluster.hierarchy module flattens the clustering according to a threshold, which you set at 0.5*distances.max(). In your case, the third and fifth documents are clustered together, but all the others form clusters of their own, so you might want to set the threshold higher or use a different criterion.
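To get from the label array back to the documents, one option is to group the documents by label. The sketch below reuses the example output from the question; the document names are hypothetical placeholders and are assumed to be in the same order as the vectors.

```python
from collections import defaultdict

import numpy as np

# Hypothetical document names, in the same order as the input vectors.
documents = [f"doc_{i}" for i in range(10)]

# The example label array from the question.
labels = np.array([6, 3, 1, 7, 1, 8, 9, 4, 5, 2])

# Collect the documents that share each flat cluster number.
groups = defaultdict(list)
for doc, label in zip(documents, labels):
    groups[int(label)].append(doc)

for label in sorted(groups):
    print(label, groups[label])
```

Cluster 1 contains the third and fifth documents, and every other cluster holds a single document.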