Clustering with scipy - clusters via distance matrix, how to get back the original objects

Tags: python, scipy, cluster-analysis

I can't seem to find any sufficiently simple tutorials or descriptions of clustering in SciPy, so I'll try to explain my problem:

I'm trying to cluster documents (hierarchical agglomerative clustering). I have created a vector for each document and produced a symmetric distance matrix. vector_list contains the (really long) vectors representing each document. The order of this list is the same as that of my list of input documents, so that I'll (hopefully) be able to match the clustering results with the corresponding documents.

distances = distance.cdist(vector_list, vector_list, 'euclidean')

This gives a matrix like this, where the diagonal is each document's distance to itself (always 0):

[0 5 4]
[5 0 4]
[5 4 0]

I feed this distance matrix to SciPy's linkage() function.

clusters = hier.linkage(distances, method='centroid', metric='euclidean')

This returns something I can't quite interpret, but its type is numpy.ndarray. According to the docs, I can feed this into fcluster to get 'flat clusters'. I use half of the maximum distance in the distance matrix as the threshold.

idx = hier.fcluster(clusters, 0.5 * distances.max(), 'distance')

This returns a numpy.ndarray that again does not make much sense to me. An example is [6 3 1 7 1 8 9 4 5 2]

So my question: what do I get from the linkage and fcluster functions, and how can I get from there back to the documents I created the distance matrix from in the first place, to see whether the clusters make any sense? Am I doing this right?

258

asked Oct 11 '11 10:10

Eiriks

1 Answer

First off, you don't need to go through the entire process with cdist and linkage if you use fclusterdata instead of fcluster; you can feed that function an (n_documents, n_features) array of term counts, tf-idf values, or whatever your features are.
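As a rough sketch of how the two routes compare, assuming a small made-up feature matrix (the `vectors` array, the `single` linkage method and the threshold of 1.0 are illustrative choices, not taken from the question): `fclusterdata` works directly on the feature vectors, while the manual route goes through `pdist`, `linkage` and `fcluster`. Note that `linkage` expects either the raw observation vectors or a condensed (1-D) distance matrix such as the one `pdist` returns; a square matrix like the one from `cdist` would be interpreted as a set of observation vectors.

```python
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance

# Hypothetical stand-in for vector_list: 5 documents, 3 features each.
vectors = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [5.0, 5.0, 5.0],
    [5.1, 5.0, 5.0],
    [10.0, 0.0, 0.0],
])

threshold = 1.0  # illustrative cut-off distance

# One-step route: feature vectors in, flat cluster labels out.
labels_direct = hier.fclusterdata(
    vectors, threshold, criterion='distance',
    metric='euclidean', method='single',
)

# Manual route: condensed distances -> linkage matrix -> flat clusters.
condensed = distance.pdist(vectors, metric='euclidean')
# Each row of Z describes one merge:
# [cluster index a, cluster index b, merge distance, size of new cluster].
Z = hier.linkage(condensed, method='single')
labels_manual = hier.fcluster(Z, threshold, criterion='distance')

print(labels_direct)
print(labels_manual)
```

With this toy data both routes should assign the same labels, since `fclusterdata` performs the same three steps internally.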

The output from fclusterdata is the same as that of fcluster: an array T such that "T[i] is the flat cluster number to which original observation i belongs." I.e., the cluster.hierarchy module flattens the clustering according to a threshold, which you set at 0.5*distances.max(). In your case, the third and fifth documents are clustered together, but all the others form clusters of their own, so you might want to set the threshold higher or use a different criterion.
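To get from the label array back to the documents, one option is to group the document list by label, relying on the fact that position i in the labels corresponds to position i in the input. The sketch below uses the example output from the question; the `documents` list is a hypothetical placeholder for the real list of input documents, which must be in the same order as the vectors that were clustered.

```python
from collections import defaultdict

import numpy as np

# Hypothetical placeholder names; substitute your own document list,
# in the same order as vector_list.
documents = [f"doc{i}" for i in range(10)]

# The fcluster output quoted in the question.
idx = np.array([6, 3, 1, 7, 1, 8, 9, 4, 5, 2])

# Collect the documents that share a flat cluster number.
groups = defaultdict(list)
for doc, label in zip(documents, idx):
    groups[int(label)].append(doc)

for label in sorted(groups):
    print(label, groups[label])
```

Here cluster 1 contains the third and fifth documents, and every other cluster holds a single document.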

156

answered Oct 02 '22 22:10

Fred Foo
