
Tutorial for scipy.cluster.hierarchy [closed]

Tags: python, scipy, hierarchical-clustering

I'm trying to understand how to work with a hierarchical clustering, but the documentation is too ... technical? ... and I can't understand how it works.

Is there a tutorial that could help me get started, explaining some simple tasks step by step?

Let's say I have the following data set:

    import numpy as np

    a = np.array([[0,   0  ],
                  [1,   0  ],
                  [0,   1  ],
                  [1,   1  ],
                  [0.5, 0  ],
                  [0,   0.5],
                  [0.5, 0.5],
                  [2,   2  ],
                  [2,   3  ],
                  [3,   2  ],
                  [3,   3  ]])

I can easily compute the hierarchical clustering and plot the dendrogram:

    from scipy.cluster.hierarchy import linkage, dendrogram

    z = linkage(a)
    d = dendrogram(z)

- Now, how can I recover a specific cluster, say the one with elements [0, 1, 2, 4, 5, 6] in the dendrogram?
- How can I get back the values of those elements?


asked Feb 07 '14 21:02

user2988577

1 Answer

There are three steps in hierarchical agglomerative clustering (HAC):

1. Quantify the data (metric argument)
2. Cluster the data (method argument)
3. Choose the number of clusters

Doing

z = linkage(a)

will accomplish the first two steps. Since you did not specify any parameters, it uses the default values

1. metric = 'euclidean'
2. method = 'single'
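As a quick check of those defaults, here is a minimal sketch using the data set from the question. It compares `linkage(a)` with a call that spells out both arguments, and with a call on the condensed distance matrix from `pdist` (step 1 done by hand). It also prints the rows of `z`: each row describes one merge as `[cluster_i, cluster_j, distance, size]`, where ids below `len(a)` are original points and id `len(a) + k` is the cluster created in row `k`. The variable names are only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

a = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [0.5, 0], [0, 0.5], [0.5, 0.5],
              [2, 2], [2, 3], [3, 2], [3, 3]])

# Default call, as in the question
z_default = linkage(a)

# Same call with the default arguments written out
z_explicit = linkage(a, method='single', metric='euclidean')

# Step 1 by hand: condensed pairwise distances, then step 2 on top of them
d = pdist(a, metric='euclidean')
z_from_dist = linkage(d, method='single')

same_as_explicit = np.allclose(z_default, z_explicit)
same_as_pdist = np.allclose(z_default, z_from_dist)
print(same_as_explicit, same_as_pdist)

# One row per merge: the two merged cluster ids, the merge distance,
# and the number of original points in the new cluster
for k, (i, j, dist, size) in enumerate(z_default):
    print("row {}: merge {} and {} at distance {:.3f} -> cluster {} with {} points"
          .format(k, int(i), int(j), dist, len(a) + k, int(size)))
```

With `n` observations the linkage matrix always has `n - 1` rows, and the last row is the merge that joins everything into one cluster.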

So z = linkage(a) gives you a single-linkage hierarchical agglomerative clustering of a. The result is a hierarchy of solutions rather than a single partition: each level of the tree corresponds to a different number of clusters. From this hierarchy you get some information about the structure of your data. What you might do now is:

- Check which metric is appropriate; e.g. cityblock or chebyshev will quantify your data differently (cityblock, euclidean and chebyshev correspond to the L1, L2 and L_inf norms).
- Check the different properties and behaviours of the methods (e.g. single, complete and average).
- Check how to determine the number of clusters, e.g. by reading the Wikipedia article about it.
- Compute indices on the solutions (clusterings) you found, such as the silhouette coefficient, which tells you how well each point/observation fits the cluster it was assigned to. Different indices use different criteria to assess a clustering.
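To make these points a bit more concrete, here is a minimal sketch. It first shows how the three metrics measure the same pair of points differently (SciPy spells the L_inf metric `chebyshev`), then cuts the `single`, `complete` and `average` hierarchies into flat clusters and scores each result with the mean silhouette coefficient. The helper `mean_silhouette` is hypothetical and hand-written for this example, not part of SciPy; the data set is the one from the question, and cutting into two clusters is an assumption made only for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist, squareform

a = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [0.5, 0], [0, 0.5], [0.5, 0.5],
              [2, 2], [2, 3], [3, 2], [3, 3]])

# Same two points, three metrics (L1, L2 and L_inf norm of the difference)
p = np.array([[0, 0], [3, 4]])
metric_dist = {m: pdist(p, metric=m)[0]
               for m in ['cityblock', 'euclidean', 'chebyshev']}
print(metric_dist)  # cityblock 7.0, euclidean 5.0, chebyshev 4.0


def mean_silhouette(points, labels, metric='euclidean'):
    """Mean silhouette coefficient (illustrative helper, not a SciPy function)."""
    dist = squareform(pdist(points, metric=metric))
    labels = np.asarray(labels)
    unique = np.unique(labels)
    if len(unique) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = np.zeros(len(points))
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # usual convention: a singleton cluster scores 0
        # Mean distance to the other members of the point's own cluster
        # (dist[i, i] is 0, so it does not affect the sum)
        within = dist[i, own].sum() / (n_own - 1)
        # Smallest mean distance to the members of any other cluster
        nearest = min(dist[i, labels == other].mean()
                      for other in unique if other != labels[i])
        scores[i] = (nearest - within) / max(within, nearest)
    return scores.mean()


results = {}
for method in ['single', 'complete', 'average']:
    z = hac.linkage(a, method=method, metric='euclidean')
    labels = hac.fcluster(z, 2, 'maxclust')  # assumed: cut into two clusters
    results[method] = mean_silhouette(a, labels)
    print(method, labels, round(results[method], 3))
```

Values close to 1 mean the points sit well inside their clusters, values near 0 mean they lie between clusters, and negative values suggest a point may be in the wrong cluster.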

Here is something to start with:

    import numpy as np
    import scipy.cluster.hierarchy as hac
    import matplotlib.pyplot as plt

    a = np.array([[0.1,   2.5],
                  [1.5,   .4 ],
                  [0.3,   1  ],
                  [1  ,   .8 ],
                  [0.5,   0  ],
                  [0  ,   0.5],
                  [0.5,   0.5],
                  [2.7,   2  ],
                  [2.2,   3.1],
                  [3  ,   2  ],
                  [3.2,   1.3]])

    fig, axes23 = plt.subplots(2, 3)

    for method, axes in zip(['single', 'complete'], axes23):
        z = hac.linkage(a, method=method)

        # Scree plot: merge distances, last (largest) merge first
        axes[0].plot(range(1, len(z)+1), z[::-1, 2])
        # Second difference of the merge distances; a large value marks a knee
        knee = np.diff(z[::-1, 2], 2)
        axes[0].plot(range(2, len(z)), knee)

        # Best and second-best candidate for the number of clusters
        num_clust1 = knee.argmax() + 2
        knee[knee.argmax()] = 0
        num_clust2 = knee.argmax() + 2

        axes[0].text(num_clust1, z[::-1, 2][num_clust1-1], 'possible\n<- knee point')

        # Flat clusterings for both candidates
        part1 = hac.fcluster(z, num_clust1, 'maxclust')
        part2 = hac.fcluster(z, num_clust2, 'maxclust')

        clr = ['#2200CC' ,'#D9007E' ,'#FF6600' ,'#FFCC00' ,'#ACE600' ,'#0099CC' ,
        '#8900CC' ,'#FF0000' ,'#FF9900' ,'#FFFF00' ,'#00CC01' ,'#0055CC']

        # One scatter plot per partition, one colour per cluster label
        for part, ax in zip([part1, part2], axes[1:]):
            for cluster in set(part):
                ax.scatter(a[part == cluster, 0], a[part == cluster, 1],
                           color=clr[cluster])

        m = '\n(method: {})'.format(method)
        plt.setp(axes[0], title='Screeplot{}'.format(m), xlabel='partition',
                 ylabel='{}\ncluster distance'.format(m))
        plt.setp(axes[1], title='{} Clusters'.format(num_clust1))
        plt.setp(axes[2], title='{} Clusters'.format(num_clust2))

    plt.tight_layout()
    plt.show()

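This gives a 2x3 grid of plots, one row per method: the scree plot with the candidate knee point marked, followed by scatter plots of the two candidate partitions (image: https://i.stack.imgur.com/YKRpm.png).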
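Coming back to the two concrete questions: one way to recover the cluster `[0, 1, 2, 4, 5, 6]` is to cut the tree with `fcluster` just above the height at which that cluster forms, and then index `a` with the resulting labels. The sketch below assumes the data set from the question and the default single linkage. The cut height `0.6` is an assumption read off the dendrogram (that cluster forms at height 0.5 and the next merge happens at about 0.71). As an alternative, `to_tree` turns `z` into node objects, so you can list the leaves under any node of the dendrogram.

```python
import numpy as np
import scipy.cluster.hierarchy as hac

a = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [0.5, 0], [0, 0.5], [0.5, 0.5],
              [2, 2], [2, 3], [3, 2], [3, 3]])

z = hac.linkage(a)

# Cut the dendrogram at height 0.6 (assumed, read off the plot):
# points end up in the same flat cluster if they merge at or below that height
labels = hac.fcluster(z, t=0.6, criterion='distance')

# Indices of all points that share a flat cluster with point 0 ...
members = np.flatnonzero(labels == labels[0])
# ... and their coordinates
values = a[members]
print(members)  # indices 0, 1, 2, 4, 5, 6 for this data
print(values)

# Alternative: walk the tree. nodes[i] is the cluster with id i;
# ids below len(a) are single points, the rest are merged clusters.
root, nodes = hac.to_tree(z, rd=True)
for node in nodes[len(a):]:
    print(node.id, round(node.dist, 3), sorted(node.pre_order()))

# Pick the node whose leaves are exactly the wanted elements
wanted = [0, 1, 2, 4, 5, 6]
target = next(n for n in nodes if sorted(n.pre_order()) == wanted)
values_from_tree = a[sorted(target.pre_order())]
```

The `fcluster` route is usually enough when you know the cut height or the number of clusters; the tree route helps when you want the members of one particular branch of the dendrogram.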

answered Oct 05 '22 11:10

embert
