In sklearn there is one agglomerative clustering algorithm implemented: Ward's method, which minimizes variance. Usually sklearn is documented with lots of nice usage examples, but I couldn't find any examples of how to use this function.
Basically my problem is to draw a dendrogram according to the clustering of my data, but I don't understand the output from the function. The documentation says that it returns the children, the number of components, the number of leaves and the parents of each node.
Yet for my data samples the results don't make any sense. For a (32, 542) matrix clustered with a connectivity matrix, this is the output:
>>> wt = ward_tree(mymat, connectivity=connectivity, n_clusters=2)
>>> mymat.shape
(32, 542)
>>> wt
(array([[16,  0],
       [17,  1],
       [18,  2],
       [19,  3],
       [20,  4],
       [21,  5],
       [22,  6],
       [23,  7],
       [24,  8],
       [25,  9],
       [26, 10],
       [27, 11],
       [28, 12],
       [29, 13],
       [30, 14],
       [31, 15],
       [34, 33],
       [47, 46],
       [41, 40],
       [36, 35],
       [45, 44],
       [48, 32],
       [50, 42],
       [38, 37],
       [52, 43],
       [54, 39],
       [53, 51],
       [58, 55],
       [56, 49],
       [60, 57]]), 1, 32, array([32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 32,
       33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 53, 48,
       48, 51, 51, 55, 55, 57, 50, 50, 54, 56, 52, 52, 49, 49, 53, 60, 54,
       58, 56, 58, 57, 59, 60, 61, 59, 59, 61, 61]))
In this case I asked for two clusters, with 32 vectors containing the features. But where are the two clusters visible in this output? And what do the children really mean here? How can a child have a higher number than the total number of samples?
This is an alternative approach to cluster analysis: instead of using distance metrics or measures of association, it treats clustering as an analysis-of-variance problem. At each step, Ward's method merges the pair of clusters whose union gives the smallest increase in within-cluster variance, which makes it an agglomerative (bottom-up) algorithm.
To pick a good number of clusters for hierarchical clustering, we use a dendrogram, a tree-like chart that shows the sequence of merges (or splits) of clusters. When two clusters are merged, the dendrogram joins them, and the height of the join is the distance between those clusters.
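If the end goal is just drawing that dendrogram, it may be easier to go through scipy, which implements the same Ward criterion and returns a linkage matrix that plugs directly into its plotting helper. A minimal sketch (assuming scipy and matplotlib are installed; X here is a stand-in for your data matrix):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.randn(32, 542)   # stand-in for your (32, 542) data matrix
Z = linkage(X, method='ward')  # (n_samples - 1, 4) linkage matrix, Ward criterion
dendrogram(Z)                  # each merge is drawn at the height where it happens
plt.show()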
About the first element of the output, the documentation says
The children of each non-leaf node. Values less than n_samples refer to leaves of the tree. A greater value i indicates a node with children children[i - n_samples].
I had some trouble figuring out what this means, but then this code helped. We generate normally distributed data with two "clusters": one with 3 data points with mean 0, and one with 2 data points with mean 100. So we expect the first 3 data points to end up in one branch of the output tree and the other 2 in another.
from sklearn.cluster import ward_tree
import numpy as np
import itertools

# Two well-separated "clusters": 3 points around 0 and 2 points around 100.
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])

# ward_tree returns (children, n_components, n_leaves, parents).
w = ward_tree(X)

# Internal node ids start at n_leaves (= w[2]) and grow by one per merge.
ii = itertools.count(w[2])
[{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in w[0]]
Which produces the tree:
[{'node_id': 5, 'right': 2, 'left': 1},
{'node_id': 6, 'right': 4, 'left': 3},
{'node_id': 7, 'right': 5, 'left': 0},
{'node_id': 8, 'right': 7, 'left': 6}]
where the numbers are node IDs. If node_id < 5 (the number of samples), it is an index to a data point (a leaf node); if node_id >= 5, it is an internal node. We see that the data clusters as expected:
      8
     / \
    7   \
   / \   \
  5   \   6
 / \   \ / \
1   2  0 3  4
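To see where the two clusters are in the output: the last row of children describes the final merge, so its two entries are the roots of the two top-level branches, and the samples below each root form one cluster. A minimal sketch that walks the children array (reusing w from the example above; the helper leaves is mine, and the exact leaf order may vary between runs):
def leaves(node_id, children, n_samples):
    # Ids below n_samples are data points; any other id is an internal
    # node whose children are stored at row (node_id - n_samples).
    if node_id < n_samples:
        return [node_id]
    left, right = children[node_id - n_samples]
    return leaves(left, children, n_samples) + leaves(right, children, n_samples)

children, n_samples = w[0], w[2]
left_root, right_root = children[-1]  # the root's two children
print([int(i) for i in leaves(left_root, children, n_samples)])   # e.g. [1, 2, 0]
print([int(i) for i in leaves(right_root, children, n_samples)])  # e.g. [3, 4]
The same walk applies to the (32, 542) matrix in the question, except that n_samples is 32 and the root is the last of the 31 merges.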