In sklearn there is one agglomerative clustering algorithm implemented: Ward's method, which minimizes variance. Usually sklearn is documented with lots of nice usage examples, but I couldn't find any examples of how to use this function.
Basically my problem is to draw a dendrogram according to the clustering of my data, but I don't understand the output from the function. The documentation says that it returns the children, the number of components, the number of leaves and the parents of each node.
Yet for my data samples the results don't make any sense. For a (32, 542) matrix clustered with a connectivity matrix, this is the output:
>>> wt = ward_tree(mymat, connectivity=connectivity, n_clusters=2)
>>> mymat.shape
(32, 542)
>>> wt
(array([[16,  0],
       [17,  1],
       [18,  2],
       [19,  3],
       [20,  4],
       [21,  5],
       [22,  6],
       [23,  7],
       [24,  8],
       [25,  9],
       [26, 10],
       [27, 11],
       [28, 12],
       [29, 13],
       [30, 14],
       [31, 15],
       [34, 33],
       [47, 46],
       [41, 40],
       [36, 35],
       [45, 44],
       [48, 32],
       [50, 42],
       [38, 37],
       [52, 43],
       [54, 39],
       [53, 51],
       [58, 55],
       [56, 49],
       [60, 57]]), 1, 32, array([32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 32,
       33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 53, 48,
       48, 51, 51, 55, 55, 57, 50, 50, 54, 56, 52, 52, 49, 49, 53, 60, 54,
       58, 56, 58, 57, 59, 60, 61, 59, 59, 61, 61]))
In this case I asked for two clusters, with 32 vectors containing the features. But where are the two clusters visible in this output? And what do the children really mean here? How can a child have a higher number than the total number of samples?
This is an alternative approach to cluster analysis: instead of using distance metrics or measures of association, it treats clustering as an analysis-of-variance problem. At each step, Ward's method merges the pair of clusters whose union gives the smallest increase in within-cluster variance, which makes it an agglomerative (bottom-up) algorithm.
To pick a good number of clusters for hierarchical clustering, we use a dendrogram, a tree-like chart that shows the sequence of merges (or splits) of clusters. When two clusters are merged, the dendrogram joins them, and the height of the join is the distance between those clusters.
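If the end goal is just drawing that dendrogram, it may be easier to go through scipy, which implements the same Ward criterion and returns a linkage matrix that plugs directly into its plotting helper. A minimal sketch (assuming scipy and matplotlib are installed; X here is a stand-in for your data matrix):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.randn(32, 542)   # stand-in for your (32, 542) data matrix
Z = linkage(X, method='ward')  # (n_samples - 1, 4) linkage matrix, Ward criterion
dendrogram(Z)                  # each merge is drawn at the height where it happens
plt.show()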
About the first element of the output, the documentation says
The children of each non-leaf node. Values less than n_samples refer to leaves of the tree. A greater value i indicates a node with children children[i - n_samples].
I had some trouble figuring out what this means, but then this code helped. We generate normally distributed data with two "clusters": one with 3 data points with mean 0, and one with 2 data points with mean 100. So we expect the first 3 data points to end up in one branch of the output tree and the other 2 in another.
from sklearn.cluster import ward_tree
import numpy as np
import itertools

# Two well-separated "clusters": 3 points around 0 and 2 points around 100.
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])

# ward_tree returns (children, n_components, n_leaves, parents).
w = ward_tree(X)

# Internal node ids start at n_leaves (= w[2]) and grow by one per merge.
ii = itertools.count(w[2])
[{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in w[0]]
Which produces the tree:
[{'node_id': 5, 'right': 2, 'left': 1},
{'node_id': 6, 'right': 4, 'left': 3},
{'node_id': 7, 'right': 5, 'left': 0},
{'node_id': 8, 'right': 7, 'left': 6}]
where the numbers are node IDs. If node_id < 5 (the number of samples), it is an index to a data point (a leaf node); if node_id >= 5, it is an internal node. We see that the data clusters as expected:
      8
     / \
    7   \
   / \   \
  5   \   6
 / \   \ / \
1   2  0 3  4
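To see where the two clusters are in the output: the last row of children describes the final merge, so its two entries are the roots of the two top-level branches, and the samples below each root form one cluster. A minimal sketch that walks the children array (reusing w from the example above; the helper leaves is mine, and the exact leaf order may vary between runs):
def leaves(node_id, children, n_samples):
    # Ids below n_samples are data points; any other id is an internal
    # node whose children are stored at row (node_id - n_samples).
    if node_id < n_samples:
        return [node_id]
    left, right = children[node_id - n_samples]
    return leaves(left, children, n_samples) + leaves(right, children, n_samples)

children, n_samples = w[0], w[2]
left_root, right_root = children[-1]  # the root's two children
print([int(i) for i in leaves(left_root, children, n_samples)])   # e.g. [1, 2, 0]
print([int(i) for i in leaves(right_root, children, n_samples)])  # e.g. [3, 4]
The same walk applies to the (32, 542) matrix in the question, except that n_samples is 32 and the root is the last of the 31 merges.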