 

How to traverse a tree from sklearn AgglomerativeClustering?

I have a numpy text file array at: https://github.com/alvations/anythingyouwant/blob/master/WN_food.matrix

It's a distance matrix between terms and each other, my list of terms are as such: http://pastebin.com/2xGt7Xjh

I used the following code to generate a hierarchical clustering:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

matrix = np.loadtxt('WN_food.matrix')
n_clusters = 518
model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage="average", affinity="cosine")
model.fit(matrix)

To get the clusters for each term, I could have done:

for term, clusterid in enumerate(model.labels_):
    print(term, clusterid)

But how do I traverse the tree that the AgglomerativeClustering outputs?

Is it possible to convert it into a scipy dendrogram (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html)? And after that, how do I traverse the dendrogram?

asked Dec 09 '14 by alvas


People also ask

What is AgglomerativeClustering?

Agglomerative Clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.

What values can be used for the linkage parameter in AgglomerativeClustering?

linkage: the agglomeration (linkage) criterion used for computing the distance between clusters. In sklearn's AgglomerativeClustering, the allowed values are "ward", "complete", "average" and "single".

What are the important Hyperparameters for AgglomerativeClustering?

Agglomerative Clustering (hierarchical): the main hyperparameter of this mechanism is n_clusters (the number of clusters you want). Data points are successively merged, one pair at a time, until n_clusters clusters remain.
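For concreteness, here is a minimal sketch of how those hyperparameters are passed to sklearn's estimator; the toy data and parameter values below are made up for illustration:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two well-separated groups of points (made up for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# n_clusters controls where the merging stops; linkage and affinity control
# how distances between clusters are computed.
model = AgglomerativeClustering(n_clusters=2, linkage="average",
                                affinity="euclidean")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (the label numbering is arbitrary)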


2 Answers

I've answered a similar question for sklearn.cluster.ward_tree: How do you visualize a ward tree from sklearn.cluster.ward_tree?

AgglomerativeClustering outputs the tree in the same way, in the children_ attribute. Here's an adaptation of the code in the ward tree question for AgglomerativeClustering. It outputs the structure of the tree in the form (node_id, left_child, right_child) for each node of the tree.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
import itertools

X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
model = AgglomerativeClustering(linkage="average", affinity="cosine")
model.fit(X)

# Internal nodes are numbered starting at X.shape[0]; each row of
# model.children_ records one merge of two existing nodes.
ii = itertools.count(X.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in model.children_]
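
To actually walk the tree, one option is a small recursive helper that collects the leaf indices under any node. This is a minimal sketch assuming the model above; leaves_under is just an illustrative name, not part of sklearn:

def leaves_under(node_id, children, n_samples):
    # Leaves are the original samples, numbered 0 .. n_samples - 1.
    if node_id < n_samples:
        return [node_id]
    # Internal node i corresponds to row (i - n_samples) of children.
    left, right = children[node_id - n_samples]
    return (leaves_under(left, children, n_samples)
            + leaves_under(right, children, n_samples))

n_samples = X.shape[0]
root = 2 * n_samples - 2  # the last merge is the root of the tree
print(leaves_under(root, model.children_, n_samples))  # every sample index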

https://stackoverflow.com/a/26152118
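
As for turning the result into a scipy dendrogram: newer scikit-learn versions also expose a distances_ attribute (when compute_distances=True, or when distance_threshold is set with n_clusters=None), and children_ plus distances_ can be stacked into a scipy linkage matrix. A rough sketch, assuming such a version is available (full_model is just an illustrative name):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

# Build the full tree so that distances_ is populated (requires a
# scikit-learn version that supports distance_threshold / distances_;
# affinity was renamed to metric in recent releases).
full_model = AgglomerativeClustering(distance_threshold=0, n_clusters=None,
                                     linkage="average", affinity="cosine")
full_model.fit(X)

# scipy's linkage matrix needs, for each merge: left child, right child,
# merge distance, and the number of original samples under the new node.
n_samples = len(full_model.labels_)
counts = np.zeros(full_model.children_.shape[0])
for i, merge in enumerate(full_model.children_):
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                    for child in merge)
linkage_matrix = np.column_stack(
    [full_model.children_, full_model.distances_, counts]).astype(float)

dendrogram(linkage_matrix)
plt.show()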

answered Oct 08 '22 by A.P.


Adding to A.P.'s answer, here is code that will give you a dictionary of membership: members[node_id] gives all the data point indices (0 to n_points - 1) that fall under that node.

on_split is a simple reformat of A.P.'s clusters: it gives the two child nodes that are formed when node_id is split.

up_merge tells which node a given node_id merges into, and which other node it is combined with in that merge.

import copy
import itertools

# data_x is the data array and fit_cluster the fitted AgglomerativeClustering
# model (e.g. X and model from the answer above).
ii = itertools.count(data_x.shape[0])
clusters = [{'node_id': next(ii), 'left': x[0], 'right': x[1]}
            for x in fit_cluster.children_]

n_points = data_x.shape[0]
# Each leaf starts as its own member list; every merge concatenates the
# member lists of its two children.
members = {i: [i] for i in range(n_points)}
for cluster in clusters:
    node_id = cluster["node_id"]
    members[node_id] = copy.deepcopy(members[cluster["left"]])
    members[node_id].extend(copy.deepcopy(members[cluster["right"]]))

on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})
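
For example, a quick sketch of how these dictionaries can be queried (again assuming the fitted model from above):

# Node ids follow the same numbering as before: leaves are 0 .. n_points - 1,
# merge nodes come after, and the last merge is the root.
root = max(members)
print(members[root])    # all data point indices
print(on_split[root])   # the two child nodes of the root
print(up_merge[0])      # which node leaf 0 merges into, and together with what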
answered Oct 08 '22 by David Bernat