 

How to traverse a tree from sklearn AgglomerativeClustering?

I have a numpy text file array at: https://github.com/alvations/anythingyouwant/blob/master/WN_food.matrix

It's a distance matrix between terms and each other, my list of terms are as such: http://pastebin.com/2xGt7Xjh

I used the following code to generate a hierarchical clustering:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

matrix = np.loadtxt('WN_food.matrix')
n_clusters = 518
model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage="average", affinity="cosine")
model.fit(matrix)

To get the clusters for each term, I could have done:

for term, clusterid in enumerate(model.labels_):
    print(term, clusterid)

But how do I traverse the tree that the AgglomerativeClustering outputs?

Is it possible to convert it into a scipy dendrogram (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html)? And after that, how do I traverse the dendrogram?

asked Dec 09 '14 by alvas


People also ask

What is AgglomerativeClustering?

Agglomerative Clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.

What values can be used for the linkage parameter in AgglomerativeClustering?

linkage: the agglomeration (linkage) criterion used for computing the distance between clusters. In sklearn's AgglomerativeClustering, the allowed values are "ward", "complete", "average" and "single".

What are the important Hyperparameters for AgglomerativeClustering?

Agglomerative Clustering (hierarchical): the main hyperparameter of this mechanism is n_clusters (the number of clusters you want). Data points are successively merged, one pair at a time, until n_clusters clusters remain.
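For concreteness, here is a minimal sketch of how those hyperparameters are passed to sklearn's estimator; the toy data and parameter values below are made up for illustration:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two well-separated groups of points (made up for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# n_clusters controls where the merging stops; linkage and affinity control
# how distances between clusters are computed.
model = AgglomerativeClustering(n_clusters=2, linkage="average",
                                affinity="euclidean")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (the label numbering is arbitrary)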


2 Answers

I've answered a similar question for sklearn.cluster.ward_tree: How do you visualize a ward tree from sklearn.cluster.ward_tree?

AgglomerativeClustering outputs the tree in the same way, in the children_ attribute. Here's an adaptation of the code in the ward tree question for AgglomerativeClustering. It outputs the structure of the tree in the form (node_id, left_child, right_child) for each node of the tree.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
import itertools

X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
model = AgglomerativeClustering(linkage="average", affinity="cosine")
model.fit(X)

# Internal nodes are numbered starting at X.shape[0]; each row of
# model.children_ records one merge of two existing nodes.
ii = itertools.count(X.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in model.children_]
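
To actually walk the tree, one option is a small recursive helper that collects the leaf indices under any node. This is a minimal sketch assuming the model above; leaves_under is just an illustrative name, not part of sklearn:

def leaves_under(node_id, children, n_samples):
    # Leaves are the original samples, numbered 0 .. n_samples - 1.
    if node_id < n_samples:
        return [node_id]
    # Internal node i corresponds to row (i - n_samples) of children.
    left, right = children[node_id - n_samples]
    return (leaves_under(left, children, n_samples)
            + leaves_under(right, children, n_samples))

n_samples = X.shape[0]
root = 2 * n_samples - 2  # the last merge is the root of the tree
print(leaves_under(root, model.children_, n_samples))  # every sample index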

https://stackoverflow.com/a/26152118
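
As for turning the result into a scipy dendrogram: newer scikit-learn versions also expose a distances_ attribute (when compute_distances=True, or when distance_threshold is set with n_clusters=None), and children_ plus distances_ can be stacked into a scipy linkage matrix. A rough sketch, assuming such a version is available (full_model is just an illustrative name):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

# Build the full tree so that distances_ is populated (requires a
# scikit-learn version that supports distance_threshold / distances_;
# affinity was renamed to metric in recent releases).
full_model = AgglomerativeClustering(distance_threshold=0, n_clusters=None,
                                     linkage="average", affinity="cosine")
full_model.fit(X)

# scipy's linkage matrix needs, for each merge: left child, right child,
# merge distance, and the number of original samples under the new node.
n_samples = len(full_model.labels_)
counts = np.zeros(full_model.children_.shape[0])
for i, merge in enumerate(full_model.children_):
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                    for child in merge)
linkage_matrix = np.column_stack(
    [full_model.children_, full_model.distances_, counts]).astype(float)

dendrogram(linkage_matrix)
plt.show()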

answered Oct 08 '22 by A.P.


Adding to A.P.'s answer, here is code that will give you a dictionary of membership: members[node_id] gives all the data point indices (0 to n_points - 1) that fall under that node.

on_split is a simple reformat of A.P.'s clusters: it gives the two child nodes that are formed when node_id is split.

up_merge tells which node a given node_id merges into, and which other node it is combined with in that merge.

import copy
import itertools

# data_x is the data array and fit_cluster the fitted AgglomerativeClustering
# model (e.g. X and model from the answer above).
ii = itertools.count(data_x.shape[0])
clusters = [{'node_id': next(ii), 'left': x[0], 'right': x[1]}
            for x in fit_cluster.children_]

n_points = data_x.shape[0]
# Each leaf starts as its own member list; every merge concatenates the
# member lists of its two children.
members = {i: [i] for i in range(n_points)}
for cluster in clusters:
    node_id = cluster["node_id"]
    members[node_id] = copy.deepcopy(members[cluster["left"]])
    members[node_id].extend(copy.deepcopy(members[cluster["right"]]))

on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})
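
For example, a quick sketch of how these dictionaries can be queried (again assuming the fitted model from above):

# Node ids follow the same numbering as before: leaves are 0 .. n_points - 1,
# merge nodes come after, and the last merge is the root.
root = max(members)
print(members[root])    # all data point indices
print(on_split[root])   # the two child nodes of the root
print(up_merge[0])      # which node leaf 0 merges into, and together with what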
answered Oct 08 '22 by David Bernat