
scipy dendrogram to json for d3.js tree visualisation

Tags:

json

scipy

d3.js

I am trying to convert the results of SciPy hierarchical clustering into JSON for display in d3.js. Here is an example.

The following code produces a dendrogram with 6 leaves.

import numpy as np
import pandas as pd
import scipy.spatial
import scipy.cluster

d = {'employee' : ['A', 'B', 'C', 'D', 'E', 'F'],
 'skillX': [2,8,3,6,8,10],
 'skillY': [8,15,6,9,7,10]}

d1 = pd.DataFrame(d)

distMat = xPairWiseDist = scipy.spatial.distance.pdist(np.array(d1[['skillX', 'skillY']]), 'euclidean')
clusters = scipy.cluster.hierarchy.linkage(distMat, method='single')
dendo  = scipy.cluster.hierarchy.dendrogram(clusters, labels = list(d1.employee), orientation = 'right')

dendo

My question: how can I represent the data in a JSON file in a format that d3.js understands?

{"name": "Root1",
 "children": [{"name": "B"},
              {"name": "E-D-F-C-A",
               "children": [{"name": "C-A",
                             "children": [{"name": "A"},
                                          {"name": "C"}]
                            }]
              }]
}

The embarrassing truth is that I do not know whether I can extract this information from the dendrogram or from the linkage matrix, or how.

I am thankful for any help I can get.

EDIT TO CLARIFY

So far, I have tried to use the to_tree method but have difficulty understanding its structure (yes, I read the documentation).

a = scipy.cluster.hierarchy.to_tree(clusters , rd=True)

for x in a[1]:
    # print(x.get_id())
    if not x.is_leaf():
        print(x.get_left().get_id(), x.get_right().get_id(), x.get_count())
user1043144 asked Nov 13 '13 at 21:11


1 Answer

You can do this in three steps:

  1. Recursively construct a nested dictionary that represents the tree returned by Scipy's to_tree method.
  2. Iterate through the nested dictionary to label each internal node with the leaves in its subtree.
  3. Dump the resulting nested dictionary to JSON and load it into D3.
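
The three steps can also be folded into a single recursive pass. The following is a minimal sketch under stated assumptions, not the code used in the rest of this answer: `to_d3` and the `Node` stand-in are hypothetical names introduced here, and `Node` mimics only the `id`, `left`, and `right` attributes of SciPy's ClusterNode so that the example runs without SciPy.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the same attributes as SciPy's
# ClusterNode (id, left, right), so the sketch runs without SciPy.
Node = namedtuple("Node", ["id", "left", "right"])

def to_d3(node, id2name):
    """Return (d3_dict, leaf_names) for the subtree rooted at node."""
    # A leaf has no children: its name comes straight from the id -> label map
    if node.left is None and node.right is None:
        name = str(id2name[node.id])
        return {"name": name}, [name]

    # An internal node collects its children and all leaf names beneath it
    children, leaves = [], []
    for child in (node.left, node.right):
        if child is not None:
            sub, sub_leaves = to_d3(child, id2name)
            children.append(sub)
            leaves.extend(sub_leaves)
    return {"name": "-".join(sorted(leaves)), "children": children}, leaves

# Toy tree: leaves 0 and 1 merge into node 3, which merges with leaf 2 into node 4
tree = Node(4, Node(3, Node(0, None, None), Node(1, None, None)), Node(2, None, None))
d3_tree, _ = to_d3(tree, {0: "A", 1: "C", 2: "B"})
print(d3_tree)
```

With the real data you would pass the root returned by to_tree(clusters, rd=False) and a mapping such as dict(enumerate(d1.employee)). Unlike the step-by-step version, this sketch leaves out the empty "children" list on leaves.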

Construct a nested dictionary representing the dendrogram

For the first step, it is important to call to_tree with rd=False so that the root of the dendrogram is returned. From that root, you can construct the nested dictionary as follows:

# Create a nested dictionary from the ClusterNodes returned by SciPy
def add_node(node, parent ):
    # First create the new node and append it to its parent's children
    newNode = dict( node_id=node.id, children=[] )
    parent["children"].append( newNode )

    # Recursively add the current node's children
    if node.left: add_node( node.left, newNode )
    if node.right: add_node( node.right, newNode )

T = scipy.cluster.hierarchy.to_tree( clusters , rd=False )
d3Dendro = dict(children=[], name="Root1")
add_node( T, d3Dendro )
# Output: => {'name': 'Root1', 'children': [{'node_id': 10, 'children': [{'node_id': 1, 'children': []}, {'node_id': 9, 'children': [{'node_id': 6, 'children': [{'node_id': 0, 'children': []}, {'node_id': 2, 'children': []}]}, {'node_id': 8, 'children': [{'node_id': 5, 'children': []}, {'node_id': 7, 'children': [{'node_id': 3, 'children': []}, {'node_id': 4, 'children': []}]}]}]}]}]}

The basic idea is to start with a node that is not in the dendrogram and serves as the root of the whole structure. We then recursively add left and right children to this dictionary until we reach the leaves. At this point we do not have labels for the nodes, so I'm just labeling them by their ClusterNode ID.
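
To see where the ids 6 to 10 in that output come from: with n observations, the leaves keep ids 0 to n-1, and row i of the linkage matrix creates the cluster with id n + i. The sketch below is a hedged illustration in plain Python; the `merges` list is read off the tree printed in the output comment rather than computed, and `members` is a hypothetical helper dictionary.

```python
# Merge pairs read off the tree printed in the output comment.
# Row i of a linkage matrix for n observations creates cluster id n + i.
n = 6
merges = [(0, 2), (3, 4), (5, 7), (6, 8), (1, 9)]
labels = ["A", "B", "C", "D", "E", "F"]

# members[id] holds the sorted leaf labels contained in that cluster
members = {i: [labels[i]] for i in range(n)}
for i, (a, b) in enumerate(merges):
    members[n + i] = sorted(members[a] + members[b])

# Print each internal cluster id with its "-"-joined leaf labels
for node_id in range(n, n + len(merges)):
    print(node_id, "-".join(members[node_id]))
```

The last id, 10, contains every leaf, which is why it is the single child hanging under "Root1".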

Label the dendrogram

Next, we need to use the node_ids to label the dendrogram. The comments should be enough explanation for how this works.

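from functools import reduce  # reduce is not a builtin in Python 3

# Map leaf ids to labels. Assumed definition: id2name is used below but
# never defined in this snippet; leaf ids follow the row order of d1.
id2name = dict(enumerate(d1.employee))

# Label each node with the names of each leaf in its subtree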
def label_tree( n ):
    # If the node is a leaf, then we have its name
    if len(n["children"]) == 0:
        leafNames = [ id2name[n["node_id"]] ]

    # If not, flatten all the leaves in the node's subtree
    else:
        leafNames = reduce(lambda ls, c: ls + label_tree(c), n["children"], [])

    # Delete the node id since we don't need it anymore and
    # it makes for cleaner JSON
    del n["node_id"]

    # Labeling convention: "-"-separated leaf names
    n["name"] = "-".join(sorted(map(str, leafNames)))

    return leafNames

label_tree( d3Dendro["children"][0] )

Dump to JSON and load into D3

Finally, after the dendrogram has been labeled, we just need to output it to JSON and load into D3. I'm just pasting the Python code to dump it to JSON here for completeness.

import json

# Output to JSON
with open("d3-dendrogram.json", "w") as f:
    json.dump(d3Dendro, f, sort_keys=True, indent=4)
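
Before loading the file into D3, it can help to check that the JSON round-trips and still contains the expected number of leaves. This is a minimal sketch: `count_leaves` is a hypothetical helper, and `sample` is a small stand-in with the same shape as d3Dendro rather than the real tree.

```python
import json

def count_leaves(node):
    """Count nodes whose "children" list is missing or empty."""
    kids = node.get("children", [])
    return 1 if not kids else sum(count_leaves(k) for k in kids)

# Small stand-in tree in the same shape as d3Dendro
sample = {
    "name": "Root1",
    "children": [
        {"name": "A-C", "children": [
            {"name": "A", "children": []},
            {"name": "C", "children": []},
        ]},
    ],
}

# Serialize with the same options as the dump, then parse it back
text = json.dumps(sample, sort_keys=True, indent=4)
loaded = json.loads(text)
print(count_leaves(loaded))
```

For the employee data, the same check on the parsed d3-dendrogram.json should report 6 leaves.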

Output

I created Scipy and D3 versions of the dendrogram below. For the D3 version, I simply plugged the JSON file I output ('d3-dendrogram.json') into this Gist.

[Figure: the dendrogram output by SciPy]

[Figure: the dendrogram output by D3]

mdml answered Sep 23 '22 at 00:09