Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting ndarray generated by hcluster into a Newick string for use with ete2 package

I have a list of vectors created by running:

import hcluster
import numpy as np
from ete2 import Tree

vecs = [np.array(i) for i in document_list] 

where document_list is a collection of web documents I am analysing. I then perform hierarchical clustering:

Z = hcluster.linkage(vecs, metric='cosine') 

This generates an ndarray such as:

[[ 12.          19.           0.           1.        ]
[ 15.          21.           0.           3.        ]
[ 18.          22.           0.           4.        ]
[  3.          16.           0.           7.        ]
[  8.          23.           0.           6.        ]
[  5.          27.           0.           6.        ]
[  1.          28.           0.           7.        ]
[  0.          21.           0.           2.        ]
[  5.          29.           0.18350472   2.        ]
[  2.          10.           0.18350472   3.        ]
[ 47.          30.           0.29289577   9.        ]
[ 13.          28.           0.29289577  13.        ]
[ 73.          32.           0.29289577  18.        ]
[ 26.          12.           0.42264521   5.        ]
[  5.          33.           0.42264521  12.        ]
[ 14.          35.           0.42264521  12.        ]
[ 19.          35.           0.42264521  18.        ]
[  4.          20.           0.31174826   3.        ]
[ 34.          21.           0.5         19.        ]
[ 38.          29.           0.31174826  21.        ]]

Is it possible to convert this ndarray into a newick string that can be passed to the ete2 Tree() constructor so that I can draw and manipulate a newick tree using the tools provided by ete2?

Does it even make sense to try and do this and if not is there another way that I can generate a tree/dendrogram using the same data and ete2 (I realise that there are other packages that can draw dendrograms such as dendropy and hcluster itself but would prefer to use ete2 all the same)?

Thanks!

like image 246
user1221408 Avatar asked Feb 20 '12 16:02

user1221408


2 Answers

I use the following approach for pretty much the same thing:

from hcluster import linkage, to_tree
from ete2 import Tree

#hcluster part
Y = dist_matrix(items, dist_fn)
Z = linkage(Y, "single")
T = to_tree(Z)

#ete2 section
root = Tree()
root.dist = 0
root.name = "root"
item2node = {T: root}

to_visit = [T]
while to_visit:
    node = to_visit.pop()
    cl_dist = node.dist /2.0
    for ch_node in [node.left, node.right]:
        if ch_node:
            ch = Tree()
            ch.dist = cl_dist
            ch.name = str(ch_node.id)
            item2node[node].add_child(ch)
            item2node[ch_node] = ch
            to_visit.append(ch_node)

# This is your ETE tree structure
tree = root
like image 67
jhc Avatar answered Oct 10 '22 17:10

jhc


Update:

from hcluster import linkage, to_tree
from ete2 import Tree

#hcluster part
Y = dist_matrix(items, dist_fn)
Z = linkage(Y, "single")

R,T       = to_tree( mat, rd=True )
#print "ROOT", R, "TREE", T
root      = Tree()
root.dist = 0
root.name = 'root'
item2node = {R.get_id(): root}
to_visit  = T

while to_visit:
    node = to_visit.pop()
    #print "NODE", node
    cl_dist = node.dist / 2.0

    for ch_node in [node.get_left(), node.get_right()]:
        if ch_node:
            ch_node_id         = ch_node.get_id()
            ch_node_name       = str(ch_node_id)
            ch                 = Tree()
            ch.dist            = cl_dist
            ch.name            = ch_node_name

            if nodeNames:
                if ch_node_id < len(nodeNames):
                    ch.name    = nodeNames[ ch_node_id ]

            item2node[ch_node_id] = ch
            item2node[ch_node_id].add_child(ch)
            to_visit.append(ch_node)
like image 41
Saulo Alves Avatar answered Oct 10 '22 16:10

Saulo Alves