The Girvan-Newman algorithm for community detection in networks detects communities by progressively removing edges from the original graph. The algorithm removes the “most valuable” edge, traditionally the edge with the highest betweenness centrality, at each step. As the graph breaks down into pieces, the tightly knit community structure is exposed and the result can be depicted as a dendrogram.
In NetworkX the implementation returns an iterator over tuples of sets. The first tuple is the first cut, consisting of 2 communities; the second tuple is the second cut, consisting of 3 communities; and so on, until the last tuple with n sets for n separate nodes (the leaves of the dendrogram).
import networkx as nx
G = nx.path_graph(10)
comp = nx.community.girvan_newman(G)
list(comp)
[({0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}), ({0, 1}, {2, 3, 4}, {5, 6, 7, 8, 9}), ({0, 1}, {2, 3, 4}, {5, 6}, {8, 9, 7}), ({0, 1}, {2}, {3, 4}, {5, 6}, {8, 9, 7}), ({0, 1}, {2}, {3, 4}, {5, 6}, {7}, {8, 9}), ({0}, {1}, {2}, {3, 4}, {5, 6}, {7}, {8, 9}), ({0}, {1}, {2}, {3}, {4}, {5, 6}, {7}, {8, 9}), ({0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8, 9}), ({0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9})]
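What makes this output a dendrogram is that each tuple differs from the previous one by exactly one community being split in two. A minimal sketch that makes this explicit, with the output above pasted in as a literal (the helper name find_splits is hypothetical, and the sketch assumes the graph starts out connected):

```python
# Partitions copied from the path_graph(10) output above.
partitions = [
    ({0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}),
    ({0, 1}, {2, 3, 4}, {5, 6, 7, 8, 9}),
    ({0, 1}, {2, 3, 4}, {5, 6}, {7, 8, 9}),
    ({0, 1}, {2}, {3, 4}, {5, 6}, {7, 8, 9}),
    ({0, 1}, {2}, {3, 4}, {5, 6}, {7}, {8, 9}),
    ({0}, {1}, {2}, {3, 4}, {5, 6}, {7}, {8, 9}),
    ({0}, {1}, {2}, {3}, {4}, {5, 6}, {7}, {8, 9}),
    ({0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8, 9}),
    ({0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}),
]

def find_splits(partitions):
    """Hypothetical helper: for each level, return (parent, [child, child])."""
    # Assumes a connected graph, so the root is the union of the first level.
    all_nodes = frozenset().union(*partitions[0])
    previous = {all_nodes}
    splits = []
    for level in partitions:
        current = {frozenset(c) for c in level}
        (parent,) = previous - current                  # the community that disappeared
        children = sorted(current - previous, key=min)  # the two that replaced it
        splits.append((parent, children))
        previous = current
    return splits

splits = find_splits(partitions)
for parent, (left, right) in splits:
    print(sorted(parent), "->", sorted(left), "+", sorted(right))
```

Each (parent, children) pair corresponds to one branching point of the dendrogram, read from the root down.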
The question is: how do I plot this dendrogram?
I've looked at scipy.cluster.hierarchy.dendrogram, but it expects a "linkage matrix", I'm guessing such as the one created by scipy.cluster.hierarchy.linkage, and I'm not sure how I would convert this list of tuples into such a matrix.
So I'm asking how to draw this dendrogram, with or without the help of SciPy's dendrogram.
Following @ItamarMushkin, I followed @mdml's answer with slight modifications and got what I wanted. At a high level, I turn NetworkX's Girvan-Newman iterator output into another DiGraph(), which I eventually want to see as a dendrogram. Then I build Z, the linkage matrix I pass to scipy.cluster.hierarchy.dendrogram, in the form of an edge list that includes the actual height for each dendrogram merge.
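For reference, the linkage matrix format that SciPy documents has one row per merge: the two cluster indices being merged, the merge height, and the number of leaves in the new cluster. Indices below n are the original leaves, and the cluster created by row i gets index n + i. A small hand-written sketch for three leaves (the helper cluster_members is hypothetical and only illustrates the indexing):

```python
# Three leaves with indices 0, 1, 2 (n = 3).
n = 3
Z = [
    [0.0, 1.0, 1.0, 2.0],  # row 0: merge leaves 0 and 1 at height 1.0 -> cluster 3
    [2.0, 3.0, 2.0, 3.0],  # row 1: merge leaf 2 with cluster 3 at height 2.0 -> cluster 4
]

def cluster_members(Z, n):
    """Hypothetical helper: map every cluster index to its sorted leaf indices."""
    members = {i: [i] for i in range(n)}
    for row_number, (a, b, _height, size) in enumerate(Z):
        merged = sorted(members[int(a)] + members[int(b)])
        assert len(merged) == int(size)  # the 4th column must equal the leaf count
        members[n + row_number] = merged
    return members

members = cluster_members(Z, n)
print(members[3])  # leaves under the cluster created by row 0
print(members[4])  # leaves under the root
```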
Two modifications I had to make to @mdml's answer:
1. The index dictionary.
2. A new get_merge_height function, which gives each merge a unique height according to the order in which Girvan-Newman removes edges. Otherwise, all merges of two nodes would have the same height in the dendrogram, all merges at the next level (two nodes plus one more) would have the same height, and so on.
I understand there may be some redundant iterations here; I haven't thought about optimization yet.
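To make the height rule concrete, here is a stripped-down sketch of the same arithmetic with the dictionary lookups removed. The function name merge_height and the rank values are made up for illustration; only the formula mirrors get_merge_height. The base height is the number of leaves in the merge, and merges of the same size are spread over [size, size + 0.8] according to their rank.

```python
def merge_height(size, rank, min_rank, max_rank):
    """Hypothetical stand-alone version of the height rule."""
    rank_range = (max_rank - min_rank) if max_rank > min_rank else 1
    return float(size) + 0.8 * (rank - min_rank) / rank_range

# Three merges that each join two leaves, with made-up ranks 10, 11 and 12.
heights = [merge_height(2, r, 10, 12) for r in (10, 11, 12)]
print(heights)

# When only one merge has a given size, the range falls back to 1,
# so that merge sits exactly at its base height.
single = merge_height(5, 7, 7, 7)
print(single)
```

Because the offset never exceeds 0.8, a merge of k leaves always stays below every merge of k + 1 leaves.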
import networkx as nx
from itertools import chain, combinations
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
# get simulated Graph() and Girvan-Newman communities list
G = nx.path_graph(10)
communities = list(nx.community.girvan_newman(G))
# building initial dict of node_id to each possible subset:
node_id = 0
init_node2community_dict = {node_id: communities[0][0].union(communities[0][1])}
for comm in communities:
    for subset in list(comm):
        if subset not in init_node2community_dict.values():
            node_id += 1
            init_node2community_dict[node_id] = subset
# turning this dictionary to the desired format in @mdml's answer
node_id_to_children = {e: [] for e in init_node2community_dict.keys()}
for node_id1, node_id2 in combinations(init_node2community_dict.keys(), 2):
    for node_id_parent, group in init_node2community_dict.items():
        if len(init_node2community_dict[node_id1].intersection(init_node2community_dict[node_id2])) == 0 and group == init_node2community_dict[node_id1].union(init_node2community_dict[node_id2]):
            node_id_to_children[node_id_parent].append(node_id1)
            node_id_to_children[node_id_parent].append(node_id2)
# also recording node_labels dict for the correct label for dendrogram leaves
node_labels = dict()
for node_id, group in init_node2community_dict.items():
    if len(group) == 1:
        node_labels[node_id] = list(group)[0]
    else:
        node_labels[node_id] = ''
# also needing a subset to rank dict to later know within all k-length merges which came first
subset_rank_dict = dict()
rank = 0
for e in communities[::-1]:
    for p in list(e):
        # look up the same sorted key that is stored, since set iteration order is not guaranteed
        if tuple(sorted(p)) not in subset_rank_dict:
            subset_rank_dict[tuple(sorted(p))] = rank
            rank += 1
subset_rank_dict[tuple(sorted(chain.from_iterable(communities[-1])))] = rank
# my function to get a merge height so that it is unique (probably not that efficient)
def get_merge_height(sub):
    sub_tuple = tuple(sorted([node_labels[i] for i in sub]))
    n = len(sub_tuple)
    other_same_len_merges = {k: v for k, v in subset_rank_dict.items() if len(k) == n}
    min_rank, max_rank = min(other_same_len_merges.values()), max(other_same_len_merges.values())
    # named rank_range to avoid shadowing the built-in range
    rank_range = (max_rank - min_rank) if max_rank > min_rank else 1
    return float(len(sub)) + 0.8 * (subset_rank_dict[sub_tuple] - min_rank) / rank_range
# finally using @mdml's magic, slightly modified:
G = nx.DiGraph(node_id_to_children)
nodes = G.nodes()
leaves = set( n for n in nodes if G.out_degree(n) == 0 )
inner_nodes = [ n for n in nodes if G.out_degree(n) > 0 ]
# Compute the size of each subtree
subtree = dict( (n, [n]) for n in leaves )
for u in inner_nodes:
    children = set()
    node_list = list(node_id_to_children[u])
    while len(node_list) > 0:
        v = node_list.pop(0)
        children.add( v )
        node_list += node_id_to_children[v]
    subtree[u] = sorted(children & leaves)
inner_nodes.sort(key=lambda n: len(subtree[n])) # <-- order inner nodes ascending by subtree size, root is last
# Construct the linkage matrix
leaves = sorted(leaves)
index = dict( (tuple([n]), i) for i, n in enumerate(leaves) )
Z = []
k = len(leaves)
for n in inner_nodes:
    children = node_id_to_children[n]
    x = children[0]
    for y in children[1:]:
        z = tuple(sorted(subtree[x] + subtree[y]))
        i, j = index[tuple(sorted(subtree[x]))], index[tuple(sorted(subtree[y]))]
        Z.append([i, j, get_merge_height(subtree[n]), len(z)]) # <-- float is required by the dendrogram function
        index[z] = k
        subtree[z] = list(z)
        x = z
        k += 1
# dendrogram
plt.figure()
dendrogram(Z, labels=[node_labels[node_id] for node_id in leaves])
plt.savefig('dendrogram.png')
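Since every Girvan-Newman level differs from the next by exactly one split, the linkage rows can also be built directly by reading the partitions backwards as merges. The following is a sketch of that alternative, not the approach used above: partitions_to_linkage is a hypothetical helper, it uses the merge order (1, 2, 3, ...) as the height instead of get_merge_height, and it assumes a connected graph so that the coarsest level has exactly two communities. The partitions are pasted in as a literal so the block runs without NetworkX.

```python
# Partitions as returned by list(nx.community.girvan_newman(nx.path_graph(10))).
partitions = [
    ({0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}),
    ({0, 1}, {2, 3, 4}, {5, 6, 7, 8, 9}),
    ({0, 1}, {2, 3, 4}, {5, 6}, {7, 8, 9}),
    ({0, 1}, {2}, {3, 4}, {5, 6}, {7, 8, 9}),
    ({0, 1}, {2}, {3, 4}, {5, 6}, {7}, {8, 9}),
    ({0}, {1}, {2}, {3, 4}, {5, 6}, {7}, {8, 9}),
    ({0}, {1}, {2}, {3}, {4}, {5, 6}, {7}, {8, 9}),
    ({0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8, 9}),
    ({0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}),
]

def partitions_to_linkage(partitions):
    """Hypothetical helper: turn Girvan-Newman levels into linkage-style rows."""
    leaves = sorted(node for community in partitions[-1] for node in community)
    n = len(leaves)
    cluster_id = {frozenset([node]): i for i, node in enumerate(leaves)}
    # Finest level first, then coarser levels, ending with one all-node cluster.
    levels = [{frozenset(c) for c in level} for level in reversed(partitions)]
    levels.append({frozenset(leaves)})
    Z = []
    for step, (finer, coarser) in enumerate(zip(levels, levels[1:]), start=1):
        (merged,) = coarser - finer                            # the one new cluster
        a, b = sorted(cluster_id[c] for c in finer - coarser)  # the two it replaces
        Z.append([float(a), float(b), float(step), float(len(merged))])
        cluster_id[merged] = n + step - 1                      # row i creates cluster n + i
    return leaves, Z

leaves, Z = partitions_to_linkage(partitions)
for row in Z:
    print(row)
```

The resulting Z and leaves should be usable in place of the ones built above, for example as dendrogram(Z, labels=leaves), though the branch heights will then reflect merge order only.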