 

Merge several graphml-files with networkx and remove duplicates

I'm new to programming, Python, and networkx (ouch!) and am trying to merge four graphml files into one while removing the duplicate nodes, following the excellent instructions here.

However, I can't figure out how to keep track of the duplicate nodes when there are FOUR files to compare instead of two. The code I've written below won't work, but hopefully you can see where my thinking goes wrong and help me.

# script to merge one or more graphml files into a single graphml file
import networkx as nx

# First, read the graphml files into Python and NetworkX (duplicate variables as necessary)
A = nx.read_graphml("file1.graphml")
B = nx.read_graphml("file2.graphml")
C = nx.read_graphml("file3.graphml")
D = nx.read_graphml("file4.graphml")

# Create a new graph variable containing all the previous graphs
H = nx.union(A,B,C,D, rename=('1-','2-','3-','4-'))

# Check what nodes are in two or more of the original graphml-files
duplicate_nodes_a_b = [n for n in A if n in B]
duplicate_nodes_b_c = [n for n in B if n in C]
duplicate_nodes_c_d = [n for n in C if n in D]

all_duplicate_nodes = # How should I get this?

# remove duplicate nodes
for n in all_duplicate_nodes:
    n1 = '1-' + str(n)
    n2 = '2-' + str(n)
    n3 = '3-' + str(n)
    n4 = '4-' + str(n)
    H.add_edges_from([(n1, nbr) for nbr in H[n2]]) # How can I take care of duplicate_nodes_b_c, duplicate_nodes_c_d?
    H.remove_node(n2)

# write the merged graph into a new graphml file
nx.write_graphml(H, "merged_file.graphml", encoding="utf-8", prettyprint=True)
asked Nov 19 '25 17:11 by mattiasostmar

1 Answer

First, note that the way you use nx.union is not what you want: it must be called with exactly two graphs. Handling the duplicates also gets complicated this way, because you have to consider all possible pairs of graphs to see how a node could be duplicated.
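To illustrate the two-graph signature (a minimal sketch with made-up toy graphs): nx.union requires the two node sets to be disjoint, which is what the rename prefixes are for.

```python
import networkx as nx

A = nx.Graph([("a", "b")])
B = nx.Graph([("b", "c")])  # "b" appears in both graphs

# union takes exactly two graphs; the rename prefixes
# make the node sets disjoint, so "b" ends up twice
G = nx.union(A, B, rename=("1-", "2-"))
print(sorted(G.nodes()))  # ['1-a', '1-b', '2-b', '2-c']
```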

Instead, let's be more direct and just count up in how many graphs each node appears. This is easy using a Counter:

import collections
ctr = collections.Counter()
for G in [A, B, C, D]:
    ctr.update(G)

Now determine which nodes just appear once, using the counter:

singles = {x for (x, n) in ctr.items() if n == 1}

With that set of nodes, we can then compute the subgraphs containing only nodes that are not duplicated:

E = nx.union(A.subgraph(singles), B.subgraph(singles))
F = nx.union(C.subgraph(singles), D.subgraph(singles))
H = nx.union(E, F)

The graph H has all four initial graphs merged with duplicates removed.
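Put together on some toy graphs (made up here for illustration; the real code would read your graphml files instead), the whole first approach looks like:

```python
import collections
import networkx as nx

# toy graphs; node "x" appears in two of them
A = nx.Graph([("a", "x")])
B = nx.Graph([("b", "x")])
C = nx.Graph([("c", "d")])
D = nx.Graph([("e", "f")])

# count in how many graphs each node appears
ctr = collections.Counter()
for G in [A, B, C, D]:
    ctr.update(G)

# nodes that appear in exactly one graph
singles = {x for (x, n) in ctr.items() if n == 1}

# union the non-duplicated subgraphs pairwise
E = nx.union(A.subgraph(singles), B.subgraph(singles))
F = nx.union(C.subgraph(singles), D.subgraph(singles))
H = nx.union(E, F)

print(sorted(H.nodes()))  # ['a', 'b', 'c', 'd', 'e', 'f'] -- "x" is gone
```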

The approach I've shown creates several intermediate graphs, so for large input graphs you may run into memory problems. If so, a similar approach works: determine the set of duplicated nodes, delete those nodes from the original graphs, and then build the union without keeping all the intermediates. It looks like this:

import collections
import networkx as nx

ctr = collections.Counter()
for G in [A, B, C, D]:
    ctr.update(G)

duplicates = {x for (x, n) in ctr.items() if n > 1}
H = nx.Graph()
for G in [A, B, C, D]:
    G.remove_nodes_from(duplicates) # change graphs in-place
    H = nx.union(H, G)

Both approaches take advantage of the way that NetworkX functions often allow extra nodes to be given and silently ignored.
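For example, `Graph.subgraph` and `Graph.remove_nodes_from` both accept nodes that aren't in the graph without raising an error (a small made-up demonstration):

```python
import networkx as nx

G = nx.Graph([("a", "b")])

# subgraph silently ignores "z", which is not a node of G
S = G.subgraph({"a", "z"})
print(sorted(S.nodes()))  # ['a']

# remove_nodes_from likewise ignores the unknown node
G.remove_nodes_from(["b", "never-there"])
print(sorted(G.nodes()))  # ['a']
```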

answered Nov 21 '25 07:11 by Michael J. Barber


