How to monitor status of networkx graph creation?

I have a data set that is a csv/txt file representing a network. Each line in the file contains two node names separated by a comma. My data file contains about 330k nodes and about 550k edges. I am trying to create a very rudimentary graph of this (yes, I know it will be very cluttered) using the following code:

import networkx as nx
import matplotlib.pyplot as plt
import sys
import numpy as np

f = open('dataFile.txt', 'rb')
G = nx.read_edgelist(f, delimiter=',', nodetype=str)
f.close()

print(nx.number_of_nodes(G))
print(nx.number_of_edges(G))

plt.figure(1)
nx.draw(G)
plt.savefig("graph.pdf")

I am running this on an AWS EC2 m4.4xlarge instance, and it is pegging the CPUs at 100% while using only 1% of the memory.

That surprises me, since I thought networkx was memory intensive, not a CPU hog. Right now, it is spinning on the nx.draw command. Is there any way I can monitor how far along the graph generation is?

CJ Sullivan asked Jan 07 '23 01:01


1 Answer

Networkx is really not suited to this task. It is very slow at this scale. In addition, matplotlib (which nx.draw uses) will never succeed in drawing that many objects.
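If you stay with networkx for now, one rough way to get some visibility is to split the work. When nx.draw is called without a pos argument it first computes a spring layout and only then passes everything to matplotlib, and neither step reports progress. The sketch below runs nx.spring_layout in small chunks of iterations and prints a line after each chunk. The helper name layout_with_progress, the chunk size, and the small random test graph are my own assumptions for illustration. Restarting the layout from the previous positions is also not identical to one uninterrupted run, since each call starts its own cooling schedule, so treat the result as approximate.

```python
import time

import networkx as nx


def layout_with_progress(G, total_iterations=50, chunk=5, seed=42):
    """Run spring_layout in small chunks and print progress after each chunk.

    Hypothetical helper, not part of networkx.
    """
    pos = None  # first chunk starts from random positions
    done = 0
    start = time.perf_counter()
    while done < total_iterations:
        step = min(chunk, total_iterations - done)
        # Pass the previous positions back in so each chunk continues
        # from where the last one stopped.
        pos = nx.spring_layout(G, pos=pos, iterations=step, seed=seed)
        done += step
        elapsed = time.perf_counter() - start
        print(f"{done}/{total_iterations} iterations, {elapsed:.1f}s elapsed")
    return pos


if __name__ == "__main__":
    # Small random graph as a stand-in for the real edge list.
    G = nx.gnm_random_graph(200, 400, seed=1)
    pos = layout_with_progress(G)
    # Passing pos to nx.draw(G, pos=pos, ...) would then skip the layout step.
```

Timing the first chunk on the real graph gives an estimate of how long the full layout would take, which is usually enough to decide whether to keep waiting or switch tools.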

If you want to visualize the graph, you will need a tool that shows each step of the layout, so that you can adjust what is going on along the way.

Even though it is buggy, I would recommend Gephi for this. The only layout algorithm that works for large graphs is OpenOrd (available as a Gephi plug-in). Remember to hide the edges while the algorithm runs.

As a general-purpose library for graphs of your scale, I would recommend graph-tool. With a C++ backend and a Python interface, it is much faster than networkx. The drawing is also better.

Finally, when you reach the million-node scale, you can switch to large graph-analytics frameworks such as GraphLab Create or Apache GraphX.

Kirell answered Jan 14 '23 23:01