Using Python to generate a connection/network graph

I have a text file with about 8.5 million data points in the form:

Company 87178481
Company 893489
Company 2345788
[...]

I want to use Python to create a connection graph to see what the network between companies looks like. From the above sample, two companies would share an edge if the value in the second column is the same (clarification from/for Hooked).

I've been using the NetworkX package and have been able to generate a network for a few thousand points, but it's not making it through the full 8.5-million-line text file. I ran it and left for about 15 hours, and when I came back, the cursor in the shell was still blinking, but there was no output graph.

Is it safe to assume that it was still running? Is there a better/faster/easier approach to graph millions of points?

Jon asked Oct 25 '12 16:10


1 Answer

If you have a million or more data points, you'll need some way of looking at the broad picture. Depending on exactly what you are looking for, if you can assign a "distance" between companies (say, the number of connections apart), you can visualize relationships (or clustering) with a dendrogram.

SciPy does hierarchical clustering:

http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html#module-scipy.cluster.hierarchy

and has a function to turn them into dendrograms for visualization:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html#scipy.cluster.hierarchy.dendrogram
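As a rough sketch of how those two pieces fit together, here is a toy run on four made-up companies. The names and the distance matrix are assumptions for illustration only, not taken from your data. Note that a full pairwise distance matrix grows quadratically, so this would have to be run on a sample or on aggregated groups rather than on all 8.5 million rows.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical company names and a made-up symmetric distance matrix.
names = ["A", "B", "C", "D"]
D = np.array(
    [
        [0, 1, 3, 3],
        [1, 0, 3, 3],
        [3, 3, 0, 1],
        [3, 3, 1, 0],
    ],
    dtype=float,
)

# linkage() expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method="average")

# Cut the tree into at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True only computes the layout; drop it to draw with matplotlib.
tree = dendrogram(Z, labels=names, no_plot=True)
leaf_order = tree["ivl"]
```

Here A and B should land in one cluster and C and D in the other. The "average" linkage method is just one choice; which method suits you depends on how you define distance.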

An example of a shortest-path distance function, via NetworkX:

http://networkx.lanl.gov/reference/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html#networkx.algorithms.shortest_paths.generic.shortest_path
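A minimal sketch of that idea, assuming the edge rule from the question (two companies are connected when they share a value in the second column). The sample rows and the helper name build_graph are hypothetical. Linking every pair within a group is quadratic in the group size, so a very common value could blow up the edge count on the real file.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical sample rows in the question's "company value" layout.
rows = [("A", "100"), ("B", "100"), ("B", "200"), ("C", "200"), ("D", "300")]


def build_graph(rows):
    """Connect two companies whenever they share a second-column value."""
    by_value = defaultdict(set)
    for company, value in rows:
        by_value[value].add(company)

    G = nx.Graph()
    G.add_nodes_from(company for company, _ in rows)
    for companies in by_value.values():
        # Every pair of companies in the same group gets an edge.
        G.add_edges_from(combinations(sorted(companies), 2))
    return G


G = build_graph(rows)

# "Number of connections apart" between two companies.
dist_ac = nx.shortest_path_length(G, "A", "C")
```

In this toy data A and C are two hops apart (through B), while D shares no value with anyone and stays isolated, so there is no path to it.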

Ultimately you'll have to decide how you want to weight the distance between two companies (vertices) in your graph.
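To make that concrete, here is one possible weighting, sketched with the standard library only: count how many values two companies have in common and use the reciprocal as the distance. This is purely an assumed choice for illustration, and the sample rows are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical sample rows in the question's "company value" layout.
rows = [("A", "100"), ("B", "100"), ("A", "200"), ("B", "200"), ("C", "200")]

# Group companies by the value in the second column.
by_value = defaultdict(set)
for company, value in rows:
    by_value[value].add(company)

# Count how many values each pair of companies has in common.
shared = Counter()
for companies in by_value.values():
    for pair in combinations(sorted(companies), 2):
        shared[pair] += 1

# Assumed weighting: more shared values means a shorter distance.
distance = {pair: 1.0 / count for pair, count in shared.items()}
```

With these rows, A and B share two values and end up closer together than either is to C.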

Hooked answered Nov 09 '22 23:11