
Load nodes with attributes and edges from DataFrame to NetworkX

I am new to using Python for working with graphs, specifically NetworkX. Until now I have used Gephi, where the standard steps (though not the only possible ones) are:

  1. Load the node information from a table/spreadsheet; one of the columns should be the ID and the rest are metadata about the nodes (the nodes are people, so gender, groups, etc., normally used for coloring). For example:

    id;NormalizedName;Gender
    per1;Jesús;male
    per2;Abraham;male
    per3;Isaac;male
    per4;Jacob;male
    per5;Judá;male
    per6;Tamar;female
    ...
    
  2. Then load the edges, also from a table/spreadsheet, using the same node names as in the ID column of the nodes spreadsheet. It normally has four columns (Target, Source, Weight and Type):

    Target;Source;Weight;Type
    per1;per2;3;Undirected
    per3;per4;2;Undirected
    ...
    

These are the two DataFrames that I have and want to load in Python. From what I have read about NetworkX, it does not seem possible to load two tables (one for nodes, one for edges) into the same graph in a single step, and I am not sure what the best approach would be:

  1. Should I create a graph with only the node information from one DataFrame, and then add (append) the edges from the other DataFrame? If so, and since nx.from_pandas_dataframe() expects information about the edges, I guess I shouldn't use it to create the nodes. Should I just pass the information as lists?

  2. Should I create a graph with only the edge information from one DataFrame and then add the information from the other DataFrame to each node as attributes? Is there a better way of doing that than iterating over the DataFrame and the nodes?

José asked Mar 02 '17 14:03



2 Answers

Create the weighted graph from the edge table using nx.from_pandas_dataframe (renamed nx.from_pandas_edgelist in later NetworkX 2.x releases):

import networkx as nx
import pandas as pd

edges = pd.DataFrame({'source' : [0, 1],
                      'target' : [1, 2],
                      'weight' : [100, 50]})

nodes = pd.DataFrame({'node' : [0, 1, 2],
                      'name' : ['Foo', 'Bar', 'Baz'],
                      'gender' : ['M', 'F', 'M']})

G = nx.from_pandas_dataframe(edges, 'source', 'target', 'weight')

Then add the node attributes from dictionaries using set_node_attributes:

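# Index by node ID so each dictionary maps node ID -> attribute value,
# even when the DataFrame's row index does not match the node IDs.
nx.set_node_attributes(G, 'name', nodes.set_index('node')['name'].to_dict())
nx.set_node_attributes(G, 'gender', nodes.set_index('node')['gender'].to_dict())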

Or iterate over the graph to add the node attributes:

for i in sorted(G.nodes()):
    G.node[i]['name'] = nodes.name[i]
    G.node[i]['gender'] = nodes.gender[i]

Update:

As of nx 2.0 the argument order of nx.set_node_attributes has changed: (G, values, name=None)

Using the example from above:

nx.set_node_attributes(G, nodes.set_index('node')['gender'].to_dict(), 'gender')

And as of nx 2.4, G.node[] has been removed; use G.nodes[] instead.
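Putting the updates together, here is a minimal sketch of the same example, assuming a NetworkX version that provides nx.from_pandas_edgelist and the (G, values, name) argument order. The node_table variable is only an illustrative name and is not part of the original answer.

```python
import networkx as nx
import pandas as pd

edges = pd.DataFrame({'source': [0, 1],
                      'target': [1, 2],
                      'weight': [100, 50]})

nodes = pd.DataFrame({'node': [0, 1, 2],
                      'name': ['Foo', 'Bar', 'Baz'],
                      'gender': ['M', 'F', 'M']})

# Build the graph from the edge table, keeping 'weight' as an edge attribute.
G = nx.from_pandas_edgelist(edges, 'source', 'target', edge_attr='weight')

# Index the node table by node ID so each dict maps node ID -> value.
node_table = nodes.set_index('node')
nx.set_node_attributes(G, node_table['name'].to_dict(), 'name')
nx.set_node_attributes(G, node_table['gender'].to_dict(), 'gender')

# Attributes are read through G.nodes[...] and G.edges[...].
print(G.nodes[1])
print(G.edges[0, 1])
```

Indexing by the 'node' column first means the attribute dictionaries are keyed by node ID regardless of how the DataFrame rows happen to be numbered.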

harryscholes answered Sep 17 '22 04:09


Here's basically the same answer, updated with some details filled in. The setup is almost the same, but the nodes are identified by name rather than by integer index, to address @LancelotHolmes's comment and make the example more general:

import networkx as nx
import pandas as pd

linkData = pd.DataFrame({'source' : ['Amy', 'Bob'],
                  'target' : ['Bob', 'Cindy'],
                  'weight' : [100, 50]})

nodeData = pd.DataFrame({'name' : ['Amy', 'Bob', 'Cindy'],
                  'type' : ['Foo', 'Bar', 'Baz'],
                  'gender' : ['M', 'F', 'M']})

G = nx.from_pandas_edgelist(linkData, source='source', target='target',
                            edge_attr=True, create_using=nx.DiGraph())

Here the True argument (the edge_attr parameter) tells NetworkX to keep all the remaining columns of linkData as edge attributes. In this case I've made the graph a DiGraph, but if you don't need a directed graph you can pass another graph type as create_using, or omit it to get an undirected Graph.
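As a quick check of what ends up on the edges, the following sketch rebuilds the graph and prints each edge with its attribute dictionary. The second graph, H, is an illustrative addition showing the undirected default with an explicit list of columns.

```python
import networkx as nx
import pandas as pd

linkData = pd.DataFrame({'source': ['Amy', 'Bob'],
                         'target': ['Bob', 'Cindy'],
                         'weight': [100, 50]})

# edge_attr=True keeps every column other than source/target on the edges.
G = nx.from_pandas_edgelist(linkData, source='source', target='target',
                            edge_attr=True, create_using=nx.DiGraph())

for u, v, data in G.edges(data=True):
    print(u, v, data)

# Without create_using the result is an undirected Graph; edge_attr can
# also be a list of column names to keep only those columns.
H = nx.from_pandas_edgelist(linkData, 'source', 'target', edge_attr=['weight'])
print(H.has_edge('Bob', 'Amy'))
```

In the directed graph the edge exists only from Amy to Bob, while in the undirected graph H it can be looked up in either direction.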

Now, since you need to match nodeData to the nodes generated from linkData by name, set the index of the nodeData dataframe to the name column before converting it to a dictionary, so that NetworkX 2.x can load it as node attributes.

nx.set_node_attributes(G, nodeData.set_index('name').to_dict('index'))

This converts the whole nodeData dataframe into a dictionary keyed by node name, where each value is a dictionary of the remaining columns as key:value pairs (i.e., ordinary node attributes, with the name serving as the node identifier).
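One caveat, as far as I can tell: nx.set_node_attributes only touches nodes that are already in the graph, so people who appear in the node table but in no edge would be left out. The sketch below applies the same idea to the semicolon-separated tables from the question and uses G.add_nodes_from instead, which sets attributes on existing nodes and also adds the isolated ones. The io.StringIO objects are stand-ins for real files, and the sample rows are just the ones shown in the question.

```python
import io

import networkx as nx
import pandas as pd

# Stand-ins for the two spreadsheets in the question.
nodes_csv = io.StringIO(
    "id;NormalizedName;Gender\n"
    "per1;Jesús;male\n"
    "per2;Abraham;male\n"
    "per3;Isaac;male\n"
    "per4;Jacob;male\n"
    "per5;Judá;male\n"
    "per6;Tamar;female\n"
)
edges_csv = io.StringIO(
    "Target;Source;Weight;Type\n"
    "per1;per2;3;Undirected\n"
    "per3;per4;2;Undirected\n"
)

nodes = pd.read_csv(nodes_csv, sep=';')
edges = pd.read_csv(edges_csv, sep=';')

# Step 1: build the (undirected) graph from the edge table.
G = nx.from_pandas_edgelist(edges, source='Source', target='Target',
                            edge_attr=['Weight', 'Type'])

# Step 2: one attribute dict per node ID, e.g.
# {'per1': {'NormalizedName': 'Jesús', 'Gender': 'male'}, ...}
node_attrs = nodes.set_index('id').to_dict(orient='index')

# add_nodes_from accepts (node, attribute_dict) pairs: existing nodes get
# their attributes updated, and nodes without edges are added as well.
G.add_nodes_from(node_attrs.items())

print(G.number_of_nodes(), G.number_of_edges())
print(G.nodes['per6'])
```

Here per5 and per6 have no edges in the sample data, so they end up in the graph as isolated nodes with their attributes attached.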

Aaron Bramson answered Sep 18 '22 04:09