I have a pandas dataframe of the following structure:
d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I','X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)
I need to create a dict of dict to get data of all existing edges (indicated by nonzero values) between nodes:
{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}
I need it to create a network graph with the NetworkX library and run some calculations on it. Looping over every cell of the data frame would obviously work, but my data is quite large and that would be inefficient. I'm looking for a better way, possibly using vectorization and/or a list comprehension. I've tried a list comprehension but I'm stuck and cannot make it work. Can anyone suggest a more efficient way to do this?
You can do this by combining df.iterrows() with a dictionary comprehension. Although iterrows() is not truly vectorized, it is still reasonably efficient for this kind of task and cleaner than manual nested loops. For example, you could write:
edge_dictionary = {
    node: {attribute: {weight} for attribute, weight in attributes.items() if weight != 0}
    for node, attributes in df.iterrows()
}
If your DataFrame is very large and you're concerned about performance, another approach is to first convert it into a plain dictionary of dictionaries using df.to_dict(orient='index') and then filter out the zeros. That would look like this:
data_dictionary = df.to_dict(orient='index')
edge_dictionary = {
    node: {attribute: {weight} for attribute, weight in connections.items() if weight != 0}
    for node, connections in data_dictionary.items()
}
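Since the question asks about vectorization, here is a minimal sketch of one way to push the zero filtering into NumPy. It is not taken from the answers in this thread. It assumes the weights are integers (hence the int() call) and rebuilds the sample frame from the question so that it runs on its own. np.nonzero locates the nonzero cells in one pass, and only those cells are visited in Python:

```python
import numpy as np
import pandas as pd

# Sample data from the question
d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d).set_index('I')

values = df.to_numpy()
# Row and column positions of every nonzero cell, found in a single vectorized pass
row_positions, col_positions = np.nonzero(values)

edge_dictionary = {}
for r, c in zip(row_positions, col_positions):
    node = df.index[r]
    attribute = df.columns[c]
    # int() assumes integer weights; drop it if the weights are floats
    edge_dictionary.setdefault(node, {})[attribute] = {int(values[r, c])}

print(edge_dictionary)
```

One difference from the comprehensions above: a node whose row is entirely zero does not appear in the result at all, whereas the comprehensions give it an empty inner dictionary. Whether this is faster in practice probably depends on how sparse the data is, so it is worth timing on the real frame.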
It seems my version is similar to @VictorSbruev's, but his idea of converting everything to a dictionary first seems better.
I was thinking about using .apply(function, axis=1) to run code on every row and create a column with the inner dictionaries:
def convert(row):
    data = row.to_dict()
    # skip `0` and convert value to `set()`
    data = {key: {val} for key, val in data.items() if val != 0}
    return data
df['networkx'] = df.apply(convert, axis=1)
to get
A {'X': {1}, 'Z': {1}, 'W': {3}}
B {'Y': {1}, 'W': {2}}
C {'X': {3}, 'Y': {2}}
D {'X': {1}, 'Y': {1}}
Name: networkx, dtype: object
Then convert this column to a dictionary:
result = df['networkx'].to_dict()
which gives me the expected result:
{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}
Full working code, where I was testing different versions:
import pandas as pd
d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I','X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)
# for test
expected = {'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}
print(df)
def convert(row):
    #print(row)
    data = row.to_dict()
    #data = {row.name: {key:{val} for key, val in data.items() if val != 0}} # version 1
    data = {key:{val} for key, val in data.items() if val != 0} # version 2
    return data
df['networkx'] = df.apply(convert, axis=1)
print(df['networkx'])
#print(list(df['networkx'].items()))
#result = {name:item[name] for name,item in df['networkx'].items()} # for version 1
#result = {name:item for name,item in df['networkx'].items()} # for version 2
result = df['networkx'].to_dict() # for version 2
print('result :', result)
print('expected:', expected)
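Because the question is about efficiency, a rough timing harness may help decide between the variants in this thread. The sketch below wraps the three approaches (iterrows, to_dict, apply) in functions and times them with timeit. The function names, the random test frame, its size (2000 x 50) and the repeat count are all assumptions made for illustration, not measurements from the original posts:

```python
import timeit

import numpy as np
import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d).set_index('I')

def via_iterrows(frame):
    # Dictionary comprehension over df.iterrows()
    return {
        node: {attr: {weight} for attr, weight in row.items() if weight != 0}
        for node, row in frame.iterrows()
    }

def via_to_dict(frame):
    # Convert to plain dictionaries first, then filter out the zeros
    return {
        node: {attr: {weight} for attr, weight in row.items() if weight != 0}
        for node, row in frame.to_dict(orient='index').items()
    }

def via_apply(frame):
    # Row-wise apply, then convert the resulting Series to a dictionary
    def convert(row):
        return {key: {val} for key, val in row.to_dict().items() if val != 0}
    return frame.apply(convert, axis=1).to_dict()

# Hypothetical larger frame: random integer weights in {0, 1, 2}
rng = np.random.default_rng(0)
big = pd.DataFrame(
    rng.integers(0, 3, size=(2000, 50)),
    index=[f'n{i}' for i in range(2000)],
    columns=[f'c{j}' for j in range(50)],
)

for func in (via_iterrows, via_to_dict, via_apply):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f'{func.__name__}: {seconds / 3:.4f} s per call')
```

The absolute numbers depend on the machine and the pandas version, so treat them only as a relative comparison on data shaped like your own.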