I have got a dataframe like this:
    part part_parent
0  part1         NaN
1  part2       part1
2  part3       part2
3  part4       part3
4  part5       part2
I need to add an additional column hierarchy like this:
    part part_parent                hierarchy
0  part1         NaN                    part1
1  part2       part1             part1/part2/
2  part3       part2       part1/part2/part3/
3  part4       part3  part1/part2/part3/part4
4  part5       part2        part1/part2/part5
Dict to create input/output dataframes:
import pandas as pd
from numpy import nan

# Input dataframe
df1 = pd.DataFrame({'part': {0: 'part1', 1: 'part2', 2: 'part3', 3: 'part4', 4: 'part5'},
                    'part_parent': {0: nan, 1: 'part1', 2: 'part2', 3: 'part3', 4: 'part2'}})

# Expected output dataframe
df2 = pd.DataFrame({'part': {0: 'part1', 1: 'part2', 2: 'part3', 3: 'part4', 4: 'part5'},
                    'part_parent': {0: nan, 1: 'part1', 2: 'part2', 3: 'part3', 4: 'part2'},
                    'hierarchy': {0: 'part1',
                                  1: 'part1/part2/',
                                  2: 'part1/part2/part3/',
                                  3: 'part1/part2/part3/part4',
                                  4: 'part1/part2/part5'}})
NOTE: I've seen a couple of threads that use NetworkX to solve this kind of problem, but I haven't been able to get it to work.
Any help is appreciated.
To make a column the index, use the pandas set_index() method. To use a single column as the index, pass its name as a string to set_index(). For multi-level (hierarchical) indexing, pass a list of column names instead.
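As a minimal sketch of the two set_index() forms described above (the small dataframe and the qty column are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; "qty" is an illustrative column, not from the question.
df = pd.DataFrame({
    "part": ["part1", "part2", "part3"],
    "part_parent": [None, "part1", "part2"],
    "qty": [1, 2, 3],
})

# Single column as the index: pass the column name as a string.
single = df.set_index("part")

# Hierarchical (multi-level) index: pass a list of column names.
multi = df.set_index(["part_parent", "part"])

print(single)
print(multi)
```

set_index() returns a new DataFrame by default, so df itself keeps its original integer index.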
To convert a pandas MultiIndex to columns, use the DataFrame.reset_index() method. With the default drop=False, the index levels are kept as columns and the DataFrame gets a new integer index starting from zero.
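A short sketch of reset_index() on a two-level index, assuming made-up sample data (the qty column is illustrative only):

```python
import pandas as pd

# Hypothetical dataframe with a two-level MultiIndex.
df = pd.DataFrame(
    {"qty": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("part1", "part2"), ("part2", "part3"), ("part2", "part5")],
        names=["part_parent", "part"],
    ),
)

# Default drop=False: the index levels become ordinary columns.
flat = df.reset_index()

# drop=True: the index levels are discarded instead of being kept.
dropped = df.reset_index(drop=True)

print(flat)
print(dropped)
```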
Hierarchical indexing (MultiIndex) is a feature of pandas, a data analysis library for the Python programming language. The name pandas is derived from "panel data", an econometrics term for multi-dimensional data sets that include observations over multiple time periods.
To sort a DataFrame by the values in a single column, pass the column name to .sort_values(). By default, this returns a new DataFrame sorted in ascending order.
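For illustration, a small sketch of sort_values() on made-up data (the qty column is an assumption, not part of the question):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"part": ["part3", "part1", "part2"], "qty": [30, 10, 20]})

# Ascending by default; the original dataframe is left untouched.
ascending = df.sort_values("qty")

# Pass ascending=False to reverse the order.
descending = df.sort_values("qty", ascending=False)

print(ascending)
print(descending)
```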
Here is a solution using networkx. It treats nan as the root node and finds the shortest path from it to each node.
import networkx as nx

def find_path(net, source, target):
    # Adjust this as needed (in case multiple paths are present)
    # or error handling in case a path doesn't exist
    path = nx.shortest_path(net, source, target)
    # Drop the first element (the nan root) before joining
    return "/".join(list(path)[1:])

net = nx.from_pandas_edgelist(df1, "part", "part_parent")
df1["hierarchy"] = [find_path(net, nan, node) for node in df1["part"]]
    part part_parent                hierarchy
0  part1         NaN                    part1
1  part2       part1              part1/part2
2  part3       part2        part1/part2/part3
3  part4       part3  part1/part2/part3/part4
4  part5       part2        part1/part2/part5
The path formatting is tailored to this example. If you need more robust error handling or have to deal with multiple paths, the path finder will have to be adjusted.
Here is a recursive approach. It uses a Series that maps each part to its parent, and walks up from a given part until the parent is NaN. At that point it returns the hierarchy.
NB. This will not work if you have a circular network or undefined parents (the latter can easily be fixed if needed).
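import pandas as pd

# Series mapping each part to its parent
parents = df1.set_index('part')['part_parent']

def hierarchy(e):
    # Wrap a single part name in a list on the first call
    if not isinstance(e, list):
        return hierarchy([e])
    parent = parents[e[0]]
    # Reached the root: join the accumulated path
    if pd.isna(parent):
        return '/'.join(e)
    # Otherwise prepend the parent and keep walking up
    return hierarchy([parent]+e)

df2 = df1.copy()
df2['hierarchy'] = df1['part'].apply(hierarchy)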
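If circular references or undefined parents are a concern, one possible variation is an iterative walk with a set of visited parts. This is only a sketch under some assumptions: hierarchy_iter is a hypothetical helper name, a parent that never appears in the part column is treated as the top of the path, and a cycle raises ValueError.

```python
import pandas as pd
from numpy import nan

df1 = pd.DataFrame({
    "part": ["part1", "part2", "part3", "part4", "part5"],
    "part_parent": [nan, "part1", "part2", "part3", "part2"],
})

# Plain dict mapping each part to its parent (roots map to NaN).
parents = df1.set_index("part")["part_parent"].to_dict()

def hierarchy_iter(part, parents):
    """Hypothetical helper: build 'root/.../part' by walking up the parents."""
    path = [part]
    seen = {part}
    parent = parents.get(part)
    # Stop at NaN (a root) or when the parent lookup finds nothing.
    while parent is not None and not pd.isna(parent):
        if parent in seen:
            raise ValueError(f"circular reference at {parent!r}")
        path.append(parent)
        seen.add(parent)
        # An undefined parent has no entry, so .get() returns None and the walk ends.
        parent = parents.get(parent)
    return "/".join(reversed(path))

df1["hierarchy"] = [hierarchy_iter(p, parents) for p in df1["part"]]
print(df1)
```

Because the walk is a loop rather than recursion, deep hierarchies do not run into Python's recursion limit.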