I have a DataFrame like this:
    part part_parent
0  part1         NaN
1  part2       part1
2  part3       part2
3  part4       part3
4  part5       part2
I need to add an additional column hierarchy like this:
    part part_parent                hierarchy
0  part1         NaN                    part1
1  part2       part1              part1/part2
2  part3       part2        part1/part2/part3
3  part4       part3  part1/part2/part3/part4
4  part5       part2        part1/part2/part5
Dicts to create the input/output DataFrames:
import pandas as pd
from numpy import nan
df1 = pd.DataFrame({'part': {0: 'part1', 1: 'part2', 2: 'part3', 3: 'part4', 4: 'part5'},
 'part_parent': {0: nan, 1: 'part1', 2: 'part2', 3: 'part3', 4: 'part2'}})
df2 = pd.DataFrame({'part': {0: 'part1', 1: 'part2', 2: 'part3', 3: 'part4', 4: 'part5'},
 'part_parent': {0: nan, 1: 'part1', 2: 'part2', 3: 'part3', 4: 'part2'},
 'hierarchy': {0: 'part1',
  1: 'part1/part2',
  2: 'part1/part2/part3',
  3: 'part1/part2/part3/part4',
  4: 'part1/part2/part5'}})
NOTE: I've seen a couple of threads that use NetworkX to solve this kind of problem, but I haven't been able to get it working.
Any help is appreciated.
To make a column the index, use the pandas set_index() method. To make a single column the index, pass the column name as a string to set_index(). For multi-indexing (hierarchical indexing), pass a list of column names to set_index().
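As a minimal sketch of the two set_index() forms described above (the small frame and its `qty` column are made up for illustration and are not part of the question's data):

```python
import pandas as pd

# Hypothetical sample data; "qty" is an invented column for illustration
df = pd.DataFrame({
    "part": ["part1", "part2", "part3"],
    "part_parent": ["root", "part1", "part2"],
    "qty": [1, 2, 3],
})

# Single column as the index: pass the name as a string
single = df.set_index("part")

# Hierarchical index (MultiIndex): pass a list of column names
multi = df.set_index(["part_parent", "part"])

# Rows of a MultiIndex frame are addressed with a tuple of level values
qty_of_part2 = multi.loc[("part1", "part2"), "qty"]
```

Note that set_index() returns a new DataFrame by default, so `df` itself is left unchanged here.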
pandas MultiIndex to columns: use the pandas DataFrame.reset_index() method to convert the levels of a MultiIndex (multi-level index) to columns. The default drop=False keeps the index values as columns and gives the DataFrame a new integer index starting from zero.
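A short sketch of reset_index() on a two-level index may help here. The frame below is invented for illustration (the `qty` column is hypothetical); it only shows the difference between the default drop=False and drop=True:

```python
import pandas as pd

# Hypothetical frame with a two-level MultiIndex
multi = pd.DataFrame(
    {"qty": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("part1", "part2"), ("part2", "part3"), ("part2", "part5")],
        names=["part_parent", "part"],
    ),
)

# Default (drop=False): index levels become ordinary columns
flat = multi.reset_index()

# drop=True: index levels are discarded instead of kept as columns
dropped = multi.reset_index(drop=True)
```

In both cases the result carries a fresh integer index starting at zero.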
Hierarchical indexing (MultiIndex) is a feature of pandas, a software library for the Python programming language. The name pandas derives from "panel data", an econometrics term for data sets that include observations over multiple time periods.
To sort the DataFrame based on the values in a single column, use .sort_values(). By default, this returns a new DataFrame sorted in ascending order.
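For illustration, a small sketch of sort_values() on one column, using made-up data (the `qty` column is hypothetical):

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample data
df = pd.DataFrame({
    "part": ["part3", "part1", "part2"],
    "qty": [5, 9, 1],
})

# Ascending order is the default; a new DataFrame is returned
by_part = df.sort_values("part")

# Pass ascending=False for descending order
by_qty_desc = df.sort_values("qty", ascending=False)
```

The original `df` keeps its row order, since neither call uses inplace=True.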
Here is a solution using networkx. It treats NaN as the root node and finds the shortest path from that root to each node.
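import networkx as nx
from numpy import nan  # the NaN parent of the top-level part acts as the root node
def find_path(net, source, target):
    # Adjust this as needed (in case multiple paths are present)
    # or add error handling in case a path doesn't exist
    path = nx.shortest_path(net, source, target)
    # Drop the first node (the NaN root) before joining
    return "/".join(list(path)[1:])
# Undirected graph with one edge between each part and its parent
net = nx.from_pandas_edgelist(df1, "part", "part_parent")
df1["hierarchy"] = [find_path(net, nan, node) for node in df1["part"]]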
    part part_parent                hierarchy
0  part1         NaN                    part1
1  part2       part1              part1/part2
2  part3       part2        part1/part2/part3
3  part4       part3  part1/part2/part3/part4
4  part5       part2        part1/part2/part5
The path formatting is tailored to this example. If more robust error handling or support for multiple paths is needed, the path finder will have to be adjusted.
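One possible way to make the path finder more defensive is sketched below. This is not the code from the answer above: it swaps in a directed graph (parent to child) and replaces the NaN root with a string sentinel. The names `ROOT`, `"__root__"`, and `safe_path` are hypothetical, and the sample frame is rebuilt so the block runs on its own.

```python
import networkx as nx
import pandas as pd

ROOT = "__root__"  # hypothetical sentinel standing in for the NaN root

df = pd.DataFrame({
    "part": ["part1", "part2", "part3", "part4", "part5"],
    "part_parent": [None, "part1", "part2", "part3", "part2"],
})

# Replace missing parents with the sentinel, then build parent -> child edges
edges = df.assign(part_parent=df["part_parent"].fillna(ROOT))
net = nx.from_pandas_edgelist(
    edges, source="part_parent", target="part", create_using=nx.DiGraph
)

def safe_path(net, target, root=ROOT):
    """Return 'a/b/c' from the root to target, or None if no path exists."""
    try:
        path = nx.shortest_path(net, root, target)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # Unknown node, or a node not reachable from the root
        return None
    return "/".join(path[1:])  # skip the sentinel itself

df["hierarchy"] = [safe_path(net, part) for part in df["part"]]
```

Returning None for unreachable or unknown parts is just one choice; raising a custom error or returning the bare part name would work equally well, depending on how the result is consumed.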
Here is a recursive approach. It uses a Series that maps each element to its parent and walks back up the chain, one parent at a time, until it finds NaN. At that point it returns the hierarchy.
NB. This will not work if you have a circular network or undefined parents (the latter can easily be fixed if needed).
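import pandas as pd
# Series mapping each part to its parent
parents = df1.set_index('part')['part_parent']
def hierarchy(e):
    # Wrap a single part name in a list on the first call
    if not isinstance(e, list):
        return hierarchy([e])
    # Look up the parent of the current topmost ancestor
    parent = parents[e[0]]
    if pd.isna(parent):
        # Reached the root: join the accumulated path
        return '/'.join(e)
    return hierarchy([parent]+e)
df2 = df1.copy()
df2['hierarchy'] = df1['part'].apply(hierarchy)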
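To address the two caveats in the NB above, here is a hedged sketch of an iterative variant that detects cycles and tolerates undefined parents. It is not the answerer's code. The function name `safe_hierarchy`, the extra `part6`/`ghost` row, and the decision to treat an undefined parent as a top-level node are all assumptions made for illustration.

```python
import pandas as pd

# Sample data plus one hypothetical row whose parent ("ghost") is undefined
df = pd.DataFrame({
    "part": ["part1", "part2", "part3", "part4", "part5", "part6"],
    "part_parent": [None, "part1", "part2", "part3", "part2", "ghost"],
})
parents = df.set_index("part")["part_parent"].to_dict()

def safe_hierarchy(part, parents=parents):
    """Walk up the parent chain iteratively and return 'root/.../part'."""
    path = [part]
    seen = {part}
    parent = parents.get(part)
    while not pd.isna(parent):
        if parent in seen:
            # The chain loops back on itself: refuse to recurse forever
            raise ValueError(f"cycle detected at {parent!r}")
        path.append(parent)
        seen.add(parent)
        # .get returns None for an undefined parent, which ends the walk
        parent = parents.get(parent)
    return "/".join(reversed(path))

df["hierarchy"] = df["part"].apply(safe_hierarchy)
```

With this assumption, `part6` resolves to `ghost/part6`. If undefined parents should instead be reported, the `.get` lookup can be replaced with a plain `parents[parent]` wrapped in its own error handling.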