
Create hierarchy column in pandas

I have a dataframe like this:

    part part_parent
0  part1         NaN
1  part2       part1
2  part3       part2
3  part4       part3
4  part5       part2

I need to add a column, hierarchy, that holds the full path from the root to each part, like this:

    part part_parent                hierarchy
0  part1         NaN                    part1
1  part2       part1             part1/part2/
2  part3       part2       part1/part2/part3/
3  part4       part3  part1/part2/part3/part4
4  part5       part2        part1/part2/part5

Dict to create input/output dataframes:

import pandas as pd
from numpy import nan

df1 = pd.DataFrame({'part': {0: 'part1', 1: 'part2', 2: 'part3', 3: 'part4', 4: 'part5'},
 'part_parent': {0: nan, 1: 'part1', 2: 'part2', 3: 'part3', 4: 'part2'}})


df2 = pd.DataFrame({'part': {0: 'part1', 1: 'part2', 2: 'part3', 3: 'part4', 4: 'part5'},
 'part_parent': {0: nan, 1: 'part1', 2: 'part2', 3: 'part3', 4: 'part2'},
 'hierarchy': {0: 'part1',
  1: 'part1/part2/',
  2: 'part1/part2/part3/',
  3: 'part1/part2/part3/part4',
  4: 'part1/part2/part5'}})

NOTE: I've seen a couple of threads that use NetworkX for this kind of problem, but I haven't been able to get it working.

Any help is appreciated.

Shubham Sharma asked Jul 26 '21 13:07


2 Answers

Here is a solution using networkx. It treats nan as the root node, and finds the shortest path to each node based on that.

import networkx as nx
from numpy import nan  # same nan object used to build df1

def find_path(net, source, target):
    # Adjust this as needed (in case multiple paths are present)
    # or error handling in case a path doesn't exist
    path = nx.shortest_path(net, source, target)
    return "/".join(list(path)[1:])

net = nx.from_pandas_edgelist(df1, "part", "part_parent")
df1["hierarchy"] = [find_path(net, nan, node) for node in df1["part"]]

    part part_parent                hierarchy
0  part1         NaN                    part1
1  part2       part1              part1/part2
2  part3       part2        part1/part2/part3
3  part4       part3  part1/part2/part3/part4
4  part5       part2        part1/part2/part5

The path formatting is contrived for this example. If you need more robust error handling, or have to deal with multiple possible paths, the path finder will have to be adjusted.
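As one way to make the adjustment mentioned above, here is a minimal sketch that uses a directed graph and returns None when a part cannot be reached from the root. The ROOT sentinel, the edge list, and the extra "ghost"/"part9" edge are assumptions for illustration and are not part of the original answer. A string sentinel is used instead of NaN because looking up NaN as a node only works when it is the very same object (NaN never compares equal to itself).

```python
import networkx as nx

ROOT = "__root__"  # hypothetical sentinel standing in for the missing (NaN) parent


def find_path(graph, target, root=ROOT, sep="/"):
    """Return the root-to-target path as a string, or None if unreachable."""
    try:
        path = nx.shortest_path(graph, root, target)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # No route from the root, or the node is not in the graph at all.
        return None
    return sep.join(path[1:])  # drop the sentinel root


# Edges point from parent to child; the data mirrors df1 from the question.
edges = [
    (ROOT, "part1"),
    ("part1", "part2"),
    ("part2", "part3"),
    ("part3", "part4"),
    ("part2", "part5"),
    ("ghost", "part9"),  # assumed orphan branch, not connected to the root
]

graph = nx.DiGraph()
graph.add_edges_from(edges)

paths = {part: find_path(graph, part) for part in ["part1", "part4", "part5", "part9"]}
```

With the dataframe from the question, the edges could probably be built with zip(df1["part_parent"].fillna(ROOT), df1["part"]) instead of being typed out by hand.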

user3483203 answered Oct 23 '22 17:10


Here is a recursive approach. It builds a Series that maps each part to its parent, then walks up from a given part, prepending each parent to the path, until it reaches NaN (the root). At that point it returns the joined hierarchy.

NB. This will not work if you have a circular network or undefined parents (the latter can easily be fixed if needed).

import pandas as pd

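# Series indexed by part, holding each part's parent (NaN for the root)
parents = df1.set_index('part')['part_parent']

def hierarchy(e):
    # First call: wrap the single part in a list that accumulates the path
    if not isinstance(e, list):
        return hierarchy([e])
    # e[0] is the highest ancestor found so far; look up its parent
    parent = parents[e[0]]
    if pd.isna(parent):
        # Reached the root: join the path from root to leaf
        return '/'.join(e)
    # Otherwise prepend the parent and keep walking up
    return hierarchy([parent]+e)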

df2 = df1.copy()
df2['hierarchy'] = df1['part'].apply(hierarchy)
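
As a sketch of how the two limitations in the NB could be handled, the version below walks up iteratively, raises an error on a cycle, and treats a parent that has no row of its own as a root. It works on a plain dict so it runs without pandas. The names is_missing and build_hierarchy, and the choice to treat undefined parents as roots, are assumptions for illustration rather than part of the original answer.

```python
import math


def is_missing(value):
    """True for None or a float NaN (how pandas marks a missing parent)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_hierarchy(part, parents, sep="/"):
    """Walk from `part` up to its root and return the joined path.

    `parents` is a dict mapping each part to its parent.
    A part that is absent from `parents` is treated as a root (assumption).
    Raises ValueError if the parent links form a cycle.
    """
    path = [part]
    seen = {part}
    current = part
    while True:
        parent = parents.get(current)  # None when `current` has no entry
        if is_missing(parent):
            break
        if parent in seen:
            raise ValueError(f"cycle detected at {parent!r}")
        path.append(parent)
        seen.add(parent)
        current = parent
    # The path was collected leaf-to-root, so reverse it before joining.
    return sep.join(reversed(path))


# Same parent links as df1 in the question, written as a plain dict.
parents = {
    "part1": float("nan"),
    "part2": "part1",
    "part3": "part2",
    "part4": "part3",
    "part5": "part2",
}
result = {part: build_hierarchy(part, parents) for part in parents}
```

With the dataframe, the dict could be obtained with df1.set_index('part')['part_parent'].to_dict(), and the column filled with df1['part'].map(lambda p: build_hierarchy(p, parents)).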
mozway answered Oct 23 '22 17:10