Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas hierarchical sort

I have a dataframe of categories and amounts. Categories can be nested into sub categories an infinite levels using a colon separated string. I wish to sort it by descending amount. But in hierarchical type fashion like shown.

How I need it sorted

CATEGORY                            AMOUNT
Transport                           5000
Transport : Car                     4900
Transport : Train                   100
Household                           1100
Household : Utilities               600
Household : Utilities : Water       400
Household : Utilities : Electric    200
Household : Cleaning                100
Household : Cleaning : Bathroom     75
Household : Cleaning : Kitchen      25
Household : Rent                    400
Living                              250
Living : Other                      150
Living : Food                       100

EDIT: The data frame:

pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100]
})

Note: this is the order I want it. It may be in any arbitrary order before the sort.

EDIT2: If anyone looking for a similar solution I posted the one I settled on here: How to sort dataframe in pandas by value in hierarchical category structure

like image 902
Pierce Avatar asked Nov 16 '19 08:11

Pierce


2 Answers

One way could be to first str.split the category column.

df_ = df['category'].str.split(' : ', expand=True)
print (df_.head())
           0          1     2
0  Transport       None  None
1  Transport        Car  None
2  Transport      Train  None
3  Household       None  None
4  Household  Utilities  None

Then get the column amount and what you want is to get the maximum amount per group based on:

  • the first column alone,
  • then the first and the second columns
  • then the first-second and third columns, ...

You can do this with groupby.transform with max, and you concat each column created.

s = df['amount']
l_cols = list(df_.columns)
dfa = pd.concat([s.groupby([df_[col] for col in range(0, lv+1)]).transform('max')
                  for lv in l_cols], keys=l_cols, axis=1)
print (dfa)
       0       1      2
0   5000     NaN    NaN
1   5000  4900.0    NaN
2   5000   100.0    NaN
3   1100     NaN    NaN
4   1100   600.0    NaN
5   1100   600.0  400.0
6   1100   600.0  200.0
7   1100   100.0    NaN
8   1100   100.0   75.0
9   1100   100.0   25.0
10  1100   400.0    NaN
11   250     NaN    NaN
12   250   150.0    NaN
13   250   100.0    NaN

Now you just need to sort_values on all columns in the right order on first 0, then 1, then 2..., get the index and use loc to order df in the expected way

dfa = dfa.sort_values(l_cols, na_position='first', ascending=False)
dfs = df.loc[dfa.index] #here you can reassign to df directly
print (dfs)
                            category  amount
0                          Transport    5000
1                    Transport : Car    4900
2                  Transport : Train     100
3                          Household    1100
4              Household : Utilities     600
5      Household : Utilities : Water     400
6   Household : Utilities : Electric     200
10                  Household : Rent     400 #here is the one difference with this data
7               Household : Cleaning     100
8    Household : Cleaning : Bathroom      75
9     Household : Cleaning : Kitchen      25
11                            Living     250
12                    Living : Other     150
13                     Living : Food     100
like image 170
Ben.T Avatar answered Sep 21 '22 21:09

Ben.T


I packaged @Ben. T's answer into a more generic function, hopefully this is clearer to read!

EDIT: I have made changes to the function to group by columns in order rather than one by one to address potential issues noted by @Ben. T in the comments.

import pandas as pd

def category_sort_df(df, sep, category_col, numeric_col, ascending=False):
    '''Sorts dataframe by nested categories using `sep` as the delimiter for `category_col`.
    Sorts numeric columns in descending order by default.

    Returns a copy.'''
    df = df.copy()
    try:
        to_sort = pd.to_numeric(df[numeric_col])
    except ValueError:
        print(f'Column `{numeric_col}` is not numeric!')
        raise
    categories = df[category_col].str.split(sep, expand=True)
    # Strips any white space before and after sep
    categories = categories.apply(lambda x: x.str.split().str[0], axis=1)
    levels = list(categories.columns)
    to_concat = []
    for level in levels:
        # Group by columns in order rather than one at a time
        level_by = [df_[col] for col in range(0, level+1)]
        gb = to_sort.groupby(level_by)
        to_concat.append(gb.transform('max'))
    dfa = pd.concat(to_concat, keys=levels, axis=1)
    ixs = dfa.sort_values(levels, na_position='first', ascending=False).index
    df = df.loc[ixs].copy()
    return df

Using Python 3.7.3, pandas 0.24.2

like image 27
Nick C. Avatar answered Sep 20 '22 21:09

Nick C.