Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Groupwise sorting in pandas

I want to sort an array within the group boundaries defined in another array. The groups are not presorted in any way and need to remain unchanged after the sorting. In numpy terms it would look like this:

import numpy as np

def groupwise_sort(group_idx, a, reverse=False):
    sortidx = np.lexsort((-a if reverse else a, group_idx))
    # Reverse sorting back to into grouped order, but preserving groupwise sorting
    revidx = np.argsort(np.argsort(group_idx, kind='mergesort'), kind='mergesort')
    return a[sortidx][revidx]

group_idx =   np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a =           np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
groupwise_sort(group_idx, a)
# >>>            array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(group_idx, a, reverse=True)
# >>>            array([3, 7, 1, 5, 4, 9, 2, 5, 1])

How can I do the same with pandas? I saw df.groupby() and df.sort_values(), though I couldn't find a straight forward way to achieve the same sorting. And a fast one, if possible.

like image 236
Michael Avatar asked Apr 22 '18 16:04

Michael


People also ask

How do I sort by Groupby on pandas?

To group Pandas dataframe, we use groupby(). To sort grouped dataframe in ascending or descending order, use sort_values(). The size() method is used to get the dataframe size.

How do you sort a DataFrame in Groupby object?

Sort within Groups of groupby() Result in DataFrame By using DataFrame. sort_values() , you can sort DataFrame in ascending or descending order, before you use this first group the DataFrame rows by using DataFrame. groupby() method. Note that groupby preserves the order of rows within each group.

How do you sort pandas from lowest to highest?

Sort a pandas DataFrame by the values of one or more columns. Use the ascending parameter to change the sort order. Sort a DataFrame by its index using . sort_index()

How do I sort a Groupby pandas descending order?

To group Pandas dataframe, we use groupby(). To sort grouped dataframe in descending order, use sort_values(). The size() method is used to get the dataframe size.


Video Answer


1 Answers

Let us first set the stage:

import pandas as pd
import numpy as np

group_idx =   np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a =           np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])

df = pd.DataFrame({'group': group_idx, 'values': a})
df
#   group  values
#0      3       3
#1      2       2
#2      3       1
#3      2       7
#4      2       4
#5      1       5
#6      2       5
#7      1       9
#8      1       1

To get a dataframe sorted by group and values (within groups):

df.sort_values(["group", "values"])

#   group  values
#8      1       1
#5      1       5
#7      1       9
#1      2       2
#4      2       4
#6      2       5
#3      2       7
#2      3       1
#0      3       3

To sort the values in descending order, use ascending = False. To apply different orders to different columns, you can supply a list:

df.sort_values(["group", "values"], ascending = [True, False])

#   group  values
#7      1       9
#5      1       5
#8      1       1
#3      2       7
#6      2       5
#4      2       4
#1      2       2
#0      3       3
#2      3       1

Here, groups are sorted in ascending order, and the values within each group are sorted in descending order.

To only sort values for contiguous rows belonging to the same group, create a new group indicator:

(I keep this in here for reference since it might be helpful for others. I wrote this in an earlier version before the OP clarified his question in the comments.)

df['new_grp'] = (df.group.diff(1) != 0).astype('int').cumsum()
df
#   group  values  new_grp
#0      3       3        1
#1      2       2        2
#2      3       1        3
#3      2       7        4
#4      2       4        4
#5      1       5        5
#6      2       5        6
#7      1       9        7
#8      1       1        7

We can then easily sort with new_grp instead of group, leaving the original order of groups untouched.

Ordering within groups but keeping the group-specifing row-positions:

To sort the elements of each group but keep the group-specific positions in the dataframe, we need to keep track of the original row numbers. For instance, the following will do the trick:

# First, create an indicator for the original row-number:

df["ind"] = range(len(df))

# Now, sort the dataframe as before
df_sorted = df.sort_values(["group", "values"])

# sort the original row-numbers within each group
newindex = df.groupby("group").apply(lambda x: x.sort_values(["ind"]))["ind"].values

# assign the sorted row-numbers to the sorted dataframe
df_sorted["ind"] = newindex

# Sort based on the row-numbers:
sorted_asc = df_sorted.sort_values("ind")

# compare the resulting order of values with your desired output:
np.array(sorted_asc["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])

This is easier to test and profile when written up in a function, so let's do that:

def sort_my_frame(frame, groupcol = "group", valcol = "values", asc = True):

    frame["ind"] = range(len(frame))
    frame_sorted = frame.sort_values([groupcol, valcol], ascending = [True, asc])
    ind_sorted = frame.groupby(groupcol).apply(lambda x: x.sort_values(["ind"]))["ind"].values
    frame_sorted["ind"] = ind_sorted
    frame_sorted = frame_sorted.sort_values(["ind"])

    return(frame_sorted.drop(columns = "ind"))

np.array(sort_my_frame(df, "group", "values", asc = True)["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
np.array(sort_my_frame(df, "group", "values", asc = False)["values"])
# array([3, 7, 1, 5, 4, 9, 2, 5, 1])

Note that the latter results match your desired outcome.

I am sure this can be written up in a more succinct way. For instance, if the index of your dataframe is already ordered, you can use that one instead of the indicator ind I create (i.e., following @DJK's comment, we can use sort_index instead of sort_values and avoid assigning an additional column). In any case, the above highlights one possible solution and how to approach it. An alternative would be to use your numpy functions and wrap the output around a pd.DataFrame.

like image 169
coffeinjunky Avatar answered Oct 23 '22 20:10

coffeinjunky