I want to sort an array within the group boundaries defined in another array. The groups are not presorted in any way and need to remain unchanged after the sorting. In numpy
terms it would look like this:
import numpy as np
def groupwise_sort(group_idx, a, reverse=False):
sortidx = np.lexsort((-a if reverse else a, group_idx))
# Reverse sorting back to into grouped order, but preserving groupwise sorting
revidx = np.argsort(np.argsort(group_idx, kind='mergesort'), kind='mergesort')
return a[sortidx][revidx]
group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
groupwise_sort(group_idx, a)
# >>> array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(group_idx, a, reverse=True)
# >>> array([3, 7, 1, 5, 4, 9, 2, 5, 1])
How can I do the same with pandas
? I saw df.groupby()
and df.sort_values()
, though I couldn't find a straight forward way to achieve the same sorting. And a fast one, if possible.
To group Pandas dataframe, we use groupby(). To sort grouped dataframe in ascending or descending order, use sort_values(). The size() method is used to get the dataframe size.
Sort within Groups of groupby() Result in DataFrame By using DataFrame. sort_values() , you can sort DataFrame in ascending or descending order, before you use this first group the DataFrame rows by using DataFrame. groupby() method. Note that groupby preserves the order of rows within each group.
Sort a pandas DataFrame by the values of one or more columns. Use the ascending parameter to change the sort order. Sort a DataFrame by its index using . sort_index()
To group Pandas dataframe, we use groupby(). To sort grouped dataframe in descending order, use sort_values(). The size() method is used to get the dataframe size.
Let us first set the stage:
import pandas as pd
import numpy as np
group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
df = pd.DataFrame({'group': group_idx, 'values': a})
df
# group values
#0 3 3
#1 2 2
#2 3 1
#3 2 7
#4 2 4
#5 1 5
#6 2 5
#7 1 9
#8 1 1
df.sort_values(["group", "values"])
# group values
#8 1 1
#5 1 5
#7 1 9
#1 2 2
#4 2 4
#6 2 5
#3 2 7
#2 3 1
#0 3 3
To sort the values in descending order, use ascending = False
. To apply different orders to different columns, you can supply a list:
df.sort_values(["group", "values"], ascending = [True, False])
# group values
#7 1 9
#5 1 5
#8 1 1
#3 2 7
#6 2 5
#4 2 4
#1 2 2
#0 3 3
#2 3 1
Here, groups are sorted in ascending order, and the values within each group are sorted in descending order.
(I keep this in here for reference since it might be helpful for others. I wrote this in an earlier version before the OP clarified his question in the comments.)
df['new_grp'] = (df.group.diff(1) != 0).astype('int').cumsum()
df
# group values new_grp
#0 3 3 1
#1 2 2 2
#2 3 1 3
#3 2 7 4
#4 2 4 4
#5 1 5 5
#6 2 5 6
#7 1 9 7
#8 1 1 7
We can then easily sort with new_grp
instead of group
, leaving the original order of groups untouched.
To sort the elements of each group but keep the group-specific positions in the dataframe, we need to keep track of the original row numbers. For instance, the following will do the trick:
# First, create an indicator for the original row-number:
df["ind"] = range(len(df))
# Now, sort the dataframe as before
df_sorted = df.sort_values(["group", "values"])
# sort the original row-numbers within each group
newindex = df.groupby("group").apply(lambda x: x.sort_values(["ind"]))["ind"].values
# assign the sorted row-numbers to the sorted dataframe
df_sorted["ind"] = newindex
# Sort based on the row-numbers:
sorted_asc = df_sorted.sort_values("ind")
# compare the resulting order of values with your desired output:
np.array(sorted_asc["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
This is easier to test and profile when written up in a function, so let's do that:
def sort_my_frame(frame, groupcol = "group", valcol = "values", asc = True):
frame["ind"] = range(len(frame))
frame_sorted = frame.sort_values([groupcol, valcol], ascending = [True, asc])
ind_sorted = frame.groupby(groupcol).apply(lambda x: x.sort_values(["ind"]))["ind"].values
frame_sorted["ind"] = ind_sorted
frame_sorted = frame_sorted.sort_values(["ind"])
return(frame_sorted.drop(columns = "ind"))
np.array(sort_my_frame(df, "group", "values", asc = True)["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
np.array(sort_my_frame(df, "group", "values", asc = False)["values"])
# array([3, 7, 1, 5, 4, 9, 2, 5, 1])
Note that the latter results match your desired outcome.
I am sure this can be written up in a more succinct way. For instance, if the index of your dataframe
is already ordered, you can use that one instead of the indicator ind
I create (i.e., following @DJK's comment, we can use sort_index
instead of sort_values
and avoid assigning an additional column). In any case, the above highlights one possible solution and how to approach it. An alternative would be to use your numpy
functions and wrap the output around a pd.DataFrame
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With