Consider the dataframe df
df = pd.DataFrame(dict(
A=list('XXYYXXYY'),
B=range(8, 0, -1)
))
print(df)
A B
0 X 8
1 X 7
2 Y 6
3 Y 5
4 X 4
5 X 3
6 Y 2
7 Y 1
With the 'X'
group defined by column 'A'
, I want to sort [8, 7, 4, 3]
to the expected [3, 4, 7, 8]
. However, I want to leave those rows where they are.
A B
5 X 3 <-- Notice all X are in same positions
4 X 4 <-- However, `[3, 4, 7, 8]` have shifted
7 Y 1
6 Y 2
1 X 7 <--
0 X 8 <--
3 Y 5
2 Y 6
You can use transform
to get back your new desired index order, then use reindex
to reorder your DataFrame:
# Use transform to return the new ordered index values.
new_idx = df.groupby('A')['B'].transform(lambda grp: grp.sort_values().index)
# Reindex.
df = df.reindex(new_idx.rename(None))
You could combine the two lines above into one long line, if so desired.
The resulting output:
A B
5 X 3
4 X 4
7 Y 1
6 Y 2
1 X 7
0 X 8
3 Y 5
2 Y 6
Note that if you don't care about maintaing your old index, you can directly reassign from the transform
:
df['B'] = df.groupby('A')['B'].transform(lambda grp: grp.sort_values())
Which yields:
A B
0 X 3
1 X 4
2 Y 1
3 Y 2
4 X 7
5 X 8
6 Y 5
7 Y 6
The only way I figured how to solve this efficiently was to sort twice and unwind once.
v = df.values
# argsort just first column with kind='mergesort' to preserve subgroup order
a1 = v[:, 0].argsort(kind='mergesort')
# Fill in an un-sort array to unwind the `a1` argsort
a_ = np.empty_like(a1)
a_[a1] = np.arange(len(a1))
# argsort by both columns... not exactly what I want, yet.
a2 = np.lexsort(v.T[::-1])
# Sort with `a2` then unwind the first layer with `a_`
pd.DataFrame(v[a2][a_], df.index[a2][a_], df.columns)
A B
5 X 3
4 X 4
7 Y 1
6 Y 2
1 X 7
0 X 8
3 Y 5
2 Y 6
Testing
Code
def np_intra_sort(df):
v = df.values
a1 = v[:, 0].argsort(kind='mergesort')
a_ = np.empty_like(a1)
a_[a1] = np.arange(len(a1))
a2 = np.lexsort(v.T[::-1])
return pd.DataFrame(v[a2][a_], df.index[a2][a_], df.columns)
def pd_intra_sort(df):
def sub_sort(x):
return x.sort_values().index
idx = df.groupby('A').B.transform(sub_sort).values
return df.reindex(idx)
Small data
Large data
df = pd.DataFrame(dict(
A=list('XXYYXXYY') * 10000,
B=range(8 * 10000, 0, -1)
))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With