I want to sort the first column according to the internal algorithm, and then sort the second column according to the custom sorting method
The test data is as follows:
A B
Ankang Shaanxi Ankang Southeast
Baoding Anguo Baoding Anguo Northeast
Baoding Anguo Baoding Anguo Southeast
Changsha Hunan Changsha Hunan Bright
Ankang Shaanxi Ankang Northeast
Baoding Anguo Baoding Anguo Southwest
Baoding Anguo Baoding Anguo Upper
Ankang Shaanxi Ankang Southwest
Luoyang Henan Luoyang Henan Upper
Baoding Anguo Baoding Anguo Northwest
Changsha Hunan Changsha Hunan Lower
Ankang Shaanxi Ankang Southwest Upper
Ankang Shaanxi Ankang Northwest
I hope to be able to arrange it as shown below
The first column is sorted together using pandas' built-in string sorting algorithm, and then the second column is sorted using the custom order algorithm of northeast, southeast, northwest, southwest,upper.
I used pandas' sort_values() method to sort. I had no problem sorting a single column, but it always failed when I tried to sort two columns together.
import pandas as pd
data={'A':['Ankang Shaanxi','Baoding Anguo','Baoding Anguo','Changsha Hunan','Ankang Shaanxi',
'Baoding Anguo','Baoding Anguo','Ankang Shaanxi','Luoyang Henan','Baoding Anguo',
'Changsha Hunan','Ankang Shaanxi','Ankang Shaanxi'],
'B':['Ankang Southeast','Baoding Anguo Northeast','Baoding Anguo Southeast','Changsha Hunan Bright','Ankang Northeast','Baoding Anguo Southwest','Baoding Anguo Upper','Ankang Southwest','Luoyang Henan Upper','Baoding Anguo Northwest','Changsha Hunan Lower','Ankang Southwest Upper','Ankang Northwest']}
df=pd.DataFrame(data)
def sort_fun(x):
return x.split()[-1]
df['sort_value']=df['B'].apply(sort_fun)
sort_dicts={'Northeast':0,'Southeast':1,'Northwest':2,'Southwest':3,'Upper':4}
df.sort_values(by=['A','sort_value'],key=lambda x :x.map(sort_dicts))
I referred to it Pandas: How to custom-sort on multiple columns?
A B
Ankang Shaanxi Ankang Northeast
Ankang Shaanxi Ankang Southeast
Ankang Shaanxi Ankang Northwest
Ankang Shaanxi Ankang Southwest
Ankang Shaanxi Ankang Southwest Upper
Baoding Anguo Baoding Anguo Northeast
Baoding Anguo Baoding Anguo Southeast
Baoding Anguo Baoding Anguo Northwest
Baoding Anguo Baoding Anguo Southwest
Baoding Anguo Baoding Anguo Upper
Changsha Hunan Changsha Hunan Bright
Changsha Hunan Changsha Hunan Lower
Luoyang Henan Luoyang Henan Upper
The basic logic you can use for column 'B':
Series.str.split + access str[-1] + Series.mapdf['B'].str.split().str[-1].map(sort_dicts)
0 1.0
1 0.0
2 1.0
3 NaN
4 0.0
5 3.0
6 4.0
7 3.0
8 4.0
9 2.0
10 NaN
11 4.0
12 2.0
Name: B, dtype: float64
Couple of ways to sort using this logic:
Option 1
Chain calls to df.sort_values:
# note 'B' first
def sort_fun(s):
return s.str.split().str[-1].map(sort_dicts)
out = (df.sort_values('B', key=sort_fun)
.sort_values('A', ignore_index=True)
)
Option 2
Adjust sort_fun to only affect col 'B':
def sort_fun2(s, name):
if s.name == name: # for 'B'
return s.str.split().str[-1].map(sort_dicts)
return s
out2 = df.sort_values(['A', 'B'], key=lambda x: sort_fun2(x, 'B'),
ignore_index=True)
Indeed, your original approach also applied the function passed to key to df['A'] (i.e., df['A'].map(sort_dicts)), leading to a series with NaN values to "sort".
Option 3
Use np.lexsort as suggested by @mozway in the linked post:
# again: note 'B' goes first
import numpy as np
sort = np.lexsort((df['B'].str.split().str[-1].map(sort_dicts),
df['A']))
out3 = df.iloc[sort].reset_index(drop=True)
Output
out
A B
0 Ankang Shaanxi Ankang Northeast
1 Ankang Shaanxi Ankang Southeast
2 Ankang Shaanxi Ankang Northwest
3 Ankang Shaanxi Ankang Southwest
4 Ankang Shaanxi Ankang Southwest Upper
5 Baoding Anguo Baoding Anguo Northeast
6 Baoding Anguo Baoding Anguo Southeast
7 Baoding Anguo Baoding Anguo Northwest
8 Baoding Anguo Baoding Anguo Southwest
9 Baoding Anguo Baoding Anguo Upper
10 Changsha Hunan Changsha Hunan Bright
11 Changsha Hunan Changsha Hunan Lower
12 Luoyang Henan Luoyang Henan Upper
Equality check with desired output:
data2 = {'A': ['Ankang Shaanxi', 'Ankang Shaanxi', 'Ankang Shaanxi',
'Ankang Shaanxi', 'Ankang Shaanxi', 'Baoding Anguo',
'Baoding Anguo', 'Baoding Anguo', 'Baoding Anguo',
'Baoding Anguo', 'Changsha Hunan', 'Changsha Hunan',
'Luoyang Henan'],
'B': ['Ankang Northeast', 'Ankang Southeast', 'Ankang Northwest',
'Ankang Southwest', 'Ankang Southwest Upper',
'Baoding Anguo Northeast', 'Baoding Anguo Southeast',
'Baoding Anguo Northwest', 'Baoding Anguo Southwest',
'Baoding Anguo Upper', 'Changsha Hunan Bright',
'Changsha Hunan Lower', 'Luoyang Henan Upper']}
desired = pd.DataFrame(data2)
all(df.equals(desired) for df in [out, out2, out3])
# True
You can assign an new column based on the extract+map:
out = (df
.assign(sort_value=df['B'].str.extract(r'(\w+)$', expand=False).map(sort_dicts))
.sort_values(by=['A', 'sort_value'])
.drop(columns='sort_value')
)
NB. df['B'].str.extract(r'(\w+)$', expand=False) is equivalent to df['B'].str.split().str[-1] but usually faster.
In such cases, having multiple conditions with different sorting keys, I prefer to use numpy.lexsort+iloc, which avoids the need to assign a temporary column:
out = df.iloc[np.lexsort([df['B'].str.extract(r'(\w+)$', expand=False)
.map(sort_dicts),
df['A']])]
Alternatively, if you don't want to use lexsort, you could also use a dictionary of custom sorting keys, or keep the original values by default:
sorter = {'B': df['B'].str.extract(r'(\w+)$', expand=False).map(sort_dicts)}
out = df.sort_values(['A', 'B'], key=lambda x: sorter.get(x.name, x))
Output:
A B
4 Ankang Shaanxi Ankang Northeast
0 Ankang Shaanxi Ankang Southeast
12 Ankang Shaanxi Ankang Northwest
7 Ankang Shaanxi Ankang Southwest
11 Ankang Shaanxi Ankang Southwest Upper
1 Baoding Anguo Baoding Anguo Northeast
2 Baoding Anguo Baoding Anguo Southeast
9 Baoding Anguo Baoding Anguo Northwest
5 Baoding Anguo Baoding Anguo Southwest
6 Baoding Anguo Baoding Anguo Upper
3 Changsha Hunan Changsha Hunan Bright
10 Changsha Hunan Changsha Hunan Lower
8 Luoyang Henan Luoyang Henan Upper
Your issue in df.sort_values(by=['A', 'sort_value'], key=lambda x :x.map(sort_dicts)) is that you're passing a key that will be applied to both columns, which gives you an intermediate that cannot sort on A:
A sort_value
0 None 1.0
1 None 0.0
2 None 1.0
3 None NaN
4 None 0.0
5 None 3.0
6 None 4.0
7 None 3.0
8 None 4.0
9 None 2.0
10 None NaN
11 None 4.0
12 None 2.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With