I'm trying to understand the expected behavior of DataFrame.sort on columns with NaN values.
Given this DataFrame:
In [36]: df
Out[36]:
     a    b
0    1    9
1    2  NaN
2  NaN    5
3    1    2
4    6    5
5    8    4
6    4    5
Sorting using one column puts the NaN at the end, as expected:
In [37]: df.sort(columns="a")
Out[37]:
     a    b
0    1    9
3    1    2
1    2  NaN
6    4    5
4    6    5
5    8    4
2  NaN    5
But a nested sort on two columns doesn't behave as I would expect: the row where a is NaN stays in the middle of the result instead of moving to the end:
In [38]: df.sort(columns=["a","b"])
Out[38]:
     a    b
3    1    2
0    1    9
1    2  NaN
2  NaN    5
6    4    5
4    6    5
5    8    4
Is there a way to make sure the NaNs in nested sort will appear at the end, per column?
Until this is fixed in pandas, this is what I'm using for my own needs. It implements a subset of the functionality of the original DataFrame.sort function and works for numerical values only:
import numpy as np

def dataframe_sort(df, columns, ascending=True):
    a = np.array(df[columns])
    # One sign per column: 1 if ascending, -1 if descending
    if isinstance(ascending, bool):
        ascending = len(columns) * [ascending]
    signs = [1 if asc else -1 for asc in ascending]
    # np.lexsort treats the last key as the primary one, so pass the keys reversed
    ind = np.lexsort([signs[i] * a[:, i] for i in reversed(range(len(columns)))])
    return df.iloc[ind]
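The function above leans on two properties of np.lexsort: keys are passed from least to most significant, and NaN sorts after every number, even when a key is negated for descending order. A small sketch with made-up arrays (a and b here are illustrative values, not the DataFrames from the question):

```python
import numpy as np

# Hypothetical key columns, for illustration only
a = np.array([1.0, 2.0, np.nan, 1.0])
b = np.array([9.0, np.nan, 5.0, 2.0])

# The LAST key is the primary one: sort by a, break ties with b
order = np.lexsort([b, a])
print(order)       # [3 0 1 2] -> the NaN in a lands at the end

# Negating a key reverses its order; -NaN is still NaN, so it stays last
order_desc = np.lexsort([b, -a])
print(order_desc)  # [1 3 0 2]
```

Because the negation trick only makes sense for numbers, this also shows why the helper is limited to numerical columns.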
Usage example:
In [4]: df
Out[4]:
      a    b    c
10    1    9    7
11  NaN  NaN    1
12    2  NaN    6
13  NaN    5    6
14    1    2    6
15    6    5  NaN
16    8    4    4
17    4    5    3
In [5]: dataframe_sort(df, ['a', 'c'], False)
Out[5]:
      a    b    c
16    8    4    4
15    6    5  NaN
17    4    5    3
12    2  NaN    6
10    1    9    7
14    1    2    6
13  NaN    5    6
11  NaN  NaN    1
In [6]: dataframe_sort(df, ['b', 'a'], [False, True])
Out[6]:
      a    b    c
10    1    9    7
17    4    5    3
15    6    5  NaN
13  NaN    5    6
16    8    4    4
14    1    2    6
12    2  NaN    6
11  NaN  NaN    1
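For readers on a later pandas release: DataFrame.sort was eventually replaced by DataFrame.sort_values, which accepts an na_position argument. A minimal sketch, assuming a pandas version that provides sort_values and using the data from the question (the variable name result is mine):

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame in the question
df = pd.DataFrame(
    {
        "a": [1, 2, np.nan, 1, 6, 8, 4],
        "b": [9, np.nan, 5, 2, 5, 4, 5],
    }
)

# Nested sort on a, then b, asking for NaN to go last in each key
result = df.sort_values(by=["a", "b"], na_position="last")
print(result)
```

On such a version the row with NaN in a should come out last, so the NumPy workaround is only needed where sort_values is unavailable or still shows the behavior described in the question.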