Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Naturally sorting Pandas DataFrame

Tags:

I have a pandas DataFrame with indices I want to sort naturally. Natsort doesn't seem to work. Sorting the indices prior to building the DataFrame doesn't seem to help because the manipulations I do to the DataFrame seem to mess up the sorting in the process. Any thoughts on how I can resort the indices naturally?

from natsort import natsorted import pandas as pd  # An unsorted list of strings a = ['0hr', '128hr', '72hr', '48hr', '96hr'] # Sorted incorrectly b = sorted(a) # Naturally Sorted  c = natsorted(a)  # Use a as the index for a DataFrame df = pd.DataFrame(index=a) # Sorted Incorrectly df2 = df.sort() # Natsort doesn't seem to work df3 = natsorted(df)  print(a) print(b) print(c) print(df.index) print(df2.index) print(df3.index) 
like image 259
agf1997 Avatar asked Apr 11 '15 17:04

agf1997


People also ask

Can you sort a Pandas Dataframe?

Getting Started With Pandas Sort Methods. As a quick reminder, a DataFrame is a data structure with labeled axes for both rows and columns. You can sort a DataFrame by row or column value as well as by row or column index.

How many ways can you sort Pandas Dataframe?

All of the sorting methods available in Pandas fall under the following three categories: Sorting by index labels; Sorting by column values; Sorting by a combination of index labels and column values.

How do I sort a Pandas Dataframe by column values?

You can sort by column values in pandas DataFrame using sort_values() method. To specify the order, you have to use ascending boolean property; False for descending and True for ascending. By default, it is set to True.

What is Natsort in Python?

Description ¶ natsort(array &$array ): bool. This function implements a sort algorithm that orders alphanumeric strings in the way a human being would while maintaining key/value associations. This is described as a "natural ordering".


2 Answers

Now that pandas has support for key in both sort_values and sort_index you should now refer to this other answer and send all upvotes there as it is now the correct answer.

I will leave my answer here for people stuck on old pandas versions, or as a historical curiosity.


The accepted answer answers the question being asked. I'd like to also add how to use natsort on columns in a DataFrame, since that will be the next question asked.

In [1]: from pandas import DataFrame  In [2]: from natsort import natsorted, index_natsorted, order_by_index  In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])  In [4]: df Out[4]:           a   b 0hr     a5  b1 128hr   a1  b1 72hr   a10  b2 48hr    a2  b2 96hr   a12  b1 

As the accepted answer shows, sorting by the index is fairly straightforward:

In [5]: df.reindex(index=natsorted(df.index)) Out[5]:           a   b 0hr     a5  b1 48hr    a2  b2 72hr   a10  b2 96hr   a12  b1 128hr   a1  b1 

If you want to sort on a column in the same manner, you need to sort the index by the order that the desired column was reordered. natsort provides the convenience functions index_natsorted and order_by_index to do just that.

In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a))) Out[6]:           a   b 128hr   a1  b1 48hr    a2  b2 0hr     a5  b1 72hr   a10  b2 96hr   a12  b1  In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b))) Out[7]:           a   b 0hr     a5  b1 128hr   a1  b1 96hr   a12  b1 72hr   a10  b2 48hr    a2  b2 

If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, etc...

In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a)))) Out[8]:           a   b 128hr   a1  b1 0hr     a5  b1 96hr   a12  b1 48hr    a2  b2 72hr   a10  b2  In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index)))) Out[9]:           a   b 0hr     a5  b1 96hr   a12  b1 128hr   a1  b1 48hr    a2  b2 72hr   a10  b2 

Here is an alternate method using Categorical objects that I have been told by the pandas devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex which will allow this method to be used on an index.

In [1]: from pandas import DataFrame  In [2]: from natsort import natsorted  In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])  In [4]: df.a = df.a.astype('category')  In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)  In [6]: df.b = df.b.astype('category')  In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)  In [9]: df.sort('a') Out[9]:           a   b 128hr   a1  b1 48hr    a2  b2 0hr     a5  b1 72hr   a10  b2 96hr   a12  b1  In [10]: df.sort('b') Out[10]:           a   b 0hr     a5  b1 128hr   a1  b1 96hr   a12  b1 72hr   a10  b2 48hr    a2  b2  In [11]: df.sort(['b', 'a']) Out[11]:           a   b 128hr   a1  b1 0hr     a5  b1 96hr   a12  b1 48hr    a2  b2 72hr   a10  b2 

The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".

I leave it to the user to decide if this is better than the reindex method or not, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).


Full disclosure, I am the natsort author.

like image 141
SethMMorton Avatar answered Nov 03 '22 00:11

SethMMorton


Using sort_values for pandas >= 1.1.0

With the new key argument in DataFrame.sort_values, since pandas 1.1.0, we can directly sort a column without setting it as an index using natsort.natsort_keygen:

df = pd.DataFrame({     "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],     "value": [10, 20, 30, 40, 50] })      time  value 0    0hr     10 1  128hr     20 2   72hr     30 3   48hr     40 4   96hr     50 
from natsort import natsort_keygen  df.sort_values(     by="time",     key=natsort_keygen() )      time  value 0    0hr     10 3   48hr     40 2   72hr     30 4   96hr     50 1  128hr     20 
like image 26
Erfan Avatar answered Nov 03 '22 00:11

Erfan