Naturally sorting Pandas DataFrame

I have a pandas DataFrame with indices I want to sort naturally. Natsort doesn't seem to work. Sorting the indices before building the DataFrame doesn't help, because the manipulations I do to the DataFrame mess up the ordering in the process. Any thoughts on how I can re-sort the indices naturally?

    from natsort import natsorted
    import pandas as pd

    # An unsorted list of strings
    a = ['0hr', '128hr', '72hr', '48hr', '96hr']
    # Sorted incorrectly
    b = sorted(a)
    # Naturally Sorted
    c = natsorted(a)

    # Use a as the index for a DataFrame
    df = pd.DataFrame(index=a)
    # Sorted Incorrectly
    df2 = df.sort()
    # Natsort doesn't seem to work
    df3 = natsorted(df)

    print(a)
    print(b)
    print(c)
    print(df.index)
    print(df2.index)
    print(df3.index)

259

asked Apr 11 '15 17:04

agf1997

2 Answers

Now that `pandas` has support for `key` in both `sort_values` and `sort_index`, you should refer to this other answer and send all upvotes there, as it is now the correct answer.

I will leave my answer here for people stuck on old pandas versions, or as a historical curiosity.

The accepted answer answers the question being asked. I'd like to also add how to use natsort on columns in a DataFrame, since that will be the next question asked.

    In [1]: from pandas import DataFrame

    In [2]: from natsort import natsorted, index_natsorted, order_by_index

    In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

    In [4]: df
    Out[4]:
             a   b
    0hr     a5  b1
    128hr   a1  b1
    72hr   a10  b2
    48hr    a2  b2
    96hr   a12  b1

As the accepted answer shows, sorting by the index is fairly straightforward:

    In [5]: df.reindex(index=natsorted(df.index))
    Out[5]:
             a   b
    0hr     a5  b1
    48hr    a2  b2
    72hr   a10  b2
    96hr   a12  b1
    128hr   a1  b1

If you want to sort on a column in the same manner, you need to reorder the index by the order in which the desired column would be sorted. `natsort` provides the convenience functions `index_natsorted` and `order_by_index` to do just that.

    In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
    Out[6]:
             a   b
    128hr   a1  b1
    48hr    a2  b2
    0hr     a5  b1
    72hr   a10  b2
    96hr   a12  b1

    In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
    Out[7]:
             a   b
    0hr     a5  b1
    128hr   a1  b1
    96hr   a12  b1
    72hr   a10  b2
    48hr    a2  b2
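
To make the two helpers less opaque, here is a minimal pure-Python sketch of the idea behind them. `natural_key`, `index_sorted`, and `order_by` are hypothetical stand-ins written for this illustration, not natsort's implementation. The real `index_natsorted` and `order_by_index` handle many more cases (mixed types, signs, floats, and so on).

```python
import re


def natural_key(text):
    """Hypothetical, simplified natural-sort key: 'a10' -> ['a', 10, ''].

    re.split with a capturing group puts the digit runs at odd positions,
    so those are converted to int and compared numerically.
    """
    parts = re.split(r"(\d+)", text)
    return [int(part) if pos % 2 else part for pos, part in enumerate(parts)]


def index_sorted(values, key):
    """Return the positions that would sort `values` (the idea behind index_natsorted)."""
    return sorted(range(len(values)), key=lambda pos: key(values[pos]))


def order_by(sequence, positions):
    """Pick items of `sequence` in the given order (the idea behind order_by_index)."""
    return [sequence[pos] for pos in positions]


index = ["0hr", "128hr", "72hr", "48hr", "96hr"]
column_a = ["a5", "a1", "a10", "a2", "a12"]

positions = index_sorted(column_a, natural_key)
new_index = order_by(index, positions)

print(positions)  # [1, 3, 0, 2, 4]
print(new_index)  # ['128hr', '48hr', '0hr', '72hr', '96hr']
```

The resulting `new_index` is the same row order as `Out[6]` above: the column is sorted naturally, and the index labels are then pulled out in that order and handed to `reindex`.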

If you want to reorder by an arbitrary number of columns (or a column and the index), you can use `zip` (or `itertools.izip` on Python 2) to specify sorting on multiple columns. The first column given is the primary sort key, the second is the secondary key, and so on.

    In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
    Out[8]:
             a   b
    128hr   a1  b1
    0hr     a5  b1
    96hr   a12  b1
    48hr    a2  b2
    72hr   a10  b2

    In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
    Out[9]:
             a   b
    0hr     a5  b1
    96hr   a12  b1
    128hr   a1  b1
    48hr    a2  b2
    72hr   a10  b2

Here is an alternate method using Categorical objects that I have been told by the pandas devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex which will allow this method to be used on an index.

    In [1]: from pandas import DataFrame

    In [2]: from natsort import natsorted

    In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

    In [4]: df.a = df.a.astype('category')

    In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)

    In [6]: df.b = df.b.astype('category')

    In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)

    In [9]: df.sort('a')
    Out[9]:
             a   b
    128hr   a1  b1
    48hr    a2  b2
    0hr     a5  b1
    72hr   a10  b2
    96hr   a12  b1

    In [10]: df.sort('b')
    Out[10]:
             a   b
    0hr     a5  b1
    128hr   a1  b1
    96hr   a12  b1
    72hr   a10  b2
    48hr    a2  b2

    In [11]: df.sort(['b', 'a'])
    Out[11]:
             a   b
    128hr   a1  b1
    0hr     a5  b1
    96hr   a12  b1
    48hr    a2  b2
    72hr   a10  b2

The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".

I leave it to the user to decide if this is better than the reindex method or not, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).
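
For readers on a current pandas, the `Categorical` session above will not run as written: `DataFrame.sort` has since been removed in favour of `sort_values`, and `reorder_categories` no longer accepts `inplace`. The following is a rough sketch of the same idea using `pd.Categorical` directly. The `natural_key` helper is a hypothetical, simplified stand-in so the snippet depends only on pandas. With natsort installed, `natsorted(df[col].unique())` should be able to take the place of the `sorted(...)` call.

```python
import re

import pandas as pd


def natural_key(text):
    """Hypothetical, simplified natural-sort key: digit runs compare as ints."""
    parts = re.split(r"(\d+)", text)
    return [int(part) if pos % 2 else part for pos, part in enumerate(parts)]


df = pd.DataFrame(
    {"a": ["a5", "a1", "a10", "a2", "a12"], "b": ["b1", "b1", "b2", "b2", "b1"]},
    index=["0hr", "128hr", "72hr", "48hr", "96hr"],
)

for col in ("a", "b"):
    # Categories must be unique, so sort the distinct values of the column.
    categories = sorted(df[col].unique(), key=natural_key)
    df[col] = pd.Categorical(df[col], categories=categories, ordered=True)

# Ordered categoricals sort by category order, not lexicographically.
by_a = df.sort_values("a")
by_b_then_a = df.sort_values(["b", "a"])

print(by_a)
print(by_b_then_a)
```

The row order of `by_a` and `by_b_then_a` should match `Out[9]` and `Out[11]` from the original session.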

Full disclosure, I am the natsort author.

141

answered Nov 03 '22 00:11

SethMMorton

Using `sort_values` for `pandas >= 1.1.0`

Since pandas 1.1.0, `DataFrame.sort_values` accepts a `key` argument, so we can sort a column naturally without setting it as the index by passing `natsort.natsort_keygen()`:

    import pandas as pd

    df = pd.DataFrame({
        "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
        "value": [10, 20, 30, 40, 50]
    })

        time  value
    0    0hr     10
    1  128hr     20
    2   72hr     30
    3   48hr     40
    4   96hr     50

    from natsort import natsort_keygen

    df.sort_values(
        by="time",
        key=natsort_keygen()
    )

        time  value
    0    0hr     10
    3   48hr     40
    2   72hr     30
    4   96hr     50
    1  128hr     20
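
The original question was about the index rather than a column, and `sort_index` takes a `key` argument as well (pandas >= 1.1.0). Below is a minimal sketch of that case. The lambda is an assumption made for this illustration, not natsort's key: it only pulls out the leading integer of each label, which is enough for labels shaped like `'72hr'` but is not a general natural sort.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=["0hr", "128hr", "72hr", "48hr", "96hr"],
)

# `key` receives the whole Index and must return something of the same length.
# Here: extract the leading digits of each label and compare them as integers.
sorted_df = df.sort_index(
    key=lambda idx: idx.str.extract(r"^(\d+)", expand=False).astype(int)
)

print(sorted_df)
```

The key function is vectorised: it is called once with the full index, not once per label, which is why it uses the `.str` accessor instead of plain string methods.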

answered Nov 03 '22 00:11

Erfan
