I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A   B   C   D
3   4   8   1
9   2   7   2
Needs to become:
A   B   C   D
8   4   3   1
9   7   2   2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (hence why I'm using pandas and not numpy). Could it be quicker to instead create a list of lists and then build the new dataframe at once? Or do I need to go cython?
To sort the DataFrame based on the values in a single column, you'll use . sort_values() . By default, this will return a new DataFrame sorted in ascending order.
The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.
You can sort by column values in pandas DataFrame using sort_values() method. To specify the order, you have to use ascending boolean property; False for descending and True for ascending. By default, it is set to True.
In order to sort the data frame in pandas, function sort_values() is used. Pandas sort_values() can sort the data frame in Ascending or Descending order.
I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1)  # no ascending argument
In [13]: a = a[:, ::-1]  # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
       [9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
   A  B  C  D
0  8  4  3  1
1  9  7  2  2
I had thought this might work, but it sorts the columns:
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
   D  C  B  A
0  1  8  4  3
1  2  7  2  9
Ah, pandas raises:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)
To Add to the answer given by @Andy-Hayden, to do this inplace to the whole frame... not really sure why this works, but it does. There seems to be no control on the order.
    In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
    In [98]: A
    Out[98]: 
    one  two  three  four  five
    0   22   63     72    46    49
    1   43   30     69    33    25
    2   93   24     21    56    39
    3    3   57     52    11    74
    In [99]: A.values.sort
    Out[99]: <function ndarray.sort>
    In [100]: A
    Out[100]: 
    one  two  three  four  five
    0   22   63     72    46    49
    1   43   30     69    33    25
    2   93   24     21    56    39
    3    3   57     52    11    74
    In [101]: A.values.sort()
    In [102]: A
    Out[102]: 
    one  two  three  four  five
    0   22   46     49    63    72
    1   25   30     33    43    69
    2   21   24     39    56    93
    3    3   11     52    57    74
    In [103]: A = A.iloc[:,::-1]
    In [104]: A
    Out[104]: 
    five  four  three  two  one
    0    72    63     49   46   22
    1    69    43     33   30   25
    2    93    56     39   24   21
    3    74    57     52   11    3
I hope someone can explain the why of this, just happy that it works 8)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With