Sorting numpy array on multiple columns in Python

I am trying to sort the following array on column 1, then column 2, and then column 3:

[['2008' '1' '23' 'AAPL' 'Buy' '100']
 ['2008' '1' '30' 'AAPL' 'Sell' '100']
 ['2008' '1' '23' 'GOOG' 'Buy' '100']
 ['2008' '1' '30' 'GOOG' 'Sell' '100']
 ['2008' '9' '8' 'GOOG' 'Buy' '100']
 ['2008' '9' '15' 'GOOG' 'Sell' '100']
 ['2008' '5' '1' 'XOM' 'Buy' '100']
 ['2008' '5' '8' 'XOM' 'Sell' '100']]

I used the following code:

    idx=np.lexsort((order_array[:,2],order_array[:,1],order_array[:,0]))
    order_array=order_array[idx]

The resultant array is

[['2008' '1' '23' 'AAPL' 'Buy' '100']
 ['2008' '1' '23' 'GOOG' 'Buy' '100']
 ['2008' '1' '30' 'AAPL' 'Sell' '100']
 ['2008' '1' '30' 'GOOG' 'Sell' '100']
 ['2008' '5' '1' 'XOM' 'Buy' '100']
 ['2008' '5' '8' 'XOM' 'Sell' '100']
 ['2008' '9' '15' 'GOOG' 'Sell' '100']
 ['2008' '9' '8' 'GOOG' 'Buy' '100']]

The problem is that the last two rows are in the wrong order: the last row should come second to last. I have tried everything but cannot understand why this is happening. I would appreciate some help.

I am using the following code for obtaining order_array.

    for i in ...:
        x = ldt_timestamps[i]  # this is a list of timestamps
        s_sym = ...
        row = [int(x.year), int(x.month), int(x.day), s_sym, 'Buy', 100]
        rows_list.append(row)

    order_array = np.array(rows_list)

asked Oct 03 '13 10:10

user2842122

1 Answer

tl;dr: NumPy shines when doing numerical calculations on numerical arrays. Although it is possible to sort this data with NumPy (see below), NumPy is not well suited to records that mix strings and numbers like these. You're probably better off using Pandas.

The cause of the problem:

The values are being sorted as strings. You need to sort them as ints.

In [7]: sorted(['15', '8'])
Out[7]: ['15', '8']

In [8]: sorted([15, 8])
Out[8]: [8, 15]

This happened because order_array contains strings. You need to convert those strings to ints where appropriate.

Converting dtypes from string-dtype to numerical dtype requires allocating space for a new array. Therefore, you would probably be better off revising the way you are creating order_array from the beginning.
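If you do want to keep the existing string array, a minimal sketch of the conversion looks like this. It assumes the array is exactly the one shown in the question, and the names keys and sorted_array are just illustrative:

```python
import numpy as np

# Same data as in the question: every element is a string.
order_array = np.array([
    ['2008', '1', '23', 'AAPL', 'Buy', '100'],
    ['2008', '1', '30', 'AAPL', 'Sell', '100'],
    ['2008', '1', '23', 'GOOG', 'Buy', '100'],
    ['2008', '1', '30', 'GOOG', 'Sell', '100'],
    ['2008', '9', '8', 'GOOG', 'Buy', '100'],
    ['2008', '9', '15', 'GOOG', 'Sell', '100'],
    ['2008', '5', '1', 'XOM', 'Buy', '100'],
    ['2008', '5', '8', 'XOM', 'Sell', '100'],
])

# Convert only the three key columns to integers (this allocates a new array).
keys = order_array[:, :3].astype(int)

# np.lexsort treats the LAST key as the primary one, so pass day, month, year.
idx = np.lexsort((keys[:, 2], keys[:, 1], keys[:, 0]))
sorted_array = order_array[idx]

print(sorted_array[-2:])  # the two September rows, now with day 8 before day 15
```

The sorted result is still an array of strings; only the sort keys were converted.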

Interestingly, even though you converted the values to ints, when you call

order_array = np.array(rows_list)

NumPy by default creates a homogeneous array. In a homogeneous array every value has the same dtype. So NumPy tried to find a common dtype for all your values and chose a string dtype, thwarting the effort you put into converting the strings to ints!

You can check the dtype for yourself by inspecting order_array.dtype:

In [42]: order_array = np.array(rows_list)

In [43]: order_array.dtype
Out[43]: dtype('|S4')
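Here is a small self-contained check of that promotion, using two made-up rows in the same layout as rows_list. On Python 3 I would expect a Unicode string dtype rather than the '|S4' shown above, and the exact width depends on the NumPy version, so the sketch only looks at the kind of dtype:

```python
import numpy as np

# Two sample rows mixing ints and strings, like the question's rows_list.
rows_list = [
    [2008, 9, 8, 'GOOG', 'Buy', 100],
    [2008, 9, 15, 'GOOG', 'Sell', 100],
]
order_array = np.array(rows_list)

print(order_array.dtype)           # a string dtype, not an integer dtype
print(type(order_array[0, 2]))     # a NumPy string scalar, not an int
print(sorted(order_array[:, 2]))   # compared as text, so '15' sorts before '8'
```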

Now, how do we fix this?

Using an object dtype:

The simplest way is to use an 'object' dtype

In [53]: order_array = np.array(rows_list, dtype='object')

In [54]: order_array
Out[54]: 
array([[2008, 1, 23, AAPL, Buy, 100],
       [2008, 1, 30, AAPL, Sell, 100],
       [2008, 1, 23, GOOG, Buy, 100],
       [2008, 1, 30, GOOG, Sell, 100],
       [2008, 9, 8, GOOG, Buy, 100],
       [2008, 9, 15, GOOG, Sell, 100],
       [2008, 5, 1, XOM, Buy, 100],
       [2008, 5, 8, XOM, Sell, 100]], dtype=object)

The problem here is that np.lexsort or np.sort do not work on arrays of dtype object. To get around that problem, you could sort the rows_list before creating order_array:

In [59]: import operator

In [60]: rows_list.sort(key=operator.itemgetter(0,1,2))

In [61]: rows_list
Out[61]: 
[(2008, 1, 23, 'AAPL', 'Buy', 100),
 (2008, 1, 23, 'GOOG', 'Buy', 100),
 (2008, 1, 30, 'AAPL', 'Sell', 100),
 (2008, 1, 30, 'GOOG', 'Sell', 100),
 (2008, 5, 1, 'XOM', 'Buy', 100),
 (2008, 5, 8, 'XOM', 'Sell', 100),
 (2008, 9, 8, 'GOOG', 'Buy', 100),
 (2008, 9, 15, 'GOOG', 'Sell', 100)]

order_array = np.array(rows_list, dtype='object')

A better option would be to combine the first three columns into datetime.date objects:

import operator
import datetime as DT

for i in ...:
    x = ldt_timestamps[i]
    seq = [DT.date(int(x.year), int(x.month), int(x.day)), s_sym, 'Buy', 100]
    rows_list.append(seq)
rows_list.sort(key=operator.itemgetter(0, 1, 2))  # sort by date, then symbol, then action
order_array = np.array(rows_list, dtype='object')

In [72]: order_array
Out[72]: 
array([[2008-01-23, AAPL, Buy, 100],
       [2008-01-23, GOOG, Buy, 100],
       [2008-01-30, AAPL, Sell, 100],
       [2008-01-30, GOOG, Sell, 100],
       [2008-05-01, XOM, Buy, 100],
       [2008-05-08, XOM, Sell, 100],
       [2008-09-08, GOOG, Buy, 100],
       [2008-09-15, GOOG, Sell, 100]], dtype=object)
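For a version of the date-based approach that you can run as is, here is a rough sketch. The timestamps and symbols below are hypothetical stand-ins for the question's ldt_timestamps and s_sym, since I don't know how those are actually produced:

```python
import datetime as DT
import operator

# Hypothetical stand-ins for the question's ldt_timestamps and symbols.
ldt_timestamps = [
    DT.datetime(2008, 9, 15, 16),
    DT.datetime(2008, 9, 8, 16),
    DT.datetime(2008, 1, 23, 16),
]
symbols = ['GOOG', 'GOOG', 'AAPL']

rows_list = []
for ts, s_sym in zip(ldt_timestamps, symbols):
    # One date object replaces the separate year, month and day columns.
    rows_list.append([ts.date(), s_sym, 'Buy', 100])

# date objects compare chronologically, so sorting on item 0 is enough.
rows_list.sort(key=operator.itemgetter(0))

for row in rows_list:
    print(row)
```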

Even though this is simple, I don't like NumPy arrays of dtype object. You get neither the speed nor the memory space-saving gains of NumPy arrays with native dtypes. At this point you might find working with a Python list of lists faster and syntactically easier to deal with.

Using a structured array:

A more NumPy-ish solution which still offers speed and memory benefits is to use a structured array (as opposed to a homogeneous array). To make a structured array with np.array you'll need to supply a dtype explicitly, and each row of rows_list must be a tuple (a list of lists is not interpreted as a sequence of records):

dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', '|S8'),
      ('action', '|S4'), ('value', '<i4')]
order_array = np.array(rows_list, dtype=dt)

In [47]: order_array.dtype
Out[47]: dtype([('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', '|S8'), ('action', '|S4'), ('value', '<i4')])

To sort the structured array you could use the sort method:

order_array.sort(order=['year', 'month', 'day'])
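Putting the pieces together, a self-contained sketch might look like the following. Note two assumptions: I use 'U8' and 'U4' (Unicode) instead of '|S8' and '|S4' so that the fields hold ordinary str values on Python 3, and the rows are written out as tuples by hand:

```python
import numpy as np

# Rows must be tuples for NumPy to treat each one as a record.
rows_list = [
    (2008, 1, 23, 'AAPL', 'Buy', 100),
    (2008, 1, 30, 'AAPL', 'Sell', 100),
    (2008, 1, 23, 'GOOG', 'Buy', 100),
    (2008, 1, 30, 'GOOG', 'Sell', 100),
    (2008, 9, 8, 'GOOG', 'Buy', 100),
    (2008, 9, 15, 'GOOG', 'Sell', 100),
    (2008, 5, 1, 'XOM', 'Buy', 100),
    (2008, 5, 8, 'XOM', 'Sell', 100),
]

# Assumption: Unicode string fields ('U8', 'U4') rather than byte strings.
dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'),
      ('symbol', 'U8'), ('action', 'U4'), ('value', '<i4')]
order_array = np.array(rows_list, dtype=dt)

# In-place sort on the integer fields; the day field is compared numerically.
order_array.sort(order=['year', 'month', 'day'])

print(order_array)
```

Because the day field is a real integer here, 8 sorts before 15 and the September rows come out in the right order.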

To work with structured arrays, you'll need to know about some differences between homogeneous and structured arrays:

Your original homogeneous array was 2-dimensional. In contrast, a structured array built this way is 1-dimensional, with one record per row:

In [51]: order_array.shape
Out[51]: (8,)

If you index the structured array with an int or iterate through the array, you get back rows:

In [52]: order_array[3]
Out[52]: (2008, 1, 30, 'GOOG', 'Sell', 100)

With homogeneous arrays you can access the columns with order_array[:, i]. With a structured array, you access them by name instead, e.g. order_array['year'].
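To make these differences concrete, here is a short sketch on a three-row structured array. The rows are sample data, and the 'U8' and 'U4' field types are my substitution for Python 3 strings:

```python
import numpy as np

dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'),
      ('symbol', 'U8'), ('action', 'U4'), ('value', '<i4')]
order_array = np.array([
    (2008, 1, 23, 'AAPL', 'Buy', 100),
    (2008, 1, 30, 'AAPL', 'Sell', 100),
    (2008, 9, 8, 'GOOG', 'Buy', 100),
], dtype=dt)

print(order_array.shape)         # one dimension: one record per row

months = order_array['month']    # a whole column, selected by field name
first_row = order_array[0]       # a single record, selected by integer index

# Boolean masks built from one field select whole records.
goog_rows = order_array[order_array['symbol'] == 'GOOG']
print(months, first_row, goog_rows)
```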

Or, use Pandas:

If you can install Pandas, I think you might be happiest working with a Pandas DataFrame:

In [73]: df = pd.DataFrame(rows_list, columns=['date', 'symbol', 'action', 'value'])
In [75]: df.sort_values(['date'])
Out[75]: 
         date symbol action  value
0  2008-01-23   AAPL    Buy    100
2  2008-01-23   GOOG    Buy    100
1  2008-01-30   AAPL   Sell    100
3  2008-01-30   GOOG   Sell    100
6  2008-05-01    XOM    Buy    100
7  2008-05-08    XOM   Sell    100
4  2008-09-08   GOOG    Buy    100
5  2008-09-15   GOOG   Sell    100

Pandas has useful functions for aligning timeseries by dates, filling in missing values, grouping and aggregating/transforming rows or columns.
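As a runnable sketch of the DataFrame approach: recent pandas releases no longer have DataFrame.sort, so this uses sort_values. The rows are the question's data with the first three columns folded into datetime.date objects, and kind='mergesort' is my choice to keep rows with equal dates in their original order:

```python
import datetime as DT
import pandas as pd

rows_list = [
    [DT.date(2008, 1, 23), 'AAPL', 'Buy', 100],
    [DT.date(2008, 1, 30), 'AAPL', 'Sell', 100],
    [DT.date(2008, 1, 23), 'GOOG', 'Buy', 100],
    [DT.date(2008, 1, 30), 'GOOG', 'Sell', 100],
    [DT.date(2008, 9, 8), 'GOOG', 'Buy', 100],
    [DT.date(2008, 9, 15), 'GOOG', 'Sell', 100],
    [DT.date(2008, 5, 1), 'XOM', 'Buy', 100],
    [DT.date(2008, 5, 8), 'XOM', 'Sell', 100],
]
df = pd.DataFrame(rows_list, columns=['date', 'symbol', 'action', 'value'])

# sort_values returns a new DataFrame; mergesort is a stable algorithm.
sorted_df = df.sort_values('date', kind='mergesort')
print(sorted_df)
```

The original index labels travel with the rows, which is why the printed index is no longer 0 through 7 in order.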

Typically it is more useful to have a single date column instead of three integer-valued columns for the year, month, day.

If you need the year, month, and day as separate columns for output (to CSV, say), then you can replace the date column with year, month, day columns like this:

In [33]: df = df.join(df['date'].apply(lambda x: pd.Series([x.year, x.month, x.day], index=['year', 'month', 'day'])))

In [34]: del df['date']

In [35]: df
Out[35]: 
  symbol action  value  year  month  day
0   AAPL    Buy    100  2008      1   23
1   GOOG    Buy    100  2008      1   23
2   AAPL   Sell    100  2008      1   30
3   GOOG   Sell    100  2008      1   30
4    XOM    Buy    100  2008      5    1
5    XOM   Sell    100  2008      5    8
6   GOOG    Buy    100  2008      9    8
7   GOOG   Sell    100  2008      9   15

Or, if you have no use for the 'date' column to begin with, you can of course leave rows_list alone and build the DataFrame with the year, month, day columns from the beginning. Sorting is still easy:

df.sort_values(['year', 'month', 'day'])
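A minimal sketch of that variant, using four sample rows in the question's original layout. It uses sort_values, since DataFrame.sort is gone from recent pandas releases. The last step, which assembles a date column with pd.to_datetime, is optional and only there in case you want the date back later:

```python
import pandas as pd

# Sample rows in the question's layout: year, month, day kept as ints.
rows_list = [
    [2008, 9, 15, 'GOOG', 'Sell', 100],
    [2008, 1, 23, 'AAPL', 'Buy', 100],
    [2008, 9, 8, 'GOOG', 'Buy', 100],
    [2008, 5, 1, 'XOM', 'Buy', 100],
]
columns = ['year', 'month', 'day', 'symbol', 'action', 'value']
df = pd.DataFrame(rows_list, columns=columns)

# The columns hold integers, so the sort is numeric: day 8 comes before day 15.
sorted_df = df.sort_values(['year', 'month', 'day'])

# Optional: build a datetime column from the year/month/day columns.
sorted_df = sorted_df.assign(
    date=pd.to_datetime(sorted_df[['year', 'month', 'day']])
)
print(sorted_df)
```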

answered Oct 08 '22 00:10

unutbu

Tags: python, sorting, numpy