Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The right way to query a pandas MultiIndex

Tags:

pandas

I've got a huge dataframe (13 million rows) which stocks and stock prices in. I've indexed them using MultiIndex(['stock', 'date']), where date has been parsed as a DateTime.

This means I can select price data based upon stock easily df.loc['AAPL'], and by specific day df.loc['AAPL'].loc['2015-05-05'].

What I want to know is, what is the fastest and most elegant accessor for getting all stock prices on a particular day (i.e. if the indices were reversed).

Currently I'm using df.xs("2015-05-05", level=1) - is that the right way? Is there a better/cleaner/faster way?

like image 929
cjm2671 Avatar asked Mar 19 '16 12:03

cjm2671


Video Answer


1 Answers

I think what you did is fine, but there are alternative ways also.

>>> df = pd.DataFrame({ 
              'stock':np.repeat( ['AAPL','GOOG','YHOO'], 3 ),
              'date':np.tile( pd.date_range('5/5/2015', periods=3, freq='D'), 3 ),
              'price':(np.random.randn(9).cumsum() + 10) })

>>> df = df.set_index(['stock','date'])

                      price
stock date                 
AAPL  2015-05-05   8.538459
      2015-05-06   9.330140
      2015-05-07   8.968898
GOOG  2015-05-05   8.964389
      2015-05-06   9.828230
      2015-05-07   9.992985
YHOO  2015-05-05   9.929548
      2015-05-06   9.330295
      2015-05-07  10.676468

A slightly more standard way of using loc twice

>>> df.loc['AAPL'].loc['2015-05-05']

would be to do

>>> df.loc['AAPL','2015-05-05']

price    8.538459
Name: (AAPL, 2015-05-05 00:00:00), dtype: float64

And instead of xs you could use an IndexSlice. I think for 2 levels xs is easier, but IndexSlice might be better past 2 levels.

>>> idx=pd.IndexSlice

>>> df.loc[ idx[:,'2015-05-05'], : ]

                     price
stock date                
AAPL  2015-05-05  8.538459
GOOG  2015-05-05  8.964389
YHOO  2015-05-05  9.929548

And to be honest, I think the absolute easiest way here is use either date or stock (or neither) as index and then most selections are very straightforward. For example, if you remove the index completely you can effortlessly select by date:

>>> df = df.reset_index()
>>> df[ df['date']=='2015-05-05' ]

   index stock       date      price
0      0  AAPL 2015-05-05   8.538459
3      3  GOOG 2015-05-05   8.964389
6      6  YHOO 2015-05-05   9.929548

Doing some quickie timings with 3 stocks and 3000 dates (=9000 rows), I found that a simple boolean selection (no index) was about 35% faster than xs, and xs was about 35% faster than using IndexSlice. But see Jeff's comment below, you should expect the boolean selection to perform relative worse with more rows.

Of course, the best thing for you to do is test on your own data and see how it comes out.

like image 75
JohnE Avatar answered Oct 15 '22 09:10

JohnE