I've got a huge dataframe (13 million rows) which stocks and stock prices in. I've indexed them using MultiIndex(['stock', 'date'])
, where date
has been parsed as a DateTime
.
This means I can select price data based upon stock easily df.loc['AAPL']
, and by specific day df.loc['AAPL'].loc['2015-05-05']
.
What I want to know is, what is the fastest and most elegant accessor for getting all stock prices on a particular day (i.e. if the indices were reversed).
Currently I'm using df.xs("2015-05-05", level=1)
- is that the right way? Is there a better/cleaner/faster way?
I think what you did is fine, but there are alternative ways also.
>>> df = pd.DataFrame({
'stock':np.repeat( ['AAPL','GOOG','YHOO'], 3 ),
'date':np.tile( pd.date_range('5/5/2015', periods=3, freq='D'), 3 ),
'price':(np.random.randn(9).cumsum() + 10) })
>>> df = df.set_index(['stock','date'])
price
stock date
AAPL 2015-05-05 8.538459
2015-05-06 9.330140
2015-05-07 8.968898
GOOG 2015-05-05 8.964389
2015-05-06 9.828230
2015-05-07 9.992985
YHOO 2015-05-05 9.929548
2015-05-06 9.330295
2015-05-07 10.676468
A slightly more standard way of using loc twice
>>> df.loc['AAPL'].loc['2015-05-05']
would be to do
>>> df.loc['AAPL','2015-05-05']
price 8.538459
Name: (AAPL, 2015-05-05 00:00:00), dtype: float64
And instead of xs
you could use an IndexSlice. I think for 2 levels xs
is easier, but IndexSlice might be better past 2 levels.
>>> idx=pd.IndexSlice
>>> df.loc[ idx[:,'2015-05-05'], : ]
price
stock date
AAPL 2015-05-05 8.538459
GOOG 2015-05-05 8.964389
YHOO 2015-05-05 9.929548
And to be honest, I think the absolute easiest way here is use either date or stock (or neither) as index and then most selections are very straightforward. For example, if you remove the index completely you can effortlessly select by date:
>>> df = df.reset_index()
>>> df[ df['date']=='2015-05-05' ]
index stock date price
0 0 AAPL 2015-05-05 8.538459
3 3 GOOG 2015-05-05 8.964389
6 6 YHOO 2015-05-05 9.929548
Doing some quickie timings with 3 stocks and 3000 dates (=9000 rows), I found that a simple boolean selection (no index) was about 35% faster than xs, and xs was about 35% faster than using IndexSlice. But see Jeff's comment below, you should expect the boolean selection to perform relative worse with more rows.
Of course, the best thing for you to do is test on your own data and see how it comes out.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With