Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to shift MultiIndex time series

Tags:

pandas

I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?

Right now, I do something like:

cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))

For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)

I'm converting this from R code and the equivalent data.table call:

dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]

only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)

Is there a way I can get the pandas verison to be back in line speed-wise? I tested this on 0.8.1.

Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:

ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))

That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.

Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.

library(data.table)
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])
like image 828
Johann Hibschman Avatar asked Nov 01 '12 15:11

Johann Hibschman


1 Answers

I would suggest you reshape the data and do a single shift versus the groupby approach:

result = df.unstack(0).shift(1).stack()

This switches the order of the levels so you'd want to swap and reorder:

result = result.swaplevel(0, 1).sortlevel(0)

You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):

In [17]: result.ix[1]
Out[17]: 
              a         b         c         d         e
month                                                  
1      0.752511  0.600825  0.328796  0.852869  0.306379
2      0.251120  0.871167  0.977606  0.509303  0.809407
3      0.198327  0.587066  0.778885  0.565666  0.172045
4      0.298184  0.853896  0.164485  0.169562  0.923817
5      0.703668  0.852304  0.030534  0.415467  0.663602
6      0.851866  0.629567  0.918303  0.205008  0.970033
7      0.758121  0.066677  0.433014  0.005454  0.338596
8      0.561382  0.968078  0.586736  0.817569  0.842106
9      0.246986  0.829720  0.522371  0.854840  0.887886
10     0.709550  0.591733  0.919168  0.568988  0.849380
11     0.997787  0.084709  0.664845  0.808106  0.872628
12     0.008661  0.449826  0.841896  0.307360  0.092581
13     0.727409  0.791167  0.518371  0.691875  0.095718
14     0.928342  0.247725  0.754204  0.468484  0.663773
15     0.934902  0.692837  0.367644  0.061359  0.381885
16     0.828492  0.026166  0.050765  0.524551  0.296122
17     0.589907  0.775721  0.061765  0.033213  0.793401
18     0.532189  0.678184  0.747391  0.199283  0.349949

In [18]: df.ix[1]
Out[18]: 
              a         b         c         d         e
month                                                  
0      0.752511  0.600825  0.328796  0.852869  0.306379
1      0.251120  0.871167  0.977606  0.509303  0.809407
2      0.198327  0.587066  0.778885  0.565666  0.172045
3      0.298184  0.853896  0.164485  0.169562  0.923817
4      0.703668  0.852304  0.030534  0.415467  0.663602
5      0.851866  0.629567  0.918303  0.205008  0.970033
6      0.758121  0.066677  0.433014  0.005454  0.338596
7      0.561382  0.968078  0.586736  0.817569  0.842106
8      0.246986  0.829720  0.522371  0.854840  0.887886
9      0.709550  0.591733  0.919168  0.568988  0.849380
10     0.997787  0.084709  0.664845  0.808106  0.872628
11     0.008661  0.449826  0.841896  0.307360  0.092581
12     0.727409  0.791167  0.518371  0.691875  0.095718
13     0.928342  0.247725  0.754204  0.468484  0.663773
14     0.934902  0.692837  0.367644  0.061359  0.381885
15     0.828492  0.026166  0.050765  0.524551  0.296122
16     0.589907  0.775721  0.061765  0.033213  0.793401
17     0.532189  0.678184  0.747391  0.199283  0.349949

Perf isn't too bad with this method (it might be a touch slower in 0.9.0):

In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s
like image 98
Wes McKinney Avatar answered Nov 03 '22 18:11

Wes McKinney