File under the general heading of "groupby is slow if you have a large number of groups"
import numpy as np
from pandas import DataFrame

nobs = 9999
df = DataFrame({'id': np.arange(nobs) // 3,
                'yr': np.tile(np.array([2007, 2008, 2009]), nobs // 3),
                'val': np.random.randn(nobs)})
df = df.sort_values(['id', 'yr'])
A = df.groupby('id').shift()
B = df.shift()
A is what I want, but it takes about 1.5 seconds here, and my actual use case has about 100x the observations. For reference, calculating A is about 1000x slower than calculating B.
Here's what A and B look like:
In [599]: A.head(6)
Out[599]:
val yr
0 NaN NaN
1 -0.839041 2007
2 -1.089094 2008
3 NaN NaN
4 -0.068383 2007
5 0.555293 2008
In [600]: B.head(6)
Out[600]:
id val yr
0 NaN NaN NaN
1 0 -0.839041 2007
2 0 -1.089094 2008
3 0 0.050604 2009
4 1 -0.068383 2007
5 1 0.555293 2008
I'd love a general solution for speeding up A, but absent that, a workaround would be great. As you can see, B is the same as A except that the first value of each group isn't valid and needs to be converted to NaN. That could be done with groupby/rank, but anything involving groupby seems to be slow, so I need a non-groupby method.
Is there some way to replicate the rank function via the sorting or indexing? It seems like the information must be embedded there, but I don't know how I'd extract it to a new variable.
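One way to extract that boundary information without a groupby, sketched here under the assumption that df is already sorted by 'id' (as in the example above): compare each row's id with the previous row's id. The comparison is True exactly at the first row of each group, and those are the rows of B that need to become NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame, sorted by id as above.
nobs = 9999
df = pd.DataFrame({'id': np.arange(nobs) // 3,
                   'yr': np.tile(np.array([2007, 2008, 2009]), nobs // 3),
                   'val': np.random.randn(nobs)})
df = df.sort_values(['id', 'yr'])

B = df.shift()                                  # one fast, frame-wide shift
first_of_group = df['id'] != df['id'].shift()   # True at each group's first row
B[first_of_group] = np.nan                      # blank values that leaked across groups
A = B.drop('id', axis=1)                        # should match df.groupby('id').shift()
```

This only does a single vectorized shift and comparison, so it avoids the per-group overhead entirely; the trade-off is that it silently assumes the frame is sorted by the group key.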
(Edited to add the following)
Here's a solution from the link provided by Jeff below (original answer by HYRY). I've just altered it slightly to work with the example here. On my computer it runs at almost exactly the same speed as DSM's solution.
B.iloc[df.groupby('id').size().cumsum()[:-1]] = np.nan
This isn't the most elegant piece of code ever, but as a hack workaround for your case, how about something like:
def fast(df, col):
    # Shift the whole sorted frame once, then repair the group boundaries.
    A = df.sort_values(col).shift()
    last = A.iloc[-1].copy()
    # After the shift, A[col][i] != A[col][i+1] exactly when row i is the
    # first row of a group, i.e. its values leaked in from the previous
    # group -- blank those rows out.
    A.loc[A[col] != A[col].shift(-1)] = np.nan
    # The test also flags the final row (shift(-1) is NaN there), so restore it.
    A.iloc[-1] = last
    return A.drop(col, axis=1)

def slow(df, col):
    A = df.sort_values(col).groupby(col).shift()
    return A
which gives
>>> %timeit s = slow(df, "id")
1 loops, best of 3: 2.09 s per loop
>>> %timeit f = fast(df, "id")
100 loops, best of 3: 3.51 ms per loop
>>> slow(df, "id").equals(fast(df, "id"))
True