 

Faster alternative to groupby/shift

Tags:

python

pandas

File under the general heading of "groupby is slow if you have a large number of groups"

import numpy as np
from pandas import DataFrame

nobs = 9999

# integer-divide so each id repeats three times
df = DataFrame( { 'id' : np.arange(nobs) // 3,
                  'yr' : np.tile( np.array([2007,2008,2009]), nobs // 3 ),
                  'val': np.random.randn(nobs) } )

df = df.sort_values(['id','yr'])

A = df.groupby('id').shift()
B = df.shift()

A is what I want, but it takes about 1.5 seconds here, and my actual use case has about 100x the observations. For reference, calculating A is about 1000x slower than calculating B.

Here's what A and B look like:

In [599]: A.head(6)
Out[599]: 
        val    yr
0       NaN   NaN
1 -0.839041  2007
2 -1.089094  2008
3       NaN   NaN
4 -0.068383  2007
5  0.555293  2008

In [600]: B.head(6)
Out[600]: 
   id       val    yr
0 NaN       NaN   NaN
1   0 -0.839041  2007
2   0 -1.089094  2008
3   0  0.050604  2009
4   1 -0.068383  2007
5   1  0.555293  2008

I'd love a general solution to speeding up A, but absent that, a workaround would be great. As you can see, B is actually the same as A except the first value of each group is not really valid and needs to be converted to NaN. It could be done with groupby/rank but anything involving groupby seems to be slow, so I need a non-groupby method.

Is there some way to replicate the rank function via the sorting or indexing? It seems like the information must be embedded there, but I don't know how I'd extract it to a new variable.
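For concreteness, here is a sketch of the kind of mask-based workaround being asked for, assuming the frame is sorted by 'id' as above: a row starts a new group exactly where 'id' differs from the row before it, which a plain shifted comparison can detect without any groupby (a small `nobs` is used here just for illustration):

```python
import numpy as np
import pandas as pd

nobs = 9
df = pd.DataFrame({'id': np.arange(nobs) // 3,
                   'yr': np.tile([2007, 2008, 2009], nobs // 3),
                   'val': np.random.randn(nobs)})
df = df.sort_values(['id', 'yr'])

B = df.shift()
# a row starts a new group wherever 'id' differs from the previous row
first_of_group = df['id'].ne(df['id'].shift())
B[first_of_group] = np.nan

# matches the groupby version on the shared columns
assert B[['yr', 'val']].equals(df.groupby('id').shift())
```

The mask costs one extra shift and one comparison over the whole frame, so it scales like B rather than like A.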

(Edited to add the following)

Here's a solution from the link provided by Jeff below (original answer by HYRY). I've just altered it slightly to work with the example here. On my computer it runs at almost exactly the same speed as DSM's solution.

B.iloc[df.groupby('id').size().cumsum()[:-1]] = np.nan
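The cumsum trick works because `groupby('id').size().cumsum()` gives, for each group, the position one past its last row, which is the position of the next group's first row; dropping the final entry leaves exactly the first rows of every group after the first (row 0 is already NaN from the plain shift). A quick check of that logic on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 1, 1],
                   'yr': [2007, 2008, 2009] * 2,
                   'val': np.arange(6.0)})

B = df.shift()
# cumulative group sizes point at each next group's first row;
# drop the last entry (it points past the end of the frame)
starts = df.groupby('id').size().cumsum()[:-1]
B.iloc[starts] = np.nan

assert B[['yr', 'val']].equals(df.groupby('id').shift())
```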

Asked Feb 04 '26 by JohnE

1 Answer

This isn't the most elegant piece of code ever, but as a hack workaround for your case, how about something like:

def fast(df, col):
    # one whole-frame shift, then blank the rows whose previous
    # row came from a different group
    A = df.sort_values(col).shift()
    last = A.iloc[-1].copy()   # the boundary test below wrongly flags the last row
    A.loc[A[col] != A[col].shift(-1)] = np.nan
    A.iloc[-1] = last
    return A.drop(col, axis=1)

def slow(df, col):
    A = df.sort_values(col).groupby(col).shift()
    return A

which gives

>>> %timeit s = slow(df, "id")
1 loops, best of 3: 2.09 s per loop
>>> %timeit f = fast(df, "id")
100 loops, best of 3: 3.51 ms per loop
>>> slow(df, "id").equals(fast(df, "id"))
True
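The boundary mask in fast works because, after the whole-frame shift, A[col] holds each row's previous id while A[col].shift(-1) holds the row's own id; the two disagree exactly at group starts, plus spuriously at the last row, which the saved `last` row fixes up. On a toy id column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 1, 1]})
A = df.shift()
# A['id'] is each row's previous id; A['id'].shift(-1) is its own id
mask = A['id'] != A['id'].shift(-1)
# flags row 0 (no previous row), row 3 (group boundary), and
# spuriously the last row (no next row) -- hence the `last` fix-up
assert mask.tolist() == [True, False, False, True, False, True]
```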
Answered Feb 06 '26 by DSM