Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas rolling max with groupby

I have a problem getting the rolling function of Pandas to do what I wish. I want for each frow to calculate the maximum so far within the group. Here is an example:

df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])

looks like

   id  value
0   1      3
1   1      6
2   1      3
3   2      2
4   2      1

Now I wish to obtain the following DataFrame:

   id  value
0   1      3
1   1      6
2   1      6
3   2      2
4   2      2

The problem is that when I do

df.groupby('id')['value'].rolling(1).max()

I get the same DataFrame back. And when I do

df.groupby('id')['value'].rolling(3).max()

I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?

like image 629
splinter Avatar asked May 07 '17 10:05

splinter


People also ask

What is Min_periods in rolling?

The min_periods argument specifies the minimum number of observations in the current window required to generate a rolling value; otherwise, the result is NaN .

What is Cummax in Python?

The cummax() method returns a DataFrame with the cumulative maximum values.

How does rolling work in pandas?

rolling() function provides the feature of rolling window calculations. The concept of rolling window calculation is most primarily used in signal processing and time-series data. In very simple words we take a window size of k at a time and perform some desired mathematical operation on it.

What does PD GroupBy do?

Pandas' GroupBy is a powerful and versatile function in Python. It allows you to split your data into separate groups to perform computations for better analysis.


2 Answers

It looks like you need cummax() instead of .rolling(N).max()

In [29]: df['new'] = df.groupby('id').value.cummax()

In [30]: df
Out[30]:
   id  value  new
0   1      3    3
1   1      6    6
2   1      3    6
3   2      2    2
4   2      1    2

Timing (using brand new Pandas version 0.20.1):

In [3]: df = pd.concat([df] * 10**4, ignore_index=True)

In [4]: df.shape
Out[4]: (50000, 2)

In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop

In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop

NOTE: from Pandas 0.20.0 what's new

  • Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
like image 152
MaxU - stop WAR against UA Avatar answered Sep 25 '22 03:09

MaxU - stop WAR against UA


Using apply will be a tiny bit faster:

# Using apply  
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop

Other method:

df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop
like image 39
Andrew L Avatar answered Sep 26 '22 03:09

Andrew L