Pandas: GroupBy Shift And Cumulative Sum

I want to do a groupby, shift and cumsum, which seems like a pretty trivial task, but I keep banging my head over the result I'm getting. Can someone please tell me what I am doing wrong? All the results I found online show the same approach, or a variation of it. Below is my implementation.

import pandas as pd

temp = pd.DataFrame(data=[['a',1],['a',1],['a',1],['b',1],['b',1],['b',1],['c',1],['c',1]], columns=['ID','X'])

temp['transformed'] = temp.groupby('ID')['X'].cumsum().shift()
print(temp)

   ID   X   transformed
0   a   1   NaN
1   a   1   1.0
2   a   1   2.0
3   b   1   3.0
4   b   1   1.0
5   b   1   2.0
6   c   1   3.0
7   c   1   1.0

This is wrong. The result I am looking for restarts at NaN for each ID:

   ID   X   transformed
0   a   1   NaN
1   a   1   1.0
2   a   1   2.0
3   b   1   NaN
4   b   1   1.0
5   b   1   2.0
6   c   1   NaN
7   c   1   1.0

Thanks a lot in advance.

Krishnang K Dalal asked Mar 04 '19 23:03


2 Answers

While working on this problem, I noticed that using lambdas with transform gets very slow as the DataFrame grows. Using the built-in DataFrameGroupBy methods (like cumsum and shift) instead of lambdas is much faster.

So here's my proposed solution: create a 'temp' column to hold the cumsum for each ID, then shift it in a second groupby (df here is the DataFrame from the question):

df['temp'] = df.groupby("ID")['X'].cumsum()
df['transformed'] = df.groupby("ID")['temp'].shift()
df = df.drop(columns=["temp"])
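
For anyone who wants to run this end to end, here is a minimal sketch on the sample data from the question. The chained variant, which groups the cumsum result by the ID column again, is only a shorthand for skipping the helper column and is not part of the answer above; the names with_temp and chained are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['a', 1], ['a', 1], ['a', 1], ['b', 1], ['b', 1], ['b', 1], ['c', 1], ['c', 1]],
    columns=['ID', 'X'],
)

# Two-step version: per-group cumsum into a helper column, then a per-group shift.
df['temp'] = df.groupby('ID')['X'].cumsum()
with_temp = df.groupby('ID')['temp'].shift()
df = df.drop(columns=['temp'])

# Chained version (assumption: same idea, no helper column).
# The cumsum result is a plain Series, so it is grouped by ID again
# before shifting; otherwise the shift would run across group boundaries.
chained = df.groupby('ID')['X'].cumsum().groupby(df['ID']).shift()

df['transformed'] = chained
print(df)
```

Both versions only call built-in groupby methods, so neither one runs a Python lambda per group.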
Kazu answered Sep 18 '22 11:09


You could use transform() to pass each group created by groupby to cumsum() and shift() separately, so the shift never crosses from one ID into the next.

temp['transformed'] = \
    temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())
print(temp)

  ID  X   transformed
0  a  1   NaN
1  a  1   1.0
2  a  1   2.0
3  b  1   NaN
4  b  1   1.0
5  b  1   2.0
6  c  1   NaN
7  c  1   1.0
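
To see why the original attempt leaks values across groups while transform() does not, it can help to compute both side by side. This is just an illustrative sketch on the question's data; the column names naive and per_group are made up for the comparison.

```python
import pandas as pd

temp = pd.DataFrame(
    data=[['a', 1], ['a', 1], ['a', 1], ['b', 1], ['b', 1], ['b', 1], ['c', 1], ['c', 1]],
    columns=['ID', 'X'],
)

# groupby(...).cumsum() returns an ordinary Series, so this shift() is applied
# to the whole column at once and knows nothing about the groups.
temp['naive'] = temp.groupby('ID')['X'].cumsum().shift()

# Inside transform() the lambda receives one group's values at a time,
# so both cumsum() and shift() stay within that group.
temp['per_group'] = temp.groupby('ID')['X'].transform(lambda x: x.cumsum().shift())

# Rows where the two results disagree: the first row of every group except the first group.
differs = temp['naive'].fillna(-1) != temp['per_group'].fillna(-1)
print(temp[differs])
```

The rows that differ are exactly the ones where the naive version carried the previous group's final total over the boundary.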

For more info on transform(), see:

  • https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#Transformation
  • https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#transformation
leerssej answered Sep 17 '22 11:09