
How to calculate cumulative groupby counts in Pandas with point in time?

I have a df that contains multiple weekly snapshots of JIRA tickets. I want to calculate the YTD counts of tickets.

The df looks like this:

pointInTime   ticketId
2008-01-01         111
2008-01-01         222
2008-01-01         333
2008-01-07         444
2008-01-07         555
2008-01-07         666
2008-01-14         777
2008-01-14         888
2008-01-14         999

With df.groupby(['pointInTime'])['ticketId'].count() I can get the count of IDs in each snapshot. But what I want to achieve is to calculate the cumulative sum,

and end up with a df that looks like this:

pointInTime   ticketId   cumCount
2008-01-01         111   3
2008-01-01         222   3
2008-01-01         333   3
2008-01-07         444   6
2008-01-07         555   6
2008-01-07         666   6
2008-01-14         777   9
2008-01-14         888   9
2008-01-14         999   9

So for 2008-01-07, the number of tickets would be the count for 2008-01-07 plus the count for 2008-01-01.
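For reference, the sample frame above can be reconstructed like this, and the per-snapshot counts confirmed (a minimal sketch using the values from the question):

```python
import pandas as pd

# Rebuild the weekly-snapshot sample from the question
df = pd.DataFrame({
    "pointInTime": ["2008-01-01"] * 3 + ["2008-01-07"] * 3 + ["2008-01-14"] * 3,
    "ticketId": [111, 222, 333, 444, 555, 666, 777, 888, 999],
})

# Count of ticket IDs in each snapshot: 3 per week
per_snapshot = df.groupby("pointInTime")["ticketId"].count()
print(per_snapshot)
```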

asked Jun 18 '19 14:06 by bossangelo


3 Answers

Use GroupBy.count and cumsum, then map the result back to "pointInTime":

df['cumCount'] = (
    df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df

  pointInTime  ticketId  cumCount
0  2008-01-01       111         3
1  2008-01-01       222         3
2  2008-01-01       333         3
3  2008-01-07       444         6
4  2008-01-07       555         6
5  2008-01-07       666         6
6  2008-01-14       777         9
7  2008-01-14       888         9
8  2008-01-14       999         9
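The map works because count().cumsum() produces a running total indexed by pointInTime, which map then looks up for each row (sketch assuming the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "pointInTime": ["2008-01-01"] * 3 + ["2008-01-07"] * 3 + ["2008-01-14"] * 3,
    "ticketId": [111, 222, 333, 444, 555, 666, 777, 888, 999],
})

# Running total of tickets, indexed by snapshot date: 3, 6, 9
running = df.groupby("pointInTime")["ticketId"].count().cumsum()

# map() looks each row's pointInTime up in that index
df["cumCount"] = df["pointInTime"].map(running)
```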
answered Nov 15 '22 04:11 by cs95

I am using value_counts:

df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]: 
0    3
1    3
2    3
3    6
4    6
5    6
6    9
7    9
8    9
Name: pointInTime, dtype: int64

Or

pd.Series(np.arange(len(df))+1,index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]: 
0    3
1    3
2    3
3    6
4    6
5    6
6    9
7    9
8    9
dtype: int32
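Both snippets return a Series aligned to df.index, so either result can be assigned straight to a new column. Note that the second, position-based variant numbers rows by position, so it assumes the frame is already sorted by pointInTime; the value_counts variant does not (a sketch under that assumption, using the question's sample data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pointInTime": ["2008-01-01"] * 3 + ["2008-01-07"] * 3 + ["2008-01-14"] * 3,
    "ticketId": [111, 222, 333, 444, 555, 666, 777, 888, 999],
})

# value_counts approach: sort_index keeps the cumsum in chronological order
by_counts = df["pointInTime"].map(
    df["pointInTime"].value_counts().sort_index().cumsum()
)

# position approach: the last 1-based row position within each group is the
# cumulative count -- valid only when rows are sorted by pointInTime
by_position = (
    pd.Series(np.arange(len(df)) + 1, index=df.index)
    .groupby(df["pointInTime"])
    .transform("last")
)

df["cumCount"] = by_counts  # both agree on this sorted frame
```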
answered Nov 15 '22 04:11 by BENY


Here's an approach transforming with the size and multiplying by the result of taking pd.factorize on pointInTime:

df['cumCount'] = (df.groupby('pointInTime').ticketId
                    .transform('size')
                    .mul(pd.factorize(df.pointInTime)[0]+1))

 pointInTime  ticketId  cumCount
0  2008-01-01       111         3
1  2008-01-01       222         3
2  2008-01-01       333         3
3  2008-01-07       444         6
4  2008-01-07       555         6
5  2008-01-07       666         6
6  2008-01-14       777         9
7  2008-01-14       888         9
8  2008-01-14       999         9
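One caveat worth flagging: size multiplied by (factorize ordinal + 1) equals the cumulative count only when every snapshot contains the same number of tickets, as in the sample data. With unequal group sizes it diverges from the true running total (editor's check, using hypothetical unequal snapshots):

```python
import pandas as pd

# Two snapshots of unequal size: 2 tickets, then 3 tickets
df = pd.DataFrame({
    "pointInTime": ["2008-01-01"] * 2 + ["2008-01-07"] * 3,
    "ticketId": [111, 222, 444, 555, 666],
})

# size-times-ordinal shortcut from this answer
shortcut = (
    df.groupby("pointInTime")["ticketId"].transform("size")
    .mul(pd.factorize(df["pointInTime"])[0] + 1)
)

# true cumulative count via count().cumsum()
true_cum = df["pointInTime"].map(
    df.groupby("pointInTime")["ticketId"].count().cumsum()
)

print(shortcut.tolist())  # [2, 2, 6, 6, 6]
print(true_cum.tolist())  # [2, 2, 5, 5, 5]
```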
answered Nov 15 '22 02:11 by yatu