
Cumulative count at a group level in Python

I have a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame([
        ['A', 1234, 20120201],
        ['A', 1134, 20120201],
        ['A', 1011, 20120201],
        ['A', 1123, 20121004],
        ['A', 1111, 20121004],
        ['A', 1224, 20121105],
        ['B', 1156, 20120403],
        ['B', 2345, 20120504],
        ['B', 4567, 20120504],
        ['B', 8796, 20120606]
    ], columns = ['company', 'invoice', 'date'])

The aim is to create a new column called 'totalpaidinvoices' that counts, for each record, the number of invoices the same company paid before that record's date.

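I tried the following:

# The dates are stored as integers like 20120201, so parse them with an explicit format
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
df = df.sort_values(['company', 'date'], ascending=[True, True]).reset_index(drop=True)
# Number the first row of each new date within its company...
df['totalpaidinvoices'] = df[(df['date'] != df['date'].shift(1))].groupby(['company']).cumcount()
# ...then copy that number down to the remaining rows of the same date
df['totalpaidinvoices'] = df.groupby('company')['totalpaidinvoices'].ffill()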

But instead of the number of invoices, what I get is the number of distinct company-date combinations before the current record.

Output:

df = pd.DataFrame(
    [
        ['A', 1234, 20120201, 0.0],
        ['A', 1134, 20120201, 0.0],
        ['A', 1011, 20120201, 0.0],
        ['A', 1123, 20121004, 1.0],
        ['A', 1111, 20121004, 1.0],
        ['A', 1224, 20121105, 2.0],
        ['B', 1156, 20120403, 0.0],
        ['B', 2345, 20120504, 1.0],
        ['B', 4567, 20120504, 1.0],
        ['B', 8796, 20120606, 2.0]
    ], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Expected output:

df = pd.DataFrame(
    [
        ['A', 1234, 20120201, 0.0],
        ['A', 1134, 20120201, 0.0],
        ['A', 1011, 20120201, 0.0],
        ['A', 1123, 20121004, 3.0],
        ['A', 1111, 20121004, 3.0],
        ['A', 1224, 20121105, 5.0],
        ['B', 1156, 20120403, 0.0],
        ['B', 2345, 20120504, 1.0],
        ['B', 4567, 20120504, 1.0],
        ['B', 8796, 20120606, 3.0]
    ], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Any suggestions to fix?

Pavithra Chithambaranath asked Apr 03 '21 15:04



3 Answers

First, let's count the number of invoices paid on each day for each company:

tmp1 = df.groupby(['company', 'date']).size().rename('totalpaidinvoices')

Then for each company, we need to count how many invoices were paid prior to the current period. That's a job for cumsum:

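# group_keys=False keeps the original (company, date) index; recent pandas versions
# would otherwise prepend the group key again and break the merge below
tmp2 = tmp1.groupby('company', group_keys=False).apply(lambda s: s.cumsum() - s)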
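As a small illustration of why `s.cumsum() - s` gives the count before each date rather than up to it, here is a sketch using company A's per-date counts from the question (3, 2 and 1 invoices). The names `running_total` and `paid_before` exist only for this example.

```python
import pandas as pd

# Invoices per date for company A, taken from the question's sample data.
s = pd.Series([3, 2, 1], index=[20120201, 20121004, 20121105])

# Invoices paid up to and including each date.
running_total = s.cumsum()

# Removing each date's own invoices leaves only those paid strictly earlier.
paid_before = running_total - s

print(paid_before.tolist())
```

Subtracting `s` is what keeps invoices paid on the same date from counting towards each other.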

And finally, merge the calculation with the original dataframe:

df.merge(tmp2, left_on=['company', 'date'], right_index=True)

If you prefer method chaining:

result = (
    df.groupby(['company', 'date'])
        .size()
        .groupby('company', group_keys=False)
        .apply(lambda s: s.cumsum() - s)
        .to_frame('totalpaidinvoices')
        .merge(df, how='right', left_index=True, right_on=['company', 'date'])
)
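For reference, a self-contained sketch of the same idea on the question's sample data. It makes one swap, stated plainly: a group-wise `cumsum()` replaces the `apply` call, so the result does not depend on how `apply` treats group keys in a given pandas version. The names `per_date`, `paid_before` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    ['A', 1234, 20120201], ['A', 1134, 20120201], ['A', 1011, 20120201],
    ['A', 1123, 20121004], ['A', 1111, 20121004], ['A', 1224, 20121105],
    ['B', 1156, 20120403], ['B', 2345, 20120504], ['B', 4567, 20120504],
    ['B', 8796, 20120606],
], columns=['company', 'invoice', 'date'])

# Invoices paid per (company, date).
per_date = df.groupby(['company', 'date']).size()

# Running total within each company, minus the current date's own invoices.
paid_before = (
    per_date.groupby(level='company').cumsum() - per_date
).rename('totalpaidinvoices')

# Attach the per-date value to every invoice row.
result = df.merge(paid_before, left_on=['company', 'date'], right_index=True)
print(result)
```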
Code Different answered Nov 01 '22 09:11


If your data is sorted, you can try:

df = df.merge(
    df.groupby(["company", "date"])
    .size()
    .groupby(level=0, group_keys=False)
    .apply(lambda x: x.shift(1).fillna(0).cumsum())
    .reset_index(),
    on=["date", "company"],
).rename(columns={0: "totalpaidinvoices"})
print(df)

Prints:

  company  invoice      date  totalpaidinvoices
0       A     1234  20120201                0.0
1       A     1134  20120201                0.0
2       A     1011  20120201                0.0
3       A     1123  20121004                3.0
4       A     1111  20121004                3.0
5       A     1224  20121105                5.0
6       B     1156  20120403                0.0
7       B     2345  20120504                1.0
8       B     4567  20120504                1.0
9       B     8796  20120606                3.0
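The floats in the printed column come from `shift(1)` introducing a `NaN` that `fillna(0)` then replaces. A possible variation, sketched here on the assumption that integer counts are preferred, passes `fill_value=0` to the group-wise `shift` and attaches the result with `DataFrame.join`. The names `counts`, `prior` and `out` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    ['A', 1234, 20120201], ['A', 1134, 20120201], ['A', 1011, 20120201],
    ['A', 1123, 20121004], ['A', 1111, 20121004], ['A', 1224, 20121105],
    ['B', 1156, 20120403], ['B', 2345, 20120504], ['B', 4567, 20120504],
    ['B', 8796, 20120606],
], columns=['company', 'invoice', 'date'])

# Invoices per (company, date); groupby sorts the keys, so dates are ascending.
counts = df.groupby(['company', 'date']).size()

# Shift within each company (filling the first slot with 0), then accumulate.
prior = (
    counts.groupby(level='company').shift(1, fill_value=0)
          .groupby(level='company').cumsum()
)

# Look up each row's (company, date) pair in the per-date result.
out = df.join(prior.rename('totalpaidinvoices'), on=['company', 'date'])
print(out)
```

Because `join` is a left join on `df`, the original row order and index are kept.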
Andrej Kesely answered Nov 01 '22 07:11


I thought I was making it too complicated switching from cumcount to boolean indexing, but based on the other answers, it seems this is actually the most concise (and potentially efficient) solution:

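for company in df.company.unique():
    # For every row, count this company's invoices dated strictly earlier.
    # The .loc assignment aligns on the index, so only this company's rows are written.
    df.loc[df.company==company, 'totalpaidinvoices'] = df.date.apply(
        lambda x: df.loc[(df.date<x)&(df.company==company)].shape[0]
    )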
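The same "count rows with an earlier date in the same company" logic can also be written without a Python loop. This is a sketch of a different technique, not the code above: a group-wise `rank(method='min')` returns one plus the number of strictly smaller dates in the group, so subtracting 1 gives the count of earlier invoices. It assumes `date` has no missing values.

```python
import pandas as pd

df = pd.DataFrame([
    ['A', 1234, 20120201], ['A', 1134, 20120201], ['A', 1011, 20120201],
    ['A', 1123, 20121004], ['A', 1111, 20121004], ['A', 1224, 20121105],
    ['B', 1156, 20120403], ['B', 2345, 20120504], ['B', 4567, 20120504],
    ['B', 8796, 20120606],
], columns=['company', 'invoice', 'date'])

# Minimum rank within each company: tied dates share the lowest rank,
# which equals 1 + the number of rows with a strictly earlier date.
df['totalpaidinvoices'] = (
    df.groupby('company')['date']
      .rank(method='min')
      .sub(1)
      .astype(int)
)
print(df)
```

Unlike the loop, this does not rescan the frame for every row, and it does not need the data to be sorted first.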
semblable answered Nov 01 '22 09:11