I have a pandas DataFrame like this:
import pandas as pd
df = pd.DataFrame([
['A', 1234, 20120201],
['A', 1134, 20120201],
['A', 1011, 20120201],
['A', 1123, 20121004],
['A', 1111, 20121004],
['A', 1224, 20121105],
['B', 1156, 20120403],
['B', 2345, 20120504],
['B', 4567, 20120504],
['B', 8796, 20120606]
], columns = ['company', 'invoice', 'date'])
The aim is to create a new column called 'TotalPaidInvoices' that counts, for each record, the number of invoices the same company paid on earlier dates.
I tried the following
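# Parse the YYYYMMDD integers explicitly; without a format they are read as nanoseconds since the epoch
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')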
df = df.sort_values(['company', 'date'], ascending=[True, True]).reset_index(drop=True)
df['totalpaidinvoices']= df[(df['date'] != df['date'].shift(1))].groupby(['company']).cumcount()
df['totalpaidinvoices'] = df.groupby('company')['totalpaidinvoices'].ffill()
But instead of the number of invoices, what I get is the number of company-date combinations prior to the current record.
Output :
df = pd.DataFrame(
[
['A', 1234, 20120201, 0.0],
['A', 1134, 20120201, 0.0],
['A', 1011, 20120201, 0.0],
['A', 1123, 20121004, 1.0],
['A', 1111, 20121004, 1.0],
['A', 1224, 20121105, 2.0],
['B', 1156, 20120403, 0.0],
['B', 2345, 20120504, 1.0],
['B', 4567, 20120504, 1.0],
['B', 8796, 20120606, 2.0]
], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])
Expected output :
df = pd.DataFrame(
[
['A', 1234, 20120201, 0.0],
['A', 1134, 20120201, 0.0],
['A', 1011, 20120201, 0.0],
['A', 1123, 20121004, 3.0],
['A', 1111, 20121004, 3.0],
['A', 1224, 20121105, 5.0],
['B', 1156, 20120403, 0.0],
['B', 2345, 20120504, 1.0],
['B', 4567, 20120504, 1.0],
['B', 8796, 20120606, 3.0]
], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])
Any suggestions to fix?
First, let's count the number of invoices paid on each day for each company:
tmp1 = df.groupby(['company', 'date']).size().rename('totalpaidinvoices')
Then, for each company, we need to count how many invoices were paid prior to the current date. That's a job for cumsum:
# group_keys=False keeps the original (company, date) index on pandas 2.x
tmp2 = tmp1.groupby('company', group_keys=False).apply(lambda s: s.cumsum() - s)
And finally, merge the calculation with the original dataframe:
df.merge(tmp2, left_on=['company', 'date'], right_index=True)
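Putting the three steps together, here is a minimal self-contained sketch run on the sample data from the question. The variable names `per_day`, `prior` and `result` are my own, and it uses the built-in groupby `cumsum()` rather than `apply`, which should give the same numbers:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["A", 1234, 20120201],
        ["A", 1134, 20120201],
        ["A", 1011, 20120201],
        ["A", 1123, 20121004],
        ["A", 1111, 20121004],
        ["A", 1224, 20121105],
        ["B", 1156, 20120403],
        ["B", 2345, 20120504],
        ["B", 4567, 20120504],
        ["B", 8796, 20120606],
    ],
    columns=["company", "invoice", "date"],
)

# Step 1: number of invoices paid per (company, date)
per_day = df.groupby(["company", "date"]).size().rename("totalpaidinvoices")

# Step 2: running total within each company, minus the current date's own count,
# leaves only the invoices paid on strictly earlier dates
prior = per_day.groupby(level="company").cumsum() - per_day

# Step 3: attach the per-date value to every invoice row
result = df.merge(prior, left_on=["company", "date"], right_index=True)
print(result)
```

Because the counts are computed per (company, date) pair, invoices sharing a date all receive the same value, which is what the expected output asks for.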
If you prefer method chaining:
result = (
    df.groupby(['company', 'date'])
    .size()
    # group_keys=False keeps the original (company, date) index on pandas 2.x
    .groupby('company', group_keys=False)
    .apply(lambda s: s.cumsum() - s)
    .to_frame('totalpaidinvoices')
    .merge(df, how='right', left_index=True, right_on=['company', 'date'])
)
If your data is sorted, you can try:
df = df.merge(
    df.groupby(["company", "date"])
    .size()
    # group_keys=False avoids a duplicated company level on pandas 2.x
    .groupby(level=0, group_keys=False)
    .apply(lambda x: x.shift(1).fillna(0).cumsum())
    .reset_index(),
    on=["date", "company"],
).rename(columns={0: "totalpaidinvoices"})
print(df)
Prints:
  company  invoice      date  totalpaidinvoices
0       A     1234  20120201                0.0
1       A     1134  20120201                0.0
2       A     1011  20120201                0.0
3       A     1123  20121004                3.0
4       A     1111  20121004                3.0
5       A     1224  20121105                5.0
6       B     1156  20120403                0.0
7       B     2345  20120504                1.0
8       B     4567  20120504                1.0
9       B     8796  20120606                3.0
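The shift-then-cumsum trick should produce the same numbers as subtracting each date's count from its running total. A small sketch on company A's per-date counts (3, 2, 1) illustrates this. The variable names are illustrative, and the `fill_value=0` variant is my own suggestion for keeping integers rather than floats:

```python
import pandas as pd

# Invoices per date for a single company, in date order
counts = pd.Series([3, 2, 1], index=[20120201, 20121004, 20121105])

# Running total minus the current date's own count
via_subtract = counts.cumsum() - counts

# Shift the counts down one date, fill the gap with 0, then accumulate
via_shift = counts.shift(1).fillna(0).cumsum()

# Same idea, but fill_value=0 avoids the NaN and so keeps the integer dtype
via_shift_int = counts.shift(1, fill_value=0).cumsum()

print(via_subtract.tolist())
print(via_shift.tolist())
print(via_shift_int.tolist())
```

The float column in the printed output above comes from the NaN that `shift(1)` introduces before `fillna(0)` replaces it.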
I thought I was overcomplicating things by switching from cumcount to boolean indexing, but judging by the other answers this may actually be the most concise solution. Note that it rescans the whole frame for every row, so it is unlikely to be the fastest on large data:
for company in df.company.unique():
    df.loc[df.company==company, 'total_paid_invoices'] = df.date.apply(
        lambda x: df.loc[(df.date<x)&(df.company==company)].shape[0]
    )
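The loop counts, for each row, the same-company rows with a strictly earlier date. One possible vectorised way to express that same count is the per-company minimum rank of the date, minus one. This is a swapped-in technique of my own, not something taken from the answers above, so treat it as a sketch:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["A", 1234, 20120201],
        ["A", 1134, 20120201],
        ["A", 1011, 20120201],
        ["A", 1123, 20121004],
        ["A", 1111, 20121004],
        ["A", 1224, 20121105],
        ["B", 1156, 20120403],
        ["B", 2345, 20120504],
        ["B", 4567, 20120504],
        ["B", 8796, 20120606],
    ],
    columns=["company", "invoice", "date"],
)

# rank(method="min") assigns each date 1 + the number of strictly smaller
# dates in the same company, so subtracting 1 leaves the count of earlier rows
df["total_paid_invoices"] = df.groupby("company")["date"].rank(method="min") - 1
print(df)
```

Like the loop, this does not require the frame to be sorted, and the result comes back as floats.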