I have a list of invoices sent out to customers. Sometimes a bad invoice is sent, which is later cancelled. My pandas DataFrame looks something like this, except much larger (~3 million rows):
index | customer | invoice_nr | amount | date
------|----------|------------|--------|-----------
0     | 1        | 1          |  10    | 01-01-2016
1     | 1        | 1          | -10    | 01-01-2016
2     | 1        | 1          |  11    | 01-01-2016
3     | 1        | 2          |  10    | 02-01-2016
4     | 2        | 3          |   7    | 01-01-2016
5     | 2        | 4          |  12    | 02-01-2016
6     | 2        | 4          |   8    | 02-01-2016
7     | 2        | 4          | -12    | 02-01-2016
8     | 2        | 4          |   4    | 02-01-2016
...   | ...      | ...        | ...    | ...
Now, I want to drop all rows for which the customer, invoice_nr and date are identical, but the amount has opposite values.
Corrections of invoices always take place on the same day with an identical invoice number. The invoice number is uniquely bound to the customer and always corresponds to one transaction (which can consist of multiple components, for example customer = 2, invoice_nr = 4). Corrections of invoices only occur either to change the amount charged or to split the amount into smaller components. Hence, the cancelled value is never repeated under the same invoice_nr.
Any help on how to program this would be much appreciated.
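For anyone who wants to reproduce this, the sample above can be built as a small frame; the column names and the string date format are taken from the table (the real data would of course be loaded from a file):

```python
import pandas as pd

# sample data from the table above; dates kept as plain strings
df = pd.DataFrame({
    'customer':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'invoice_nr': [1, 1, 1, 2, 3, 4, 4, 4, 4],
    'amount':     [10, -10, 11, 10, 7, 12, 8, -12, 4],
    'date':       ['01-01-2016', '01-01-2016', '01-01-2016', '02-01-2016',
                   '01-01-2016', '02-01-2016', '02-01-2016', '02-01-2016',
                   '02-01-2016'],
})
print(df)
```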
def remove_cancelled_transactions(df):
    # a cancellation is a negative amount that directly follows the
    # positive row it cancels within the group; drop both rows
    trans_neg = df.amount < 0
    return df.loc[~(trans_neg | trans_neg.shift(-1, fill_value=False))]

# grouping by abs(amount) puts a charge and its cancellation
# in the same group
groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
df.groupby(groups, as_index=False, group_keys=False) \
  .apply(remove_cancelled_transactions)
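A self-contained run of this approach on the question's sample data; as a sketch, I pass fill_value=False to the shift so the last row of each group compares cleanly, and this assumes a cancellation always appears after the row it cancels:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'customer':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'invoice_nr': [1, 1, 1, 2, 3, 4, 4, 4, 4],
    'amount':     [10, -10, 11, 10, 7, 12, 8, -12, 4],
    'date':       ['01-01-2016', '01-01-2016', '01-01-2016', '02-01-2016',
                   '01-01-2016', '02-01-2016', '02-01-2016', '02-01-2016',
                   '02-01-2016'],
})

def remove_cancelled_transactions(g):
    # drop each negative row together with the row just before it
    neg = g.amount < 0
    return g.loc[~(neg | neg.shift(-1, fill_value=False))]

# abs(amount) puts a charge and its cancellation in the same group
groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
result = (df.groupby(groups, group_keys=False)
            .apply(remove_cancelled_transactions)
            .sort_index())
print(result)
```

The cancelled pairs (10/-10 and 12/-12) disappear, leaving the five genuine rows.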
You can use filter to keep only the groups whose amounts sum to 0 and whose length is divisible by 2:
print(df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
        .filter(lambda x: (len(x) % 2 == 0) and (x.amount.sum() == 0)))
customer invoice_nr amount date
index
0 1 1 10 01-01-2016
1 1 1 -10 01-01-2016
5 2 4 12 02-01-2016
6 2 4 -12 02-01-2016
idx = (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
         .filter(lambda x: (len(x) % 2 == 0) and (x.amount.sum() == 0))
         .index)
print(idx)
Int64Index([0, 1, 5, 6], dtype='int64', name='index')
print(df.drop(idx))
customer invoice_nr amount date
index
2 1 1 11 01-01-2016
3 1 2 10 02-01-2016
4 2 3 7 01-01-2016
7 2 4 8 02-01-2016
8 2 4 4 02-01-2016
EDIT by comment:
If the real data contain no duplicate amounts for the same customer, invoice_nr and date, you can do it this way:
print (df)
index customer invoice_nr amount date
0 0 1 1 10 01-01-2016
1 1 1 1 -10 01-01-2016
2 2 1 1 11 01-01-2016
3 3 1 2 10 02-01-2016
4 4 2 3 7 01-01-2016
5 5 2 4 12 02-01-2016
6 6 2 4 -12 02-01-2016
7 7 2 4 8 02-01-2016
8 8 2 4 4 02-01-2016
df['amount_abs'] = df.amount.abs()
# keep=False removes every row that has a duplicate, i.e. both the
# original charge and its cancellation
df.drop_duplicates(['customer', 'invoice_nr', 'date', 'amount_abs'], keep=False, inplace=True)
df.drop('amount_abs', axis=1, inplace=True)
print (df)
index customer invoice_nr amount date
2 2 1 1 11 01-01-2016
3 3 1 2 10 02-01-2016
4 4 2 3 7 01-01-2016
7 7 2 4 8 02-01-2016
8 8 2 4 4 02-01-2016
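To tie it together, here is the drop_duplicates variant run end to end on the sample data (row order as in the frame printed above, without the extra index column); as the edit notes, this is only valid when an invoice never repeats the same absolute amount on one day:

```python
import pandas as pd

# sample data, in the row order of the answer's frame
df = pd.DataFrame({
    'customer':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'invoice_nr': [1, 1, 1, 2, 3, 4, 4, 4, 4],
    'amount':     [10, -10, 11, 10, 7, 12, -12, 8, 4],
    'date':       ['01-01-2016', '01-01-2016', '01-01-2016', '02-01-2016',
                   '01-01-2016', '02-01-2016', '02-01-2016', '02-01-2016',
                   '02-01-2016'],
})

# a charge and its cancellation share the same absolute amount, so with
# keep=False drop_duplicates removes both rows of each such pair
out = (df.assign(amount_abs=df['amount'].abs())
         .drop_duplicates(['customer', 'invoice_nr', 'date', 'amount_abs'],
                          keep=False)
         .drop(columns='amount_abs'))
print(out)
```

For the 3-million-row case this is also the fastest of the three approaches, since it avoids a Python-level function call per group.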