
Remove cancelling rows from Pandas Dataframe


I have a list of invoices sent out to customers. However, sometimes a bad invoice is sent, which is later cancelled. My Pandas DataFrame looks something like this, except much larger (~3 million rows):

index | customer | invoice_nr | amount | date
---------------------------------------------------
0     | 1        | 1          | 10     | 01-01-2016
1     | 1        | 1          | -10    | 01-01-2016
2     | 1        | 1          | 11     | 01-01-2016
3     | 1        | 2          | 10     | 02-01-2016
4     | 2        | 3          | 7      | 01-01-2016
5     | 2        | 4          | 12     | 02-01-2016
6     | 2        | 4          | 8      | 02-01-2016
7     | 2        | 4          | -12    | 02-01-2016
8     | 2        | 4          | 4      | 02-01-2016
...   | ...      | ...        | ...    | ...
...   | ...      | ...        | ...    | ...
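
For anyone who wants to reproduce this, the sample rows above could be built roughly as follows. This is a minimal sketch: the variable name df and keeping the dates as plain strings are assumptions, not something known about the real data.

```python
import pandas as pd

# Rows copied from the table above; dates are kept as strings (assumption).
df = pd.DataFrame(
    {
        "customer":   [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "invoice_nr": [1, 1, 1, 2, 3, 4, 4, 4, 4],
        "amount":     [10, -10, 11, 10, 7, 12, 8, -12, 4],
        "date": [
            "01-01-2016", "01-01-2016", "01-01-2016", "02-01-2016",
            "01-01-2016", "02-01-2016", "02-01-2016", "02-01-2016",
            "02-01-2016",
        ],
    }
)
df.index.name = "index"
```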

Now, I want to drop all pairs of rows for which the customer, invoice_nr and date are identical and the amounts are equal in size but opposite in sign.
Corrections of invoices always take place on the same day and carry the same invoice number. The invoice number is uniquely bound to the customer and always corresponds to one transaction (which can consist of multiple components, as for customer = 2, invoice_nr = 4). Corrections only occur either to change the amount charged or to split the amount into smaller components. Hence, the cancelled value is not repeated on the same invoice_nr.

Any help on how to program this would be much appreciated.

Niels Alebregtse asked Aug 08 '16 13:08




2 Answers

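def remove_cancelled_transactions(group):
    # All rows in a group share the same absolute amount.
    trans_neg = group.amount < 0
    # Drop every negative row and the row directly before it.
    # fill_value=False keeps the shifted mask boolean.
    return group.loc[~(trans_neg | trans_neg.shift(-1, fill_value=False))]

# Group by customer, invoice, date and absolute amount.
groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
df.groupby(groups, as_index=False, group_keys=False) \
  .apply(remove_cancelled_transactions)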
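
The function above drops each negative row together with the row directly before it in its group, so it relies on the cancelling row coming right after the row it cancels. If that order is not guaranteed, one possible alternative (a different technique, not the code above) is to number the positive and negative rows inside each group and treat the k-th +x and the k-th -x as a pair. A minimal sketch, assuming the column names from the question; the names drop_cancelling_pairs, keys, rank and demo are made up for illustration:

```python
import numpy as np
import pandas as pd

def drop_cancelling_pairs(df):
    """Drop rows that pair up as +x / -x for the same customer, invoice and date."""
    keys = df[["customer", "invoice_nr", "date"]].assign(
        abs_amount=df["amount"].abs(),
        sign=np.sign(df["amount"]),
    )
    # Number the rows of each sign within a group: first +x is 0, first -x is 0, ...
    keys["rank"] = keys.groupby(
        ["customer", "invoice_nr", "date", "abs_amount", "sign"]
    ).cumcount()
    # The k-th +x and the k-th -x agree on everything except the sign,
    # so they are the only rows flagged as duplicates here.
    paired = keys.duplicated(
        ["customer", "invoice_nr", "date", "abs_amount", "rank"], keep=False
    )
    return df[~paired.to_numpy()]

# Hypothetical rows: the cancellation comes first and the amount 10 is re-issued.
demo = pd.DataFrame(
    {
        "customer": [1, 1, 1, 1],
        "invoice_nr": [1, 1, 1, 1],
        "amount": [-10, 10, 10, 11],
        "date": ["01-01-2016"] * 4,
    }
)
cleaned = drop_cancelling_pairs(demo)
```

In this demo only the first -10 / +10 pair is removed; the second +10 and the 11 remain. Everything is column-wise, so no Python function is called per group.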

piRSquared answered Sep 24 '22 16:09



You can use filter to select the rows of every group whose amounts sum to 0 and whose row count is even:

print (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
        .filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0)))

       customer  invoice_nr  amount        date
index                                          
0             1           1      10  01-01-2016
1             1           1     -10  01-01-2016
5             2           4      12  02-01-2016
6             2           4     -12  02-01-2016

idx = (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
         .filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0))
         .index)

print (idx)      
Int64Index([0, 1, 5, 6], dtype='int64', name='index')

print (df.drop(idx))  
       customer  invoice_nr  amount        date
index                                          
2             1           1      11  01-01-2016
3             1           2      10  02-01-2016
4             2           3       7  01-01-2016
7             2           4       8  02-01-2016
8             2           4       4  02-01-2016
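
With around 3 million rows, a lambda called once per group may be slow. The same two conditions can probably be computed for all groups at once with groupby(...).transform. This is only a sketch under the assumption that the columns are named as in the question; the helper column amount_abs and the names tmp, grouped, cancels and result are illustrative:

```python
import pandas as pd

# Sample rows in the order used in this answer's output.
df = pd.DataFrame(
    {
        "customer":   [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "invoice_nr": [1, 1, 1, 2, 3, 4, 4, 4, 4],
        "amount":     [10, -10, 11, 10, 7, 12, -12, 8, 4],
        "date": ["01-01-2016"] * 3 + ["02-01-2016", "01-01-2016"]
                + ["02-01-2016"] * 4,
    }
)

# Helper column so the grouping can use plain column names.
tmp = df.assign(amount_abs=df["amount"].abs())
grouped = tmp.groupby(["customer", "invoice_nr", "date", "amount_abs"])["amount"]

# Same conditions as the lambda: group sum is 0 and group size is even.
cancels = (grouped.transform("sum") == 0) & (grouped.transform("size") % 2 == 0)
result = df[~cancels]
```

Here cancels is a boolean Series aligned with df, so it can be used directly as a row mask instead of collecting an index and calling drop.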

EDIT after a comment:

If the real data contain no duplicated amounts for the same customer, invoice and date, you can use drop_duplicates with keep=False instead:

print (df)
   index  customer  invoice_nr  amount        date
0      0         1           1      10  01-01-2016
1      1         1           1     -10  01-01-2016
2      2         1           1      11  01-01-2016
3      3         1           2      10  02-01-2016
4      4         2           3       7  01-01-2016
5      5         2           4      12  02-01-2016
6      6         2           4     -12  02-01-2016
7      7         2           4       8  02-01-2016
8      8         2           4       4  02-01-2016

df['amount_abs'] = df.amount.abs()
df.drop_duplicates(['customer','invoice_nr', 'date', 'amount_abs'], keep=False, inplace=True)
df.drop('amount_abs', axis=1, inplace=True)
print (df)
   index  customer  invoice_nr  amount        date
2      2         1           1      11  01-01-2016
3      3         1           2      10  02-01-2016
4      4         2           3       7  01-01-2016
7      7         2           4       8  02-01-2016
8      8         2           4       4  02-01-2016
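
If the original frame should stay untouched, the same idea can be written without inplace=True by building a boolean mask with duplicated. A minimal sketch, with drop_cancelled, keys and repeated as made-up names; like the code above, it assumes an absolute amount only repeats on an invoice when one row cancels the other:

```python
import pandas as pd

def drop_cancelled(df):
    """Return df without rows whose customer, invoice, date and |amount| repeat."""
    keys = df[["customer", "invoice_nr", "date"]].assign(
        amount_abs=df["amount"].abs()
    )
    # keep=False flags every member of a repeated key, not just the later ones.
    repeated = keys.duplicated(keep=False)
    return df[~repeated.to_numpy()]

# Sample rows in the order printed above.
df = pd.DataFrame(
    {
        "customer":   [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "invoice_nr": [1, 1, 1, 2, 3, 4, 4, 4, 4],
        "amount":     [10, -10, 11, 10, 7, 12, -12, 8, 4],
        "date": ["01-01-2016"] * 3 + ["02-01-2016", "01-01-2016"]
                + ["02-01-2016"] * 4,
    }
)
cleaned = drop_cancelled(df)
```

Note that two equal positive amounts on the same invoice and date would both be removed by this check, which is why the assumption above matters.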
jezrael answered Sep 24 '22 16:09
