Remove cancelling rows from Pandas Dataframe

I have a list of invoices sent out to customers. However, sometimes a bad invoice is sent, which is later cancelled. My pandas DataFrame looks something like this, except much larger (~3 million rows):

index | customer | invoice_nr | amount | date
---------------------------------------------------
0     | 1        | 1          | 10     | 01-01-2016
1     | 1        | 1          | -10    | 01-01-2016
2     | 1        | 1          | 11     | 01-01-2016
3     | 1        | 2          | 10     | 02-01-2016
4     | 2        | 3          | 7      | 01-01-2016
5     | 2        | 4          | 12     | 02-01-2016
6     | 2        | 4          | 8      | 02-01-2016
7     | 2        | 4          | -12    | 02-01-2016
8     | 2        | 4          | 4      | 02-01-2016
...   | ...      | ...        | ...    | ...
...   | ...      | ...        | ...    | ...

Now I want to drop every pair of rows for which customer, invoice_nr and date are identical but the amounts are opposite (for example 10 and -10).
Corrections of invoices always take place on the same day, with an identical invoice number. The invoice number is uniquely bound to the customer and always corresponds to one transaction (which can consist of multiple components, as for customer = 2, invoice_nr = 4). Corrections only occur either to change the amount charged or to split the amount into smaller components. Hence, the cancelled value is not repeated on the same invoice_nr.
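
For reproducibility, the sample above can be rebuilt roughly as follows. This is only a sketch: keeping the dates as plain strings and the name expected_kept are assumptions made for illustration, and the expected labels are read off the rule stated above rather than produced by any particular solution.

```python
import pandas as pd

# Sample data from the table above; dates are kept as plain strings (an assumption).
df = pd.DataFrame(
    {
        "customer": [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "invoice_nr": [1, 1, 1, 2, 3, 4, 4, 4, 4],
        "amount": [10, -10, 11, 10, 7, 12, 8, -12, 4],
        "date": [
            "01-01-2016", "01-01-2016", "01-01-2016", "02-01-2016", "01-01-2016",
            "02-01-2016", "02-01-2016", "02-01-2016", "02-01-2016",
        ],
    }
)
df.index.name = "index"

# Rows 0/1 and 5/7 cancel each other, so these index labels should remain.
expected_kept = [2, 3, 4, 6, 8]  # hypothetical name, used only for illustration
print(df.loc[expected_kept])
```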

Any help on how to program this would be much appreciated.

asked Aug 08 '16 13:08

Niels Alebregtse

2 Answers

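def remove_cancelled_transactions(df):
    # Flag the negative (cancelling) rows within this group.
    trans_neg = df.amount < 0
    # Drop each negative row and the row directly before it;
    # fill_value=False keeps the shifted mask boolean instead of introducing NaN.
    return df.loc[~(trans_neg | trans_neg.shift(-1, fill_value=False))]

# Rows sharing customer, invoice, date and absolute amount form one group.
groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
df.groupby(groups, as_index=False, group_keys=False) \
  .apply(remove_cancelled_transactions)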

answered Sep 24 '22 16:09

piRSquared

You can use filter to select the groups (same customer, invoice_nr, date and absolute amount) whose amounts sum to 0 and whose number of rows is even:

print (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
        .filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0)))

       customer  invoice_nr  amount        date
index                                          
0             1           1      10  01-01-2016
1             1           1     -10  01-01-2016
5             2           4      12  02-01-2016
6             2           4     -12  02-01-2016

# Parentheses are needed so the chained call can span two lines.
idx = (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
         .filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0)).index)

print (idx)      
Int64Index([0, 1, 5, 6], dtype='int64', name='index')

print (df.drop(idx))  
       customer  invoice_nr  amount        date
index                                          
2             1           1      11  01-01-2016
3             1           2      10  02-01-2016
4             2           3       7  01-01-2016
7             2           4       8  02-01-2016
8             2           4       4  02-01-2016
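
On a frame with millions of rows, a Python lambda inside filter can be slow. The same condition (even group size and zero sum) can also be expressed with transform, which avoids calling a Python function per group. The following is a sketch, not a benchmarked replacement; the helper column abs_amount and the names work, grouped and cancelled are assumptions introduced here, and the data is the sample table from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer": [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "invoice_nr": [1, 1, 1, 2, 3, 4, 4, 4, 4],
        "amount": [10, -10, 11, 10, 7, 12, 8, -12, 4],
        "date": [
            "01-01-2016", "01-01-2016", "01-01-2016", "02-01-2016", "01-01-2016",
            "02-01-2016", "02-01-2016", "02-01-2016", "02-01-2016",
        ],
    }
)

# Helper column so that +x and -x fall into the same group.
work = df.assign(abs_amount=df["amount"].abs())
grouped = work.groupby(["customer", "invoice_nr", "date", "abs_amount"])["amount"]

# Same condition as the lambda: an even number of rows and a zero sum.
cancelled = (grouped.transform("size") % 2 == 0) & (grouped.transform("sum") == 0)

# Keep only the rows that are not part of a cancelling group.
result = df[~cancelled]
print(result)
```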

EDIT by comment:

If, in the real data, an amount is never repeated for the same customer, invoice and date (other than by its cancellation), you can use this approach instead:

print (df)
   index  customer  invoice_nr  amount        date
0      0         1           1      10  01-01-2016
1      1         1           1     -10  01-01-2016
2      2         1           1      11  01-01-2016
3      3         1           2      10  02-01-2016
4      4         2           3       7  01-01-2016
5      5         2           4      12  02-01-2016
6      6         2           4     -12  02-01-2016
7      7         2           4       8  02-01-2016
8      8         2           4       4  02-01-2016

df['amount_abs'] = df.amount.abs()
df.drop_duplicates(['customer','invoice_nr', 'date', 'amount_abs'], keep=False, inplace=True)
df.drop('amount_abs', axis=1, inplace=True)
print (df)
   index  customer  invoice_nr  amount        date
2      2         1           1      11  01-01-2016
3      3         1           2      10  02-01-2016
4      4         2           3       7  01-01-2016
7      7         2           4       8  02-01-2016
8      8         2           4       4  02-01-2016
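
Both approaches above rely on the cancelled value not being repeated on the same invoice, which the question states is the case. Should that assumption ever fail (for example 10, -10, 10 on one invoice), one possible variant pairs the n-th positive row with the n-th negative row and drops only complete pairs. This is a sketch under that assumption; the function name drop_cancelled_pairs and the helper columns abs_amount, is_negative and occurrence are hypothetical.

```python
import pandas as pd


def drop_cancelled_pairs(df):
    """Hypothetical helper: drop each negative row together with one matching positive row."""
    keys = ["customer", "invoice_nr", "date", "abs_amount"]
    work = df.assign(abs_amount=df["amount"].abs(), is_negative=df["amount"] < 0)
    # Number the positive rows and the negative rows separately within each group.
    work["occurrence"] = work.groupby(keys + ["is_negative"]).cumcount()
    # The n-th positive and the n-th negative row of a group form a cancelling pair.
    pair_size = work.groupby(keys + ["occurrence"])["amount"].transform("size")
    return df[pair_size != 2]


# Small example in which the cancelled amount (10) is re-issued afterwards.
df = pd.DataFrame(
    {
        "customer": [1, 1, 1, 1],
        "invoice_nr": [1, 1, 1, 1],
        "amount": [10, -10, 10, 11],
        "date": ["01-01-2016"] * 4,
    }
)
print(drop_cancelled_pairs(df))
```

Here the first 10 and the -10 are removed as a pair, while the second 10 and the 11 are kept.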

answered Sep 24 '22 16:09

jezrael
