Hello I have a pandas dataframe that I want to clean.Here is an example:
| IDBILL | IDBUYER | BILL | DATE |
|---|---|---|---|
| 001 | 768787 | 45 | 1897-07-24 |
| 001 | 768787 | 67 | 1897-07-24 |
| 001 | 768787 | 98 | 1897-07-24 |
| 002 | 768787 | 30 | 1897-07-24 |
| 002 | 768787 | 15 | 1897-07-24 |
| 002 | 768787 | 12 | 1897-07-24 |
| 005 | 786545 | 45 | 1897-08-19 |
| 008 | 657676 | 89 | 1989-09-23 |
| 009 | 657676 | 42 | 1989-09-23 |
| 010 | 657676 | 18 | 1989-09-23 |
| 012 | 657676 | 51 | 1990-03-10 |
| 016 | 892354 | 73 | 1990-03-10 |
| 018 | 892354 | 48 | 1765-02-14 |
| 020 | 892354 | 62 | 1765-02-14 |
I want to delete the highest bills(and keep the lowest when the bills are made on the same day, by the same IDBUYER, and whose bills IDs follow each other. To get this:
| IDBILL | IDBUYER | BILL | DATE |
|---|---|---|---|
| 002 | 768787 | 30 | 1897-07-24 |
| 002 | 768787 | 15 | 1897-07-24 |
| 002 | 768787 | 12 | 1897-07-24 |
| 005 | 786545 | 45 | 1897-08-19 |
| 010 | 657676 | 18 | 1989-09-23 |
| 012 | 657676 | 51 | 1990-03-10 |
| 016 | 892354 | 73 | 1990-03-10 |
| 018 | 892354 | 48 | 1765-02-14 |
| 020 | 892354 | 62 | 1765-02-14 |
Thank you in advance
One solution:
df = df.sort_values('BILL')
df.loc[df.assign(cc = df.groupby(['DATE','IDBUYER',df.groupby(['DATE','IDBUYER'])['IDBILL'].transform(lambda x: x.diff().gt(1).cumsum())]).cumcount(),cc2 = df.groupby(['DATE','IDBUYER','IDBILL']).transform('count'),floor = lambda x: ~(x['cc'].floordiv(x['cc2'],axis=0).astype(bool)))['floor']].sort_index()
or:
(df.loc[df.groupby(['DATE',df['IDBILL'].diff().gt(1).cumsum()])['BILL']
.rank('dense').eq(1)
.groupby([df['IDBILL'],df['IDBUYER']]).transform('any')])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With