I have a pandas dataframe of the following structure:
import pandas as pd

df = pd.DataFrame({'ID':['A001', 'A001', 'A001', 'A002', 'A002', 'A003', 'A003', 'A004', 'A004', 'A004', 'A005', 'A005'],
                   'Val1':[2, 2, 2, 5, 6, 8, 8, 3, 3, 3, 7, 7],
                   'Val2':[100, -100, 50, -40, 40, 60, -50, 10, -10, 10, 15, 15]})
      ID  Val1  Val2
0   A001     2   100
1   A001     2  -100
2   A001     2    50
3   A002     5   -40
4   A002     6    40
5   A003     8    60
6   A003     8   -50
7   A004     3    10
8   A004     3   -10
9   A004     3    10
10  A005     7    15
11  A005     7    15
I want to remove rows where ID and Val1 are duplicated and where Val2 sums to zero across two rows. The positive and negative Val2 rows may not be consecutive, even within a groupby.
In the above sample data, rows 0 and 1, as well as 7, 8, 9 fulfill these criteria. I'd want to remove [0, 1], and either [7, 8] or [8, 9].
Another constraint here is that there could be entirely duplicate rows ([10, 11]). In this case, I want to keep both rows.
The desired output is thus:
      ID  Val1  Val2
2   A001     2    50
3   A002     5   -40
4   A002     6    40
5   A003     8    60
6   A003     8   -50
9   A004     3    10
10  A005     7    15
11  A005     7    15
Short of iterating over each row and looking for other rows which fit the criteria, I'm out of ideas for a more "pythonic" way to do this. Any help is much appreciated.
I put some comments in the code, so hopefully my line of thought is clear:
import numpy as np

# temp holds the absolute value, so +x and -x share the same key
cond = df.assign(temp=df.Val2.abs())
# sorting on temp places values that differ only by sign
# next to each other within each (ID, Val1) group
cond = cond.sort_values(["ID", "Val1", "temp"])
# cumsum yields zero once a value is cancelled by its opposite;
# Val1 is part of the grouping so that pairs are only formed
# between rows sharing both ID and Val1
cond["check"] = cond.groupby(["ID", "Val1", "temp"]).Val2.cumsum()
cond["check"] = np.where(cond.check != 0, np.nan, cond.check)
# the backward fill (limited to one row) marks the row just before
# each zero, so both rows of a cancelling pair carry check == 0
cond["check"] = cond["check"].bfill(limit=1)
# this is where we implement your other condition:
# drop only rows that are duplicated on (ID, Val1)
# and belong to a pair that sums to zero
cond.loc[
    ~(cond.duplicated(["ID", "Val1"], keep=False) & (cond.check == 0)),
    ["ID", "Val1", "Val2"],
]
      ID  Val1  Val2
2   A001     2    50
3   A002     5   -40
4   A002     6    40
6   A003     8   -50
5   A003     8    60
9   A004     3    10
10  A005     7    15
11  A005     7    15
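One caveat with the cumsum approach is that it relies on each cancelling pair sitting next to each other after the sort, which may not hold if a group contains several positive and several negative rows of the same magnitude. As a rough alternative sketch (not the method above), you could number the rows within each sign and pair the k-th positive with the k-th negative using `groupby(...).cumcount()`. The helper name `drop_cancelling_pairs` and the temporary columns `_abs`, `_sign` and `_rank` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["A001", "A001", "A001", "A002", "A002", "A003",
           "A003", "A004", "A004", "A004", "A005", "A005"],
    "Val1": [2, 2, 2, 5, 6, 8, 8, 3, 3, 3, 7, 7],
    "Val2": [100, -100, 50, -40, 40, 60, -50, 10, -10, 10, 15, 15],
})


def drop_cancelling_pairs(frame, keys=("ID", "Val1"), value="Val2"):
    """Drop rows that pair off one-to-one with an opposite-signed row.

    Hypothetical helper: pairs are only formed between rows sharing
    the same `keys` and the same absolute `value`.
    """
    work = frame.assign(_abs=frame[value].abs(), _sign=np.sign(frame[value]))
    # Number the rows inside each (keys, magnitude, sign) bucket: 0, 1, 2, ...
    work["_rank"] = work.groupby([*keys, "_abs", "_sign"]).cumcount()
    # The k-th positive and the k-th negative of one magnitude now share
    # (keys, _abs, _rank); a bucket containing both signs is a cancelling pair.
    signs_in_bucket = (
        work.groupby([*keys, "_abs", "_rank"])["_sign"].transform("nunique")
    )
    # Keep every row whose bucket does not contain both signs.
    return frame[(signs_in_bucket != 2).to_numpy()]


result = drop_cancelling_pairs(df)
print(result)
```

On the sample data this keeps rows 2, 3, 4, 5, 6, 9, 10 and 11 in their original order. For A004 it removes [7, 8] rather than [8, 9], and the fully duplicated rows [10, 11] survive because they have the same sign.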