
Deleting rows which sum to zero in 1 column but are otherwise duplicates in pandas

Tags: python, pandas

I have a pandas dataframe of the following structure:

import pandas as pd

df = pd.DataFrame({'ID':['A001', 'A001', 'A001', 'A002', 'A002', 'A003', 'A003', 'A004', 'A004', 'A004', 'A005', 'A005'],
                   'Val1':[2, 2, 2, 5, 6, 8, 8, 3, 3, 3, 7, 7],
                   'Val2':[100, -100, 50, -40, 40, 60, -50, 10, -10, 10, 15, 15]})
    ID    Val1  Val2
 0  A001     2   100
 1  A001     2  -100
 2  A001     2    50
 3  A002     5   -40
 4  A002     6    40
 5  A003     8    60
 6  A003     8   -50
 7  A004     3    10
 8  A004     3   -10
 9  A004     3    10
10  A005     7    15
11  A005     7    15

I want to remove pairs of rows that share the same ID and Val1 and whose Val2 values sum to zero. The positive and negative Val2 rows of a pair are not necessarily consecutive, even within a groupby.

In the above sample data, rows 0 and 1, as well as 7, 8, 9 fulfill these criteria. I'd want to remove [0, 1], and either [7, 8] or [8, 9].

Another constraint here is that there could be entirely duplicate rows ([10, 11]). In this case, I want to keep both rows.

The desired output is thus:

    ID    Val1  Val2
 2  A001     2    50
 3  A002     5   -40
 4  A002     6    40
 5  A003     8    60
 6  A003     8   -50
 9  A004     3    10
10  A005     7    15
11  A005     7    15

Short of iterating over each row and looking for other rows which fit the criteria, I'm out of ideas for a more "pythonic" way to do this. Any help is much appreciated.

asked Oct 15 '20 at 11:10 by weirdpotatoes




1 Answer

I put some comments in the code, so hopefully my line of thought is clear:

import numpy as np

# helper column with the absolute value, so that values differing
# only by their sign sort next to each other
cond = df.assign(temp=df.Val2.abs())
cond = cond.sort_values(["ID", "Val1", "temp"])

# within each (ID, Val1, temp) group, the running sum hits zero on the
# row that cancels the values before it; Val1 is part of the grouping
# so that rows with different Val1 never cancel each other
cond["check"] = cond.groupby(["ID", "Val1", "temp"]).Val2.cumsum()
cond["check"] = np.where(cond.check != 0, np.nan, cond.check)

# the backward fill (one row at most) also marks the preceding row,
# so both rows of a pair that summed to zero carry check == 0
cond["check"] = cond["check"].bfill(limit=1)

# this is where we implement your other condition:
# drop rows that are duplicated on ID and Val1 and belong to
# a pair summing to zero, and keep everything else
cond.loc[
    ~(cond.duplicated(["ID", "Val1"], keep=False) & (cond.check == 0)),
    ["ID", "Val1", "Val2"],
]



    ID    Val1  Val2
 2  A001     2    50
 3  A002     5   -40
 4  A002     6    40
 6  A003     8   -50
 5  A003     8    60
 9  A004     3    10
10  A005     7    15
11  A005     7    15
answered Oct 05 '22 at 23:10 by sammywemmy
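A possible alternative, not part of the answer above: the cumsum check depends on the order of rows within a group (for example, Val2 values 10, 10, -10 in that order never produce a running sum of zero). The sketch below instead pairs the k-th positive row with the k-th negative row of the same magnitude using `groupby(...).cumcount()`. The helper name `drop_cancelling_pairs` and the temporary columns `_abs`, `_sign` and `_n` are made up for illustration, and the sketch assumes there are no missing values in ID, Val1 or Val2.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["A001", "A001", "A001", "A002", "A002", "A003",
           "A003", "A004", "A004", "A004", "A005", "A005"],
    "Val1": [2, 2, 2, 5, 6, 8, 8, 3, 3, 3, 7, 7],
    "Val2": [100, -100, 50, -40, 40, 60, -50, 10, -10, 10, 15, 15],
})


def drop_cancelling_pairs(frame, keys=("ID", "Val1"), value="Val2"):
    """Hypothetical helper: drop rows that pair up with an opposite-sign row."""
    keys = list(keys)
    work = frame[keys].copy()
    work["_abs"] = frame[value].abs()
    work["_sign"] = np.sign(frame[value])
    # Number each row among the rows sharing keys, magnitude and sign:
    # the first +10 gets 0, the second +10 gets 1, and so on.
    work["_n"] = work.groupby(keys + ["_abs", "_sign"]).cumcount()
    # The k-th positive row cancels the k-th negative row, so a row is part
    # of a pair when its (keys, magnitude, k) group contains both signs.
    signs = work.groupby(keys + ["_abs", "_n"])["_sign"].transform("nunique")
    return frame[signs.ne(2)]


result = drop_cancelling_pairs(df)
print(result)
```

On the sample data this drops rows 0, 1, 7 and 8 and keeps the fully duplicated rows 10 and 11, because two rows with the same sign never form a pair. The original row order is preserved, and when there are more rows of one sign than the other, the earliest ones are the ones removed.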