 

How to drop rows in Pandas dataframe by multiple criteria imposed on two columns?

Tags: python, pandas

Here's a toy example that captures my problem. Any help please? Thanks!

import pandas as pd

d = {'a': [1,1,1,2,2,2,3,3,3],
     'b': [1,2,3,1,2,3,1,2,3]}

df = pd.DataFrame(d)

I want to drop the two rows with (a, b) = (1, 3) or (2, 1), aiming for this result:

result = pd.DataFrame({'a': [1,1,2,2,3,3,3],
                       'b': [1,2,2,3,1,2,3]})

In reality, I will have an exclusion list that is updated over time: excl = [[1,3],[2,1],[3,4], ...]

Kelvin Yuen asked Jan 27 '23 13:01



2 Answers

This feels like firing a cannon when we should be able to just wave our hands, but:

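df = pd.DataFrame({'a': [1,1,1,1,2,2,2,3,3,3],
                   'b': [1,1,2,3,1,2,3,1,2,3]})

excl = [[1, 3], [2, 1]]
# Left-merge on the shared columns; indicator=True adds a '_merge' column
# that reads 'left_only' for rows of df with no match in excl.
keep = df.merge(pd.DataFrame(excl, columns=['a','b']),
                how='left', indicator=True)['_merge'] == 'left_only'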

gives me

In [91]: df.loc[keep]
Out[91]: 
   a  b
0  1  1
1  1  1
2  1  2
5  2  2
6  2  3
7  3  1
8  3  2
9  3  3

(Note that I added a duplicate (1, 1) row as a sanity check.)
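Since the exclusion list in the question changes over time, the merge-with-indicator idea above could be wrapped in a small helper. This is only a sketch: the name drop_excluded and its cols parameter are made up for illustration, and it assumes excl is a non-empty list of pairs. The mask is converted to a NumPy array so that it is applied by position, which should also work when df does not have the default 0..n-1 index.

```python
import pandas as pd

def drop_excluded(df, excl, cols=("a", "b")):
    """Return df without the rows whose values in `cols` appear in excl."""
    cols = list(cols)
    # Deduplicate so the left merge yields exactly one row per row of df
    excl_df = pd.DataFrame(excl, columns=cols).drop_duplicates()
    merged = df[cols].merge(excl_df, on=cols, how="left", indicator=True)
    # Positional Boolean mask: True where the row has no match in excl
    keep = (merged["_merge"] == "left_only").to_numpy()
    return df[keep]

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "b": [1, 2, 3, 1, 2, 3, 1, 2, 3]})
excl = [[1, 3], [2, 1], [3, 4]]  # (3, 4) is not in df and is simply ignored
result = drop_excluded(df, excl)
print(result)
```

Calling drop_excluded(df, excl) again after excl has been updated gives the newly filtered frame without modifying df.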

Crazy method #2: use (effectively) a categorical encoding:

edf = pd.DataFrame(excl, columns=['a','b'])  # exclusion pairs as a dataframe
codes = pd.concat([df, edf], sort=False).groupby(["a","b"]).ngroup()
keep = ~codes.iloc[:len(df)].isin(codes.iloc[len(df):])
df = df.loc[keep]
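To see what the encoding does, it may help to look at the intermediate labels. The sketch below reruns method #2 step by step on the ten-row frame from this answer; the names df_codes, excl_codes and filtered are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "b": [1, 1, 2, 3, 1, 2, 3, 1, 2, 3]})
edf = pd.DataFrame([[1, 3], [2, 1]], columns=["a", "b"])

# ngroup() assigns one integer label per distinct (a, b) pair,
# computed over the rows of both frames stacked together
codes = pd.concat([df, edf], sort=False).groupby(["a", "b"]).ngroup()
df_codes = codes.iloc[:len(df)]    # labels belonging to the rows of df
excl_codes = codes.iloc[len(df):]  # labels belonging to the excluded pairs

# A row of df is kept when its label is not among the excluded labels
keep = ~df_codes.isin(excl_codes)
filtered = df.loc[keep]
print(df_codes.tolist(), excl_codes.tolist())
print(filtered)
```

Both (1, 1) rows share a label, so duplicates are kept or dropped together.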
DSM answered Feb 21 '23 10:02



Convert the list of "forbidden" rows into a dataframe whose column names differ from those of the original dataframe:

to_drop = pd.DataFrame(excl, columns=('c','d')) # Different column names!

Merge the two dataframes. There will be NaNs where there is a mismatch:

combined = df.merge(to_drop, how='outer', left_on=['a','b'], right_on=['c','d'])

Take any column originally from the second dataframe, find out where the NaNs are, and use their indexes to extract valid rows from the first dataframe:

df[combined.isnull()['d']]
#   a  b
#0  1  1
#1  1  2
#4  2  2
#5  2  3
#6  3  1
#7  3  2
#8  3  3

You may see a warning:

UserWarning: Boolean Series key will be reindexed to match DataFrame index.

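It is raised when the index of the Boolean mask (taken from combined) is not identical to the index of df, for example when excl contains pairs that never occur in df and the outer merge appends extra rows. For this example you can disregard it, but the mask lines up with df only as long as the merge keeps the rows of df in their original order.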
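One way to sidestep the warning, sketched here as a variation rather than as part of the original answer, is to merge with how='left' and apply the mask by position. This assumes to_drop has no duplicate rows, so the merged frame has exactly one row per row of df, in the same order.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "b": [1, 2, 3, 1, 2, 3, 1, 2, 3]})
excl = [[1, 3], [2, 1], [3, 4]]  # (3, 4) never occurs in df

to_drop = pd.DataFrame(excl, columns=["c", "d"])  # different column names
# A left merge keeps only the rows of df, so no extra rows are appended
combined = df.merge(to_drop, how="left", left_on=["a", "b"], right_on=["c", "d"])
# NaN in 'd' means the row matched nothing in to_drop; to_numpy() makes the
# mask positional, so no index alignment (and no warning) is involved
mask = combined["d"].isnull().to_numpy()
valid = df[mask]
print(valid)
```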

DYZ answered Feb 21 '23 10:02
