Pandas: Efficiently subset DataFrame based on strings containing certain values

To illustrate what I want to achieve, here is a DataFrame called df:

column1  column2  
1        foo faa
2        bar car
3        dog dog
4        cat rat
5        foo foo
6        bar cat
7        bird rat
8        cat dog
9        bird foo
10       bar car

I want to subset the DataFrame, dropping every row whose string in column2 contains any one of several values.

This is easy enough for a single value, in this instance 'foo':

df = df[~df['column2'].str.contains("foo")]
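For reference, here is a self-contained sketch of the single-value case. The DataFrame is rebuilt by hand from the table above, assuming column1 holds the numbers and column2 holds the two-word strings. The name no_foo and the regex=False argument are additions for illustration; regex=False simply treats "foo" as a literal substring.

```python
import pandas as pd

# Rebuild the example table (assumed layout: column1 = number, column2 = string).
df = pd.DataFrame({
    "column1": range(1, 11),
    "column2": ["foo faa", "bar car", "dog dog", "cat rat", "foo foo",
                "bar cat", "bird rat", "cat dog", "bird foo", "bar car"],
})

# Keep only the rows whose column2 string does NOT contain "foo".
no_foo = df[~df["column2"].str.contains("foo", regex=False)]
print(no_foo)
```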

But let's say I wanted to drop all rows in which the strings in column2 contained 'cat' or 'foo'. As applied to df above, this would drop 6 rows.

What would be the most efficient, most pythonic way to do this? This could be in the form of a function, multiple booleans, or something else I'm not thinking of.

isin doesn't work as it requires exact matches.
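A small sketch of that difference, using a made-up Series (the values and the names s, exact and partial are illustrative, not taken from df):

```python
import pandas as pd

s = pd.Series(["foo faa", "cat rat", "foo", "bar car"])

# isin compares each whole value against the list, so only the exact "foo" matches.
exact = s.isin(["cat", "foo"])

# str.contains looks for the substring anywhere inside each value.
partial = s.str.contains("foo", regex=False)

print(exact.tolist())    # [False, False, True, False]
print(partial.tolist())  # [True, False, True, False]
```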

N.B: I have edited this question as I made a mistake with it the first time round. Apologies.

RDJ asked Feb 10 '26 17:02


1 Answer

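Use isin to test each value for membership in a list of values, and negate the resulting boolean mask with ~. isin matches whole values only, which fits single-word entries like those in the output below: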

In [3]:
vals = ['bird','cat','foo']

df[~df['column2'].isin(vals)]
Out[3]:
   column1 column2
1        2     bar
2        3     dog
5        6     bar
9       10     bar

In [4]:
df['column2'].isin(vals)

Out[4]:
0     True
1    False
2    False
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: column2, dtype: bool
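The session above matches whole values. If column2 holds multi-word strings, as in the question's table, one common approach is to join the values into a regular-expression alternation and pass it to str.contains. The following is only a sketch under that assumption, not part of the session above: the DataFrame is rebuilt by hand from the question's table, and the names pattern and result are illustrative.

```python
import re

import pandas as pd

# Rebuild the question's table (assumed layout: column1 = number, column2 = string).
df = pd.DataFrame({
    "column1": range(1, 11),
    "column2": ["foo faa", "bar car", "dog dog", "cat rat", "foo foo",
                "bar cat", "bird rat", "cat dog", "bird foo", "bar car"],
})

vals = ["cat", "foo"]

# Build "cat|foo"; re.escape guards against values containing regex metacharacters.
pattern = "|".join(re.escape(v) for v in vals)

# Drop every row whose column2 string contains any of the values.
result = df[~df["column2"].str.contains(pattern)]
print(result)
```

Because str.contains treats the pattern as a regular expression by default, a single pass over the column handles all values at once, rather than one mask per value.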
EdChum answered Feb 12 '26 07:02