Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ignoring NaNs with str.contains

Tags:

python

pandas

People also ask

Is str contains case sensitive?

str. contains has a case parameter that is True by default. Set it to False to do a case insensitive match.

Where does STR contain?

contains. Test if pattern or regex is contained within a string of a Series or Index. Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

How do I remove NaN from pandas?

By using dropna() method you can drop rows with NaN (Not a Number) and None values from pandas DataFrame. Note that by default it returns the copy of the DataFrame after removing rows. If you wanted to remove from the existing DataFrame, you should use inplace=True .


There's a flag for that:

In [11]: df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])

In [12]: df.a.str.contains("foo")
Out[12]:
0     True
1     True
2    False
3      NaN
Name: a, dtype: object

In [13]: df.a.str.contains("foo", na=False)
Out[13]:
0     True
1     True
2    False
3    False
Name: a, dtype: bool

See the str.replace docs:

na : default NaN, fill value for missing values.


So you can do the following:

In [21]: df.loc[df.a.str.contains("foo", na=False)]
Out[21]:
      a
0  foo1
1  foo2

In addition to the above answers, I would say for columns having no single word name, you may use:-

df[df['Product ID'].str.contains("foo") == True]

Hope this helps.


df[df.col.str.contains("foo").fillna(False)]

I'm not 100% on why (actually came here to search for the answer), but this also works, and doesn't require replacing all nan values.

import pandas as pd
import numpy as np

df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])

newdf = df.loc[df['a'].str.contains('foo') == True]

Works with or without .loc.

I have no idea why this works, as I understand it when you're indexing with brackets pandas evaluates whatever's inside the bracket as either True or False. I can't tell why making the phrase inside the brackets 'extra boolean' has any effect at all.


You can also use query method to query the columns of a DataFrame with a boolean expression as follows:

df.query('a.str.contains("foo", na=False)')

Note you might not get performance improvement, but it is more readable (arguably).