
Scalable solution for str.contains with list of strings in pandas

I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference list.

Currently, I do it like this:

reference_list = ["words", "to", "remove"]
df1 = df1[~df1[0].str.contains(r"words")]
df1 = df1[~df1[0].str.contains(r"to")]
df1 = df1[~df1[0].str.contains(r"remove")]

This does not scale to thousands of words. However, when I do:

df1 = df1[~df1[0].str.contains(reference_word for reference_word in reference_list)]

I get the error: first argument must be string or compiled pattern.

Following this solution, I tried:

reference_list = "words|to|remove"
df1 = df1[~df1[0].str.contains(reference_list)]

This doesn't raise an exception, but it doesn't match all the words either.

How can I use str.contains efficiently with a list of words?

sudonym asked Dec 22 '17 07:12


1 Answer

For a scalable solution, do the following:

  1. join the contents of words by the regex OR pipe |
  2. pass this to str.contains
  3. use the result to filter df1

To select the first column by position, don't use df1[0]: it looks up the column labeled 0, which is ambiguous when the column labels are not the default integers. It would be better to use loc or iloc (see below).

words = ["words", "to", "remove"]
mask = df1.iloc[:, 0].str.contains(r'\b(?:{})\b'.format('|'.join(words)))
df1 = df1[~mask]

Note: This will also work if words is a Series.
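As a minimal, self-contained sketch of the steps above: the sample rows in df1 are made up for illustration, and the re.escape call and na=False argument are additions not in the original snippet. Escaping is assumed to be wanted here so that keywords containing regex metacharacters (such as "." or "+") are matched literally; na=False treats missing values as non-matches.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for df1
df1 = pd.DataFrame({0: [
    "keep this row",
    "words appear here",
    "going to town",
    "tomorrow is fine",
    "please remove me",
]})
words = ["words", "to", "remove"]

# Escape each keyword, join with the regex OR pipe, and wrap in word
# boundaries so that "to" does not match inside "tomorrow"
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, words)))

# True for rows containing any keyword as a whole word
mask = df1.iloc[:, 0].str.contains(pattern, na=False)
filtered = df1[~mask]
print(filtered)
```

The \b word boundaries are what keep "tomorrow is fine" in the result; a plain "words|to|remove" pattern would drop that row as well, because "to" occurs inside "tomorrow".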


Alternatively, if your 0th column is a column of words only (not sentences), then you can use Series.isin, which should be faster:

df1 = df1[~df1.iloc[:, 0].isin(words)]
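A small sketch of the isin variant, again with made-up sample data. Note that isin compares whole cell values for equality, so it only fits the case where each cell is a single word.

```python
import pandas as pd

# Hypothetical sample data: one word per cell
df1 = pd.DataFrame({0: ["keep", "words", "to", "tomorrow", "remove"]})
words = ["words", "to", "remove"]

# True where the cell value is exactly equal to one of the keywords
mask = df1.iloc[:, 0].isin(words)
filtered = df1[~mask]
print(filtered)
```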

cs95 answered Sep 28 '22 08:09