Scalable solution for str.contains with list of strings in pandas

Tags: python, string, regex, pandas, dataframe

I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference list.

Currently, I do it like this:

reference_list = ["words", "to", "remove"]
df1 = df1[~df1[0].str.contains(r"words")]
df1 = df1[~df1[0].str.contains(r"to")]
df1 = df1[~df1[0].str.contains(r"remove")]

This does not scale to thousands of words. However, when I do:

df1 = df1[~df1[0].str.contains(reference_word for reference_word in reference_list)]

I get the error "first argument must be string or compiled pattern".

Following this solution, I tried:

reference_list = "words|to|remove"
df1 = df1[~df1[0].str.contains(reference_list)]

This doesn't raise an exception, but it doesn't match all of the words either.

How can I use str.contains effectively with a list of words?


asked Dec 22 '17 07:12

sudonym

1 Answer

For a scalable solution, do the following:

1. Join the contents of words with the regex OR pipe |.
2. Pass the result to str.contains.
3. Use the resulting mask to filter df1.

To select the 0th column, don't use df1[0]: it looks up a column labelled 0 rather than the column at position 0, so it can be ambiguous. It is better to use loc or iloc (see below).
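To make the ambiguity concrete, here is a minimal sketch. The frame df_demo and its column labels "text" and "n" are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical frame: the first column is labelled "text", not 0.
df_demo = pd.DataFrame({"text": ["some words here", "keep me"], "n": [1, 2]})

# Positional access returns the first column whatever its label is.
first_col = df_demo.iloc[:, 0]

# Label-based access with 0 fails here, because no column is labelled 0.
try:
    df_demo[0]
    label_lookup_ok = True
except KeyError:
    label_lookup_ok = False
```

With the default integer column labels both forms return the same column, which is why df1[0] happens to work in the question.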

words = ["words", "to", "remove"]
mask = df1.iloc[:, 0].str.contains(r'\b(?:{})\b'.format('|'.join(words)))
df1 = df1[~mask]

Note: This will also work if words is a Series.
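The following is a self-contained sketch of the same approach on made-up sample data. Two details are additions of mine rather than part of the snippet above: re.escape, on the assumption that some keywords may contain regex metacharacters, and na=False, on the assumption that the column may contain missing values.

```python
import re
import pandas as pd

# Hypothetical sample data, including a missing value.
df1 = pd.DataFrame({0: [
    "some words here",
    "going to town",
    "a tomato plant",
    "please remove this",
    "keep this row",
    None,
]})

words = ["words", "to", "remove"]

# Escape each keyword so characters such as "." or "+" are matched literally,
# then join the keywords into one alternation wrapped in word boundaries.
pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in words))

# na=False treats missing values as "no match", so the mask stays boolean.
mask = df1.iloc[:, 0].str.contains(pattern, na=False)
filtered = df1[~mask]
```

Because of the \b word boundaries, "a tomato plant" is kept even though it contains the letters "to". A plain "words|to|remove" pattern would drop that row as well.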

Alternatively, if your 0th column contains single words only (not sentences), you can use Series.isin, which should be faster:

df1 = df1[~df1.iloc[:, 0].isin(words)]
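As a small sketch of why this only suits single-word cells: isin compares each whole cell against the keywords, so a sentence that merely contains a keyword is not matched. The Series cells below is made-up example data.

```python
import pandas as pd

words = ["words", "to", "remove"]

# Hypothetical column: three single-word cells and one sentence.
cells = pd.Series(["words", "to", "keep", "words to remove"])

# isin tests each cell for exact equality with one of the keywords.
exact = cells.isin(words)

# Only cells that are exactly equal to a keyword are dropped.
kept = cells[~exact]
```

Here the sentence "words to remove" survives the filter, so for free text the str.contains approach is the one to use.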

answered Sep 28 '22 08:09

cs95
