I have a df (Pandas Dataframe) with three rows:
some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"
The function df.col_name.str.contains("apple|banana")
will catch all of the rows:
"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".
How do I apply an AND operator to the str.contains()
method, so that it only grabs strings that contain BOTH "apple" and "banana"?
"apple and banana both are delicious"
I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, etc.).
Python's str.__contains__() is an instance method that returns True or False depending on whether the string contains the specified substring. It is normally invoked through the in operator (for example, "apple" in sentence). Note that the check is case-sensitive.
In pandas, Series.str.contains() tests whether a pattern or regex is contained within each string of a Series or Index, and returns a boolean Series or Index. Parameter: pat, a character sequence or regular expression.
Using "contains" to find a substring in a pandas DataFrame: the method returns a boolean Series that is True where the original value contains the substring and False where it does not. A basic call looks like Series.str.contains("substring").
You can do that as follows:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
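For the 10-20 word case from the question, chaining `&` by hand gets unwieldy. Here is a minimal sketch that builds one mask per word and combines them. The DataFrame and the `words` list are illustrative assumptions, and `regex=False` is used on the assumption that the words are plain substrings rather than patterns.

```python
from functools import reduce
import operator

import pandas as pd

# Sample data mirroring the question (assumed column name: col_name).
df = pd.DataFrame({"col_name": ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

# Illustrative word list; extend it to as many words as needed.
words = ["apple", "banana"]

# One boolean mask per word, treating each word as a literal substring.
masks = [df["col_name"].str.contains(word, regex=False) for word in words]

# AND all masks together element-wise.
all_present = reduce(operator.and_, masks)

result = df[all_present]
print(result)
```

This keeps each test a simple substring check, which can be easier to debug than one large regex.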
You can also do it with a single regex, using one lookahead per word:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then build your list of words into a regex string like so:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat'] # example
base.format(''.join(expr.format(w) for w in words))
will render:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
You can then pass the resulting pattern to str.contains() and filter dynamically.
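Putting the pieces together, a sketch of the dynamic version might look like the following. The helper name `build_all_words_pattern` is hypothetical, and the `re.escape` call is an added precaution (not part of the answer above) so that words containing regex metacharacters are matched literally.

```python
import re

import pandas as pd


def build_all_words_pattern(words):
    """Return a regex that matches strings containing every word (hypothetical helper)."""
    # One lookahead per word; re.escape guards against metacharacters such as '.' or '+'.
    lookaheads = "".join("(?=.*{})".format(re.escape(w)) for w in words)
    return "^" + lookaheads


pattern = build_all_words_pattern(["apple", "banana", "cat"])
print(pattern)

df = pd.DataFrame({"col_name": ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

# Keep only the rows that contain both words.
matches = df[df["col_name"].str.contains(build_all_words_pattern(["apple", "banana"]))]
print(matches)
```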
import pandas as pd
df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})
targets = ['apple', 'banana']
# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0 True
1 True
2 True
Name: col, dtype: bool
# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0 False
1 False
2 True
Name: col, dtype: bool
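One caveat with `word in sentence`: it is a substring test, so "apple" also matches inside "pineapple". If whole words are what you want, one possible variation (an assumption about the intent, not something the answer above states) is to split each sentence and compare sets. The sample data here is made up for illustration.

```python
import pandas as pd

# Illustrative data: the first row contains "pineapple", not the word "apple".
df = pd.DataFrame({"col": ["pineapple and banana smoothie",
                           "apple and banana both are delicious"]})

targets = {"apple", "banana"}

# Substring test: "apple" is found inside "pineapple".
substring_mask = df["col"].apply(lambda s: all(w in s for w in targets))

# Whole-word test: split on whitespace and require every target to be a token.
whole_word_mask = df["col"].apply(lambda s: targets.issubset(s.split()))

print(substring_mask.tolist())
print(whole_word_mask.tolist())
```

Note that splitting on whitespace does not strip punctuation, so "banana," would not count as "banana" in this sketch.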
This also works (regex=True is the default, so the argument is optional):
df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)
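If the column may contain missing values, the mask returned by `str.contains` can include missing entries rather than only True/False, and boolean indexing may reject it depending on the pandas version and dtype. A small sketch using the `na` parameter to treat missing values as non-matches (the sample data is assumed):

```python
import pandas as pd

# Illustrative data with a missing value in the middle.
df = pd.DataFrame({"col": ["apple is delicious",
                           None,
                           "apple and banana both are delicious"]})

# na=False makes missing values count as "no match" in the resulting mask.
mask = df["col"].str.contains(r"(?=.*apple)(?=.*banana)", na=False)

result = df[mask]
print(result)
```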