I have a DataFrame that I want to match against some keywords (I want to detect the rows that contain those keywords). I managed to get the job done this way, but I wonder if there's a better way to do it, knowing that I might have up to 10 or 20 different keywords.
df1 = df[df['column1'].str.contains("keyword1") | df['column1'].str.contains('keyword2')]
(I'm a beginner please keep it as simple as possible)
For OR logic you can create a single pattern by joining the words with |. Store your 10-20 words in a list, then use '|'.join(that_list).
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['foo', 'bar', 'baz', 'foobar', 'boo']})
words = ['foo', 'bar']
df['foo_OR_bar'] = df['col1'].str.contains('|'.join(words))
#      col1  foo_OR_bar
# 0     foo        True
# 1     bar        True
# 2     baz       False
# 3  foobar        True
# 4     boo       False

# To slice by that Boolean Series
df1 = df.loc[df['col1'].str.contains('|'.join(words))]
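One caveat with the joined pattern: str.contains treats it as a regular expression by default, so a keyword containing characters like . or + would be interpreted as regex syntax. Here is a minimal sketch of one way to guard against that, assuming you want literal, case-insensitive matches and that missing values should count as "no match". The sample data and the names pattern and mask are only for illustration.

```python
import re
import pandas as pd

# Illustrative data: includes a missing value and a keyword with a regex special character
df = pd.DataFrame({'col1': ['foo', 'bar', 'baz', 'foobar', 'boo', None, 'a.b', 'axb']})
words = ['foo', 'a.b']

# re.escape makes each keyword match literally ('.' no longer means "any character")
pattern = '|'.join(re.escape(w) for w in words)

# case=False ignores upper/lower case; na=False turns missing values into False
mask = df['col1'].str.contains(pattern, case=False, na=False)
df1 = df.loc[mask]
```

Without re.escape, the keyword 'a.b' would also match 'axb', which is probably not what you want when the keywords are plain text.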
If your joining logic is AND, you can use np.logical_and.reduce with a list comprehension to keep things compact.
df['foo_AND_bar'] = np.logical_and.reduce([df.col1.str.contains(w) for w in words])
#      col1  foo_OR_bar  foo_AND_bar
# 0     foo        True        False
# 1     bar        True        False
# 2     baz       False        False
# 3  foobar        True         True
# 4     boo       False        False
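If np.logical_and.reduce feels too compact, the same AND logic can be written as a plain loop. This is only a sketch of an equivalent approach, not a replacement for the answer above; the names mask and df_and are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz', 'foobar', 'boo']})
words = ['foo', 'bar']

# Start with every row selected, then keep only rows that contain each keyword in turn
mask = pd.Series(True, index=df.index)
for w in words:
    mask = mask & df['col1'].str.contains(w)

# Slice the DataFrame with the combined Boolean Series
df_and = df.loc[mask]
```

Only 'foobar' contains both keywords, so it is the only row left in df_and.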