Matching keywords (strings) with a Pandas Dataframe

I have a DataFrame that I want to match against some keywords, that is, I want to find the rows that contain those keywords. I managed to get the job done the way shown below, but I wonder if there's a better way to do it, given that I might have 10 or 20 different keywords.

df1 = df[df['column1'].str.contains("keyword1") | df['column1'].str.contains('keyword2')]

(I'm a beginner, so please keep it as simple as possible.)

Zumplo asked Mar 02 '23 12:03


1 Answer

For OR logic, you can build a single regular-expression pattern by joining the words with |, since str.contains treats its argument as a regex by default. Store your 10-20 words in a list, then use '|'.join(that_list).

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz', 'foobar', 'boo']})
words = ['foo', 'bar']

df['foo_OR_bar'] = df['col1'].str.contains('|'.join(words))

#     col1  foo_OR_bar
#0     foo        True
#1     bar        True
#2     baz       False
#3  foobar        True
#4     boo       False

# To slice the DataFrame with that Boolean Series
df1 = df.loc[df['col1'].str.contains('|'.join(words))]
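
Because the joined string is used as a regex, keywords containing characters such as . or + will not be matched literally, and missing values produce NaN instead of False. The sketch below shows one way to guard against both. The sample data and keywords are hypothetical and not from the question.

```python
import re
import pandas as pd

# Hypothetical sample data, including a missing value
df = pd.DataFrame({'col1': ['foo', 'a.b', 'axb', None, 'C++ tips']})
words = ['a.b', 'c++']  # hypothetical keywords with regex metacharacters

# Escape each keyword so it is matched literally, then join with | for OR logic
pattern = '|'.join(re.escape(w) for w in words)

# case=False ignores case; na=False counts missing values as non-matches
mask = df['col1'].str.contains(pattern, case=False, na=False)
df1 = df.loc[mask]
```

Here 'axb' is not matched because the dot in 'a.b' is escaped, and the None row is simply excluded rather than causing an error when slicing.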

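If you need AND logic instead (every keyword must appear in the row), you can use np.logical_and.reduce with a list comprehension to keep things compact. It builds one Boolean Series per word and combines them element-wise.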

df['foo_AND_bar'] = np.logical_and.reduce([df.col1.str.contains(w) for w in words])

#     col1  foo_OR_bar  foo_AND_bar
#0     foo        True        False
#1     bar        True        False
#2     baz       False        False
#3  foobar        True         True
#4     boo       False        False
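
If you need both behaviours in several places, the two approaches can be wrapped in one small function. This is only a sketch: contains_keywords and its require_all parameter are hypothetical names, not part of pandas.

```python
import re
import numpy as np
import pandas as pd

def contains_keywords(series, words, require_all=False):
    """Hypothetical helper: True where a row contains any (or all) of the words."""
    # One Boolean Series per keyword; re.escape makes each keyword literal
    checks = [series.str.contains(re.escape(w), na=False) for w in words]
    # Combine element-wise with AND or OR across all keywords
    combine = np.logical_and if require_all else np.logical_or
    return pd.Series(combine.reduce(checks), index=series.index)

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz', 'foobar', 'boo']})
words = ['foo', 'bar']

any_mask = contains_keywords(df['col1'], words)                    # OR logic
all_mask = contains_keywords(df['col1'], words, require_all=True)  # AND logic
df_any = df.loc[any_mask]
```

The result is wrapped back into a Series with the original index so it can be used directly for slicing with df.loc.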

ALollz answered Mar 05 '23 09:03