
Pandas: search list of keywords in the text column and tag it

I have a bag of words stored as a list. For each of these words, I want to check whether the text column of a pandas DataFrame starts with it; only matches at the very beginning of the text should count. I have tried both 'startswith' and 'contains'.

Code:

import pandas as pd
# list of words to search for
searchwords = ['harry','harry potter','secret garden']

# Data
l1 = [1, 2, 3,4,5]
l2 = ['Harry Potter is a great book',
      'Harry Potter is very famous',
      'I enjoyed reading Harry Potter series',
      'LOTR is also a great book along',
      'Have you read Secret Garden as well?'
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()

# Preview df:
    id  text
0   1   harry potter is a great book
1   2   harry potter is very famous
2   3   i enjoyed reading harry potter series
3   4   lotr is also a great book along
4   5   have you read secret garden as well?

Try #1:

This command matches the keywords anywhere in the text column, which is not what I am looking for. I only ran it as a sanity check, to confirm that the basic approach works.
df[df['text'].str.contains('|'.join(searchwords))]

Try #2: When I run this command it returns nothing. Why is that? Am I doing something wrong? Searching for the single string 'harry' works, but passing in the joined list does not.

df[df['text'].str.startswith('harry')] # works with single string.
df[df['text'].str.startswith('|'.join(searchwords))] # returns nothing! 
sharp asked Oct 18 '25 15:10


2 Answers

Use str.startswith with a tuple. A row matches if its text starts with any element of the tuple.

Example:

searchwords = ['harry','harry potter','secret garden']

# Data
l1 = [1, 2, 3,4,5]
l2 = ['Harry Potter is a great book',
      'Harry Potter is very famous',
      'I enjoyed reading Harry Potter series',
      'LOTR is also a great book along',
      'Have you read Secret Garden as well?'
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()

print(df[df['text'].str.startswith(tuple(searchwords))] )

Output:

   id                          text
0   1  harry potter is a great book
1   2   harry potter is very famous
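
To see why Try #2 in the question returns nothing, it may help to compare the two calls side by side. The sketch below reuses the question's data; the names `joined_mask`, `tuple_mask`, and the `starts_with_keyword` column are illustrative assumptions, not part of the original post. It assumes a pandas version whose `str.startswith` accepts a tuple of strings.

```python
import pandas as pd

searchwords = ['harry', 'harry potter', 'secret garden']

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': [
        'Harry Potter is a great book',
        'Harry Potter is very famous',
        'I enjoyed reading Harry Potter series',
        'LOTR is also a great book along',
        'Have you read Secret Garden as well?',
    ],
})
df['text'] = df['text'].str.lower()

# '|'.join builds ONE literal string. startswith does no regex matching,
# so '|' is an ordinary character and no row starts with this whole string.
pattern = '|'.join(searchwords)
joined_mask = df['text'].str.startswith(pattern)

# A tuple asks: "does the text start with ANY of these prefixes?"
tuple_mask = df['text'].str.startswith(tuple(searchwords))

# Tag the rows instead of filtering them (hypothetical column name).
df['starts_with_keyword'] = tuple_mask
print(df)
```

Storing the mask in a column keeps every row and records the result as a tag, which is closer to what the question title asks for than filtering.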
Rakesh answered Oct 20 '25 07:10


Since startswith matches literal strings and does not accept a regex, use str.findall with a pattern anchored at the start of the text:

df[df['text'].str.findall('^(?:'+'|'.join(searchwords) + ')').apply(len) > 0]

Output

   id                          text
0   1  harry potter is a great book
1   2   harry potter is very famous
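
A possible variation on the regex approach, sketched under some assumptions: `str.match` already anchors at the start of each string, `re.escape` guards against keywords that contain regex metacharacters, and `str.extract` can record which keyword matched. The `tag` column and the longest-first ordering are illustrative choices, not part of the original answer.

```python
import re
import pandas as pd

searchwords = ['harry', 'harry potter', 'secret garden']

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': [
        'Harry Potter is a great book',
        'Harry Potter is very famous',
        'I enjoyed reading Harry Potter series',
        'LOTR is also a great book along',
        'Have you read Secret Garden as well?',
    ],
})
df['text'] = df['text'].str.lower()

# Longest keywords first, so 'harry potter' is preferred over 'harry'.
ordered = sorted(searchwords, key=len, reverse=True)
# Escape each keyword so characters such as '.' or '+' are matched literally.
alternation = '|'.join(re.escape(word) for word in ordered)

# str.match only matches at the beginning of each string.
mask = df['text'].str.match('(?:' + alternation + ')')

# Capture the keyword found at the start; rows without a match get NaN.
df['tag'] = df['text'].str.extract('^(' + alternation + ')', expand=False)
print(df[mask])
```

Without the longest-first ordering, the alternation would stop at 'harry' for the first two rows, since regex alternation takes the first alternative that matches.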
iamklaus answered Oct 20 '25 07:10


