Searching for all matches in texts with Pandas

I have a list of particular words ('tokens') and need to find all of them that are present in plain texts. I prefer to use pandas to load the texts and perform the search, because my collection of short texts is timestamped and it is easy to organise them in a single pandas data structure.

For example:

Consider a collection of fetched tweets loaded into pandas:

                                              twitts
0                       today is a great day for BMW
1                    prices of german cars increased
2             Japan introduced a new model of Toyota
3  German car makers, such as BMW, Audi and VW mo...

and a list of car makers:

list_of_car_makers = ['BMW', 'Audi','Mercedes','Toyota','Honda', 'VW']

Ideally, I need to get the following data frame:

                                              twitts  cars_mentioned
0                       today is a great day for BMW  [BMW]
1                    prices of german cars increased  []
2             Japan introduced a new model of Toyota  [Toyota]
3  German car makers, such as BMW, Audi and VW mo...  [BMW, Audi, VW]

I'm very new to NLP and text mining, and I have read a lot of material on the topic online. My guess is that I can use regex with re.findall(), but then I would need to iterate over the list of tokens (car makers) for every row of the dataframe.

Are there more succinct ways of doing this simple task, especially with pandas?

Arnold Klein asked Feb 03 '18 22:02


3 Answers

You can use the pandas .str methods, particularly .findall:

df['cars_mentioned'] = df['twitts'].str.findall('|'.join(list_of_car_makers))
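A plain '|'.join works for the list in the question, but it would misbehave if a token contained regex metacharacters, and it also matches inside longer words. A minimal sketch of a slightly more defensive variant follows, assuming whole-word, case-sensitive matching is what is wanted; the re.escape call and the \b word boundaries are additions, not part of the one-liner above.

```python
import re

import pandas as pd

# Sample data copied from the question (the last text is truncated there as well).
df = pd.DataFrame({
    "twitts": [
        "today is a great day for BMW",
        "prices of german cars increased",
        "Japan introduced a new model of Toyota",
        "German car makers, such as BMW, Audi and VW mo...",
    ]
})
list_of_car_makers = ["BMW", "Audi", "Mercedes", "Toyota", "Honda", "VW"]

# Escape each token so regex metacharacters are matched literally,
# and wrap the alternation in \b so that only whole words match.
pattern = r"\b(?:" + "|".join(map(re.escape, list_of_car_makers)) + r")\b"

df["cars_mentioned"] = df["twitts"].str.findall(pattern)
print(df["cars_mentioned"].tolist())
# [['BMW'], [], ['Toyota'], ['BMW', 'Audi', 'VW']]
```

Note that findall returns matches in the order they occur in the text and repeats a maker that is mentioned more than once.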
Alex answered Oct 22 '22 00:10


Use pandas.Series.apply (df['twitts'] is a Series):

df['cars_mentioned'] = df['twitts'].apply(lambda x: [c for c in list_of_car_makers if c in x])
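One thing to be aware of is that `c in x` is a substring test, so 'Audi' would also be reported for a text containing 'Audio'. If that matters, a possible variation is to split each text into word tokens first and test membership against those. This is only a sketch: the helper name makers_in and the second sample row are made up for illustration.

```python
import re

import pandas as pd

list_of_car_makers = ["BMW", "Audi", "Mercedes", "Toyota", "Honda", "VW"]


def makers_in(text):
    """Return the makers that appear in `text` as whole words (hypothetical helper)."""
    # Split the text into word tokens, then keep makers found among them.
    words = set(re.findall(r"\w+", text))
    return [c for c in list_of_car_makers if c in words]


df = pd.DataFrame({
    "twitts": [
        "today is a great day for BMW",
        "new Audio system announced",  # made-up row: "Audi" occurs only as a substring
    ]
})
df["cars_mentioned"] = df["twitts"].apply(makers_in)
print(df["cars_mentioned"].tolist())
# [['BMW'], []]
```

The result keeps the order of list_of_car_makers rather than the order in which the names occur in the text.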
Ghilas BELHADJ answered Oct 22 '22 01:10


You can use re.findall and filter.

list(filter((lambda x: re.findall(x, twitt)), list_of_car_makers))

Python sample:

import re

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

def cars_mentioned(twitt):
    # Keep every maker whose name occurs at least once in the text
    return list(filter((lambda x: re.findall(x, twitt)), list_of_car_makers))

print(cars_mentioned('German car makers, such as BMW, Audi and VW mo...'))
# ['BMW', 'Audi', 'VW']
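To get the column asked for in the question, a function like this can be passed to Series.apply. The sketch below is one way to wire it up, not necessarily what the answer intended: it swaps the filter/lambda for a list comprehension, uses re.search because only a yes/no match is needed, and adds re.escape as a precaution.

```python
import re

import pandas as pd

list_of_car_makers = ["BMW", "Audi", "Mercedes", "Toyota", "Honda", "VW"]


def cars_mentioned(twitt):
    # Keep each maker whose (escaped) name is found somewhere in the text.
    return [m for m in list_of_car_makers if re.search(re.escape(m), twitt)]


# Sample data copied from the question (the last text is truncated there as well).
df = pd.DataFrame({
    "twitts": [
        "today is a great day for BMW",
        "prices of german cars increased",
        "Japan introduced a new model of Toyota",
        "German car makers, such as BMW, Audi and VW mo...",
    ]
})

# Run the function once per row and store the resulting lists in a new column.
df["cars_mentioned"] = df["twitts"].apply(cars_mentioned)
print(df["cars_mentioned"].tolist())
# [['BMW'], [], ['Toyota'], ['BMW', 'Audi', 'VW']]
```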
Srdjan M. answered Oct 22 '22 00:10