Searching for all matches in texts with Pandas

Question

I have a list of particular words ('tokens') and need to find every one of them that appears in a set of plain texts. I would like to use pandas both to load the texts and to perform the search, because my short texts are timestamped and it is easy to organise them in a single pandas data structure.

For example:

Consider a collection of fetched tweets loaded into pandas:

                                              twitts
0                       today is a great day for BMW
1                    prices of german cars increased
2             Japan introduced a new model of Toyota
3  German car makers, such as BMW, Audi and VW mo...

and a list of car makers:

list_of_car_makers = ['BMW', 'Audi','Mercedes','Toyota','Honda', 'VW']

Ideally, I need to get the following data frame:

                                              twitts  cars_mentioned
0                       today is a great day for BMW  [BMW]
1                    prices of german cars increased  []
2             Japan introduced a new model of Toyota  [Toyota]
3  German car makers, such as BMW, Audi and VW mo...  [BMW, Audi, VW]

I'm very new to NLP and text mining, although I have read and searched through a lot of material on the topic. My guess is that I could use a regex with re.findall(), but then I would need to loop over the list of tokens (car makers) for every row of the dataframe.

Is there a more succinct way of doing this simple task, especially with pandas?

Alex · Accepted Answer

You can use the pandas .str methods, particularly .str.findall:

df['cars_mentioned'] = df['twitts'].str.findall('|'.join(list_of_car_makers))
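For reference, here is a self-contained sketch of the one-liner above, run on the sample data from the question. Two things in it are my own assumptions rather than part of the answer: each name is passed through re.escape so that regex metacharacters in a token would be matched literally, and the alternation is wrapped in \b word boundaries so that only whole words match.

```python
import re

import pandas as pd

# Sample data copied from the question (the last tweet is truncated there too).
df = pd.DataFrame({
    "twitts": [
        "today is a great day for BMW",
        "prices of german cars increased",
        "Japan introduced a new model of Toyota",
        "German car makers, such as BMW, Audi and VW mo...",
    ]
})
list_of_car_makers = ["BMW", "Audi", "Mercedes", "Toyota", "Honda", "VW"]

# Assumption: tokens should match literally and only as whole words.
# (?:...) is a non-capturing group, so findall returns the full matches.
pattern = r"\b(?:" + "|".join(map(re.escape, list_of_car_makers)) + r")\b"

df["cars_mentioned"] = df["twitts"].str.findall(pattern)
print(df["cars_mentioned"].tolist())
# [['BMW'], [], ['Toyota'], ['BMW', 'Audi', 'VW']]
```

If case should not matter, Series.str.findall also accepts a flags argument, for example flags=re.IGNORECASE.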

Ghilas BELHADJ · Answer

Use pandas.Series.apply (df['twitts'] is a Series):

df['cars_mentioned'] = df['twitts'].apply(lambda x: [c for c in list_of_car_makers if c in x])
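Note that the membership test and a regex findall do not always return the same list. A small sketch of the difference, using a made-up sentence (the text below is hypothetical, not from the question): the comprehension reports each maker at most once, in the order of list_of_car_makers, while findall reports every occurrence in the order it appears in the text.

```python
import re

list_of_car_makers = ["BMW", "Audi", "Mercedes", "Toyota", "Honda", "VW"]

# Hypothetical tweet with repeated mentions.
text = "Audi beat BMW, then BMW beat Audi"

# Substring membership: one entry per maker, in list order.
by_membership = [c for c in list_of_car_makers if c in text]

# Regex alternation: one entry per occurrence, in text order.
by_findall = re.findall("|".join(list_of_car_makers), text)

# Dropping duplicates while keeping first-seen order.
unique_in_text_order = list(dict.fromkeys(by_findall))

print(by_membership)         # ['BMW', 'Audi']
print(by_findall)            # ['Audi', 'BMW', 'BMW', 'Audi']
print(unique_in_text_order)  # ['Audi', 'BMW']
```

Which one you want depends on whether repeated mentions and their order matter for your analysis.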

Srdjan M. · Answer

You can use re.findall and filter.

list(filter((lambda x: re.findall(x, twitt)), list_of_car_makers))

Python sample:

import re

list_of_car_makers = ['BMW', 'Audi','Mercedes','Toyota','Honda', 'VW']

def cars_mentioned(twitt):
    # Keep each maker whose name occurs at least once in the text.
    return list(filter(lambda x: re.findall(x, twitt), list_of_car_makers))

print(cars_mentioned('German car makers, such as BMW, Audi and VW mo...'))
# ['BMW', 'Audi', 'VW']
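To fill the cars_mentioned column from the question, a helper like this can be passed to Series.apply. This is only a sketch under a couple of assumptions: I swapped re.findall for re.search (an existence check is all the filter needs) and added re.escape for literal matching, and the two-row frame is just a subset of the question's sample data.

```python
import re

import pandas as pd

list_of_car_makers = ["BMW", "Audi", "Mercedes", "Toyota", "Honda", "VW"]


def cars_mentioned(twitt):
    """Return the makers, in list order, whose name occurs in the text."""
    return [
        maker
        for maker in list_of_car_makers
        if re.search(re.escape(maker), twitt)
    ]


# Two rows taken from the question's sample data.
df = pd.DataFrame({
    "twitts": [
        "today is a great day for BMW",
        "prices of german cars increased",
    ]
})

# apply calls the helper once per tweet and collects the returned lists.
df["cars_mentioned"] = df["twitts"].apply(cars_mentioned)
print(df["cars_mentioned"].tolist())
# [['BMW'], []]
```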

Tags: python, regex, pandas, nlp

Arnold Klein
