I have a list of particular words ('tokens') and need to find all of them (if any are present) in plain texts. I prefer using pandas to load the text and perform the search, because my collection of short texts is timestamped and it is easy to organise these short texts in a single pandas data structure.
For example, consider a collection of fetched tweets loaded into pandas:
   twitts
0  today is a great day for BMW
1  prices of german cars increased
2  Japan introduced a new model of Toyota
3  German car makers, such as BMW, Audi and VW mo...
and a list of car makers:
list_of_car_makers = ['BMW', 'Audi','Mercedes','Toyota','Honda', 'VW']
Ideally, I need to get the following data frame:
   twitts                                             cars_mentioned
0  today is a great day for BMW                       [BMW]
1  prices of german cars increased                    []
2  Japan introduced a new model of Toyota             [Toyota]
3  German car makers, such as BMW, Audi and VW mo...  [BMW, Audi, VW]
I'm very new to NLP and text mining methods, and I have read a lot of material on the topic online. My guess is that I can use regex with re.findall(), but then I would need to iterate over the list of tokens (car makers) for the entire dataframe.
Are there more succinct ways of doing this simple task, especially with pandas?
You can use the pandas .str methods, particularly .findall:
df['cars_mentioned'] = df['twitts'].str.findall('|'.join(list_of_car_makers))
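One caveat with a bare '|'.join(...) is that the names are read as regex syntax and can also match inside longer words. Here is a minimal sketch, assuming you only want whole-word matches. The sample frame is rebuilt from the question, and `pattern` is just a name I've picked for illustration:

```python
import re
import pandas as pd

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

# Sample data copied from the question
df = pd.DataFrame({'twitts': [
    'today is a great day for BMW',
    'prices of german cars increased',
    'Japan introduced a new model of Toyota',
    'German car makers, such as BMW, Audi and VW mo...',
]})

# Escape each name so it is matched literally, and wrap the alternation
# in word boundaries so a name is not matched inside a longer word.
pattern = r'\b(?:' + '|'.join(map(re.escape, list_of_car_makers)) + r')\b'

df['cars_mentioned'] = df['twitts'].str.findall(pattern)
print(df)
```

If you need case-insensitive matching, .str.findall also accepts a flags argument, e.g. flags=re.IGNORECASE.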
Use pandas.Series.apply (df['twitts'] is a Series, so this is the apply that gets called):
df['cars_mentioned'] = df['twitts'].apply(lambda x: [c for c in list_of_car_makers if c in x])
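Note that this is a plain substring test, so it behaves a little differently from .str.findall: each maker appears at most once, and in the order of the list rather than the order of the text. A small sketch of the difference, using a made-up tweet (the text in `s` is hypothetical, not from the question):

```python
import pandas as pd

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

# Hypothetical tweet that mentions one maker twice
s = pd.Series(['Audi beats BMW, says BMW owner'])

# Membership test: one entry per maker, ordered as in list_of_car_makers
by_membership = s.apply(lambda x: [c for c in list_of_car_makers if c in x])

# Regex findall: one entry per occurrence, ordered as in the text
by_findall = s.str.findall('|'.join(list_of_car_makers))

print(by_membership[0])  # ['BMW', 'Audi']
print(by_findall[0])     # ['Audi', 'BMW', 'BMW']
```

Which one you want depends on whether repeated mentions matter for your analysis.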
You can use re.findall and filter:
list(filter((lambda x: re.findall(x, twitt)), list_of_car_makers))
Python sample:
import re

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

def cars_mentioned(twitt):
    # Keep each maker whose name is found somewhere in the tweet
    return list(filter(lambda x: re.findall(x, twitt), list_of_car_makers))

cars_mentioned('German car makers, such as BMW, Audi and VW mo...')  # ['BMW', 'Audi', 'VW']
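To get the column the question asks for, the function can be passed to apply on the text column. A rough sketch, assuming the frame from the question. I've swapped filter/re.findall for a list comprehension with re.search and re.escape, which is my own variation, not part of the answer above:

```python
import re
import pandas as pd

list_of_car_makers = ['BMW', 'Audi', 'Mercedes', 'Toyota', 'Honda', 'VW']

def cars_mentioned(twitt):
    # re.search stops at the first hit, which is enough for a yes/no test;
    # re.escape makes sure each name is matched literally.
    return [c for c in list_of_car_makers if re.search(re.escape(c), twitt)]

# Sample data copied from the question
df = pd.DataFrame({'twitts': [
    'today is a great day for BMW',
    'prices of german cars increased',
    'Japan introduced a new model of Toyota',
    'German car makers, such as BMW, Audi and VW mo...',
]})

# Run the function once per row of the text column
df['cars_mentioned'] = df['twitts'].apply(cars_mentioned)
print(df)
```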