Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to replace part of a string in Pandas series if it contains a word in a list

I have a large dataset all_transcripts with almost 3 million rows. One of the columns msgText contains written messages.

>>> all_transcripts['msgText']

['this is my first message']
['second message is here']
['this is my third message']

Furthermore, I have a list with 200+ words, called gemeentes.

>>> gemeentes
['first','second','third' ... ]

If a word in this list is contained in msgText, I want to replace it by another word. To do so, I created the function:

def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)

So, my desired output would look like:

['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']

Currently, I am looping through the list and for every item in my list, apply the replaceCityfunction.

for plaatsnaam in gemeentes:
    global(plaatsnaam)
    all_transcripts['filtered_text'] = test.msgText.apply(replaceCity)

However, this takes very long, so does not seem to be efficient. Is there a faster way to perform this task?


This post (Algorithm to find multiple string matches) is similar, however my problem is different because:

  • here there is only one big piece small of text, while I have a dataset with many different rows

  • I want to replace words, rather than just finding the words.

like image 599
Emil Avatar asked May 01 '19 09:05

Emil


People also ask

How do I change the values in pandas series based on conditions?

You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.

How can I replace part of a string in pandas DataFrame column?

You can replace substring of pandas DataFrame column by using DataFrame. replace() method. This method by default finds the exact sting match and replaces it with the specified value. Use regex=True to replace substring.

How do you replace part of a string in Python?

In Python, the . replace() method and the re. sub() function are often used to clean up text by removing strings or substrings or replacing them.


1 Answers

Assuming all_transcripts is a pandas DataFrame:

all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')

Example:

all_transcripts = pd.DataFrame([['this is my first message'],
                                ['second message is here'],
                                ['this is my third message']],
                               columns=['msgText'])
gemeentes = ['first','second','third']

all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')

outputs

0    this is my woonplaats message
1       woonplaats message is here
2    this is my woonplaats message
like image 116
Dan Avatar answered Oct 13 '22 00:10

Dan