Fastest way to replace part of a string in Pandas series if it contains a word in a list

Tags: python, replace, list, pandas

I have a large dataset, all_transcripts, with almost 3 million rows. One of its columns, msgText, contains written messages.

>>> all_transcripts['msgText']

['this is my first message']
['second message is here']
['this is my third message']

I also have a list of 200+ words, called gemeentes.

>>> gemeentes
['first','second','third' ... ]

If a word from this list occurs in msgText, I want to replace it with another word. To do so, I created the function:

def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)

So, my desired output would look like:

['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']

Currently, I loop through the list and, for every item in it, apply the replaceCity function.

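all_transcripts['filtered_text'] = all_transcripts['msgText']
for plaatsnaam in gemeentes:
    # replaceCity reads the module-level loop variable plaatsnaam
    all_transcripts['filtered_text'] = all_transcripts['filtered_text'].apply(replaceCity)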

However, this takes a very long time, so it does not seem efficient. Is there a faster way to perform this task?

This post (Algorithm to find multiple string matches) is similar, but my problem is different because:

there, there is only one big piece of text, while I have a dataset with many different rows
I want to replace the words, rather than just find them.

599

asked May 01 '19 09:05

Emil

1 Answer

Assuming all_transcripts is a pandas DataFrame:

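# regex=True is needed: since pandas 2.0, str.replace treats the pattern as a literal string by default
all_transcripts['msgText'].str.replace('|'.join(gemeentes), 'woonplaats', regex=True)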
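To see why this helps: joining the words with | builds a single regular-expression alternation, so each message is scanned once instead of once per word in the list. Here is a minimal sketch of the same idea on one plain string, using only the standard re module (the sample sentence is made up for illustration):

```python
import re

gemeentes = ['first', 'second', 'third']

# Joining with '|' yields one alternation pattern: 'first|second|third'
pattern = re.compile('|'.join(gemeentes))

# Hypothetical sample sentence, not taken from the real dataset
text = 'this is my first message, not the second'

# A single pass over the text replaces every word from the list
replaced = pattern.sub('woonplaats', text)
print(replaced)  # this is my woonplaats message, not the woonplaats
```

The pandas call applies this same substitution to every row of the column.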

Example:

import pandas as pd

all_transcripts = pd.DataFrame([['this is my first message'],
                                ['second message is here'],
                                ['this is my third message']],
                               columns=['msgText'])
gemeentes = ['first','second','third']

# regex=True so that '|' is treated as alternation, not as a literal character
all_transcripts['msgText'].str.replace('|'.join(gemeentes), 'woonplaats', regex=True)

outputs

0    this is my woonplaats message
1       woonplaats message is here
2    this is my woonplaats message
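One caveat with a plain '|'.join: it also matches inside longer words (e.g. "first" inside "firstly"), and any regex metacharacters in the list would be interpreted as pattern syntax. A possible refinement, offered as a sketch and not as part of the original answer, is to escape each word and wrap the alternation in word boundaries. The sample rows and the filtered_text column below are illustrative assumptions:

```python
import re
import pandas as pd

# Hypothetical sample data for illustration
all_transcripts = pd.DataFrame({'msgText': ['this is my first message',
                                            'firstly, a second (third) note',
                                            'nothing to replace here']})
gemeentes = ['first', 'second', 'third']

# Escape each word so regex metacharacters are matched literally, and put
# longer entries first so a multi-word name is tried before a shorter one
# that is a prefix of it.
escaped = sorted((re.escape(w) for w in gemeentes), key=len, reverse=True)

# \b ... \b restricts matches to whole words
pattern = r'\b(?:' + '|'.join(escaped) + r')\b'

all_transcripts['filtered_text'] = all_transcripts['msgText'].str.replace(
    pattern, 'woonplaats', regex=True)
print(all_transcripts['filtered_text'].tolist())
```

With this pattern "firstly" is left alone, while "second" and "(third)" are still replaced.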

116

answered Oct 13 '22 00:10

Dan
