I have a large dataset all_transcripts
with almost 3 million rows. One of the columns msgText
contains written messages.
>>> all_transcripts['msgText']
['this is my first message']
['second message is here']
['this is my third message']
Furthermore, I have a list with 200+ words, called gemeentes
.
>>> gemeentes
['first','second','third' ... ]
If a word in this list is contained in msgText
, I want to replace it by another word. To do so, I created the function:
def replaceCity(text):
newText = text.replace(plaatsnaam, 'woonplaats')
return str(newText)
So, my desired output would look like:
['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']
Currently, I am looping through the list and for every item in my list, apply the replaceCity
function.
for plaatsnaam in gemeentes:
global(plaatsnaam)
all_transcripts['filtered_text'] = test.msgText.apply(replaceCity)
However, this takes very long, so does not seem to be efficient. Is there a faster way to perform this task?
This post (Algorithm to find multiple string matches) is similar, however my problem is different because:
here there is only one big piece small of text, while I have a dataset with many different rows
I want to replace words, rather than just finding the words.
You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.
You can replace substring of pandas DataFrame column by using DataFrame. replace() method. This method by default finds the exact sting match and replaces it with the specified value. Use regex=True to replace substring.
In Python, the . replace() method and the re. sub() function are often used to clean up text by removing strings or substrings or replacing them.
Assuming all_transcripts
is a pandas DataFrame
:
all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')
Example:
all_transcripts = pd.DataFrame([['this is my first message'],
['second message is here'],
['this is my third message']],
columns=['msgText'])
gemeentes = ['first','second','third']
all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')
outputs
0 this is my woonplaats message
1 woonplaats message is here
2 this is my woonplaats message
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With