I have a pandas DataFrame with 1 million rows:

Id  description
1   bc single phase acr
2   conditioning accum
3   dsply value ac

and a dictionary with 2,927 entries that looks like this:

Key     Value
accum   accumulator
bb      baseboard
dsply   display
I executed the following code to replace each dictionary key found in the DataFrame with its value:

dataset = dataset.replace(rep_dict, regex=True)
but it takes too long to execute: about 104 seconds for just 2,000 rows on a machine with 8 GB of RAM, and I need to apply it to a million rows. Can anyone tell me how to reduce the execution time? Is there an alternative way to do this task?
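For reference, here is a small runnable sample of my setup (the column and variable names are illustrative):

import pandas as pd

# Illustrative sample mirroring the data above
dataset = pd.DataFrame({'Id': [1, 2, 3],
                        'description': ['bc single phase acr',
                                        'conditioning accum',
                                        'dsply value ac']})
rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# With regex=True, each key is treated as a pattern and replaced by its value
dataset = dataset.replace(rep_dict, regex=True)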
I see a ~15% improvement from precompiling the regex, but for optimal performance see @unutbu's excellent solution.
import pandas as pd
import re

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# Build one alternation pattern from all keys; re.escape guards against
# keys that contain regex metacharacters
pattern = re.compile("|".join(re.escape(k) for k in rep_dict), re.M)

def multiple_replace(string):
    # Look up each match in the dict to find its replacement
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)

df = pd.DataFrame({'description': ['bc single phase acr', 'conditioning accum', 'dsply value ac']})
df = pd.concat([df] * 10000)

%timeit df['description'].map(multiple_replace)          # 72.8 ms per loop
%timeit df['description'].replace(rep_dict, regex=True)  # 88.6 ms per loop
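If whole-word matches are enough, a regex-free token-mapping sketch in the same spirit avoids the pattern machinery entirely and also prevents accidental substring hits such as 'bb' inside a longer word (this is my own illustration, not necessarily @unutbu's exact code; token_replace is a hypothetical helper):

def token_replace(string):
    # Map each whitespace-separated token through the dict; unknown
    # tokens pass through unchanged
    return ' '.join(rep_dict.get(word, word) for word in string.split())

df['description'].map(token_replace)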