Search and Replace in pandas dataframe for large dataset

I have a pandas DataFrame with 1 million rows:

Id      description
 1      bc single phase acr
 2      conditioning accum
 3      dsply value ac
and a dictionary with 2927 entries that looks like this:
Key     Value
accum   accumulator
bb      baseboard
dsply   display

I executed the following code to replace each dictionary key found in the DataFrame with its value:

dataset = dataset.replace(dict, regex=True)

However, it takes a long time to execute: about 104 seconds for 2,000 rows on a machine with 8 GB of RAM, and I need to apply it to a million rows. Can anyone tell me how to reduce the execution time? Is there an alternative way to do this task?

asked Feb 20 '18 by Shylashree

1 Answer

I see a ~15% improvement by precompiling the regex.

But for optimal performance, see @unutbu's excellent solution.

import pandas as pd
import re

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# Build a single alternation pattern from all (regex-escaped) dictionary keys,
# compiled once up front rather than on every replacement call
pattern = re.compile("|".join(re.escape(k) for k in rep_dict))

def multiple_replace(string):
    # Look up each match in the dictionary and substitute its value
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)

df = pd.DataFrame({'description': ['bc single phase acr', 'conditioning accum', 'dsply value ac']})
df = pd.concat([df] * 10000)

%timeit df['description'].map(multiple_replace)          # 72.8 ms per loop
%timeit df['description'].replace(rep_dict, regex=True)  # 88.6 ms per loop
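
One caveat: the pattern above matches raw substrings, so with a larger dictionary a short key such as 'ac' would also fire inside 'acr'. If you want whole-word matching, here is a minimal sketch (my own variation, not part of the benchmark above) that anchors the same alternation at word boundaries and prefers the longest key on overlaps:

# Hypothetical whole-word variant; reuses rep_dict and re from above.
# Sorting keys longest-first means overlapping keys (e.g. 'accum' vs a
# shorter prefix) resolve to the longest match; \b restricts hits to
# whole words only.
keys = sorted(rep_dict, key=len, reverse=True)
word_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")

def whole_word_replace(string):
    return word_pattern.sub(lambda m: rep_dict[m.group(0)], string)

df['description'].map(whole_word_replace)

The performance characteristics should be similar, since only the compiled pattern changes.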
answered Oct 04 '22 by jpp