I have a pandas DataFrame with 1 million rows:

Id  description
1   bc single phase acr
2   conditioning accum
3   dsply value ac

and a dictionary with 2,927 entries that looks like this:

Key     Value
accum   accumulator
bb      baseboard
dsply   display
I executed the following code to replace each dictionary key found in the DataFrame with its value:

dataset = dataset.replace(rep_dict, regex=True)
but it takes too long to execute: about 104 seconds for just 2,000 rows on a machine with 8 GB of RAM, and I need to apply it to a million rows. Can anyone tell me how to reduce the execution time? Is there an alternative way to do this task?
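For reference, here is a small runnable sample of my setup (the column and variable names are illustrative):

import pandas as pd

# Illustrative sample mirroring the data above
dataset = pd.DataFrame({'Id': [1, 2, 3],
                        'description': ['bc single phase acr',
                                        'conditioning accum',
                                        'dsply value ac']})
rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# With regex=True, each key is treated as a pattern and replaced by its value
dataset = dataset.replace(rep_dict, regex=True)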
I see a ~15% improvement from precompiling the regex, but for optimal performance see @unutbu's excellent solution.
import pandas as pd
import re

rep_dict = {'accum': 'accumulator', 'bb': 'baseboard', 'dsply': 'display'}

# Build one alternation pattern from all keys; re.escape guards against
# keys that contain regex metacharacters
pattern = re.compile("|".join(re.escape(k) for k in rep_dict), re.M)

def multiple_replace(string):
    # Look up each match in the dict to find its replacement
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)

df = pd.DataFrame({'description': ['bc single phase acr', 'conditioning accum', 'dsply value ac']})
df = pd.concat([df] * 10000)

%timeit df['description'].map(multiple_replace)          # 72.8 ms per loop
%timeit df['description'].replace(rep_dict, regex=True)  # 88.6 ms per loop
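If whole-word matches are enough, a regex-free token-mapping sketch in the same spirit avoids the pattern machinery entirely and also prevents accidental substring hits such as 'bb' inside a longer word (this is my own illustration, not necessarily @unutbu's exact code; token_replace is a hypothetical helper):

def token_replace(string):
    # Map each whitespace-separated token through the dict; unknown
    # tokens pass through unchanged
    return ' '.join(rep_dict.get(word, word) for word in string.split())

df['description'].map(token_replace)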