How to replace values in a Pandas series `s` via a dictionary `d` has been asked and re-asked many times. The recommended method (1, 2, 3, 4) is either `s.replace(d)` or, occasionally, `s.map(d)` if all of the series' values are found in the dictionary keys. However, `s.replace` is often unreasonably slow, sometimes 5-10x slower than a simple list comprehension. The alternative, `s.map(d)`, performs well but is only recommended when every value is covered by the dictionary keys. Why is `s.replace` so slow, and how can performance be improved?
```python
import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 #####
d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)          # 1.98s
%timeit [d[i] for i in lst]         # 134ms

##### TEST 2 #####
d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)          # 20.1ms
%timeit [d.get(i, i) for i in lst]  # 243ms
```
Note: This question is not marked as a duplicate because it is looking for specific advice on when to use different methods given different datasets. This is explicit in the answer and is an aspect not usually addressed in other questions.
One trivial solution is to choose a method dependent on an estimate of how completely values are covered by dictionary keys.

General case

- `df['A'].map(d)` if all values are mapped; or
- `df['A'].map(d).fillna(df['A']).astype(int)` if >5% of values are mapped.

Few (e.g. <5%) values in `d`

- `df['A'].replace(d)`

The "crossover point" of ~5% is specific to the Benchmarking below.

Interestingly, a simple list comprehension generally underperforms `map` in either scenario.
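The rule above can be sketched as a small dispatcher. This is an illustrative helper, not part of the original answer; the function name `remap` and the 5% threshold are assumptions, and the threshold should be re-benchmarked on your own data.

```python
import pandas as pd

def remap(s: pd.Series, d: dict, threshold: float = 0.05) -> pd.Series:
    """Remap values of s via d, picking a strategy from estimated key coverage.

    The 5% crossover is machine- and data-specific; treat it as a starting point.
    """
    # Fraction of the series' values that appear among the dictionary keys.
    coverage = s.isin(d.keys()).mean()
    if coverage == 1.0:
        return s.map(d)                              # every value is mapped
    if coverage > threshold:
        # Map what we can, keep unmapped values, restore the original dtype.
        return s.map(d).fillna(s).astype(s.dtype)
    return s.replace(d)                              # few replacements: replace wins

s = pd.Series([0, 1, 2, 3])
print(remap(s, {0: 100}).tolist())  # [100, 1, 2, 3]
```

The `astype(s.dtype)` cast undoes the upcast to float that `fillna` introduces for integer series; drop it if your mapped values have a different dtype than the original.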
Benchmarking
```python
import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 - Full Map #####
d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)        # 1.98s
%timeit df['A'].map(d)            # 84.3ms
%timeit [d[i] for i in lst]       # 134ms

##### TEST 2 - Partial Map #####
d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int)  # 111ms
%timeit [d.get(i, i) for i in lst]                  # 243ms
```
Explanation
The reason `s.replace` is so slow is that it does much more than simply map a dictionary: it handles edge cases and arguably rare situations, which typically merit more care in any case. This is an excerpt from `replace()` in `pandas\generic.py`:
```python
items = list(compat.iteritems(to_replace))
keys, values = zip(*items)

are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)
```
There appear to be many steps involved: the dictionary is converted to separate lists of keys and values, each value is checked for being itself dict-like (to support nested per-column dictionaries), and `replace` is then called recursively with the expanded arguments, including regex handling.
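To illustrate why that machinery exists (this example is mine, not from the original answer): `replace` supports nested per-column dictionaries and regex substitution, neither of which `map` can do.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': ['cat', 'dog', 'cow']})

# Nested dict: per-column replacement rules in a single call.
out = df.replace({'A': {0: 100}, 'B': {'dog': 'puppy'}})
print(out['A'].tolist())  # [100, 1, 2]
print(out['B'].tolist())  # ['cat', 'puppy', 'cow']

# Regex substitution (re.sub-style), which map cannot do at all.
out2 = df.replace({'B': {r'^c': 'X'}}, regex=True)
print(out2['B'].tolist())  # ['Xat', 'dog', 'Xow']
```

If you only need a flat value-to-value mapping, none of this generality is used, yet you still pay for the dispatch that supports it.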
This can be compared to the much leaner code from `map()` in `pandas\series.py`:
```python
if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
```
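As a quick sanity check (mine, not part of the original answer), the faster `map`-based path produces the same result as `replace` for a flat partial mapping:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.random.randint(0, 1000, 10_000))
d = {i: i + 1 for i in range(10)}  # partial coverage, as in TEST 2

# map leaves unmapped values as NaN; fillna restores them, astype undoes the
# float upcast so the result matches replace's integer output exactly.
fast = s.map(d).fillna(s).astype(int)
slow = s.replace(d)
assert fast.equals(slow)
```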