Replace values in a pandas series via dictionary efficiently

How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times.

The recommended method (1, 2, 3, 4) is to either use s.replace(d) or, occasionally, use s.map(d) if all your series values are found in the dictionary keys.

However, s.replace is often unreasonably slow, sometimes 5-10x slower than a simple list comprehension.

The alternative, s.map(d), has good performance, but is only recommended when all series values are found in the dictionary keys.

Why is s.replace so slow and how can performance be improved?

    import pandas as pd, numpy as np

    df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
    lst = df['A'].values.tolist()

    ##### TEST 1 #####

    d = {i: i+1 for i in range(1000)}

    %timeit df['A'].replace(d)                          # 1.98s
    %timeit [d[i] for i in lst]                         # 134ms

    ##### TEST 2 #####

    d = {i: i+1 for i in range(10)}

    %timeit df['A'].replace(d)                          # 20.1ms
    %timeit [d.get(i, i) for i in lst]                  # 243ms

Note: This question is not marked as a duplicate because it is looking for specific advice on when to use different methods given different datasets. This is explicit in the answer and is an aspect not usually addressed in other questions.

asked Mar 13 '18 15:03

jpp

1 Answer

One trivial solution is to choose a method dependent on an estimate of how completely values are covered by dictionary keys.

General case

Use df['A'].map(d) if all values are mapped; or
Use df['A'].map(d).fillna(df['A']).astype(int) if more than ~5% of values are mapped.

Few values (e.g. < 5%) found in d

Use df['A'].replace(d)

The "crossover point" of ~5% is specific to the benchmarking below and may differ for other data.

Interestingly, a simple list comprehension generally underperforms map in either scenario.
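
As a rough illustration of the selection rule above, the sketch below estimates coverage on a sample and then picks a method. The helper name replace_via_dict, its threshold and sample_size parameters, and the sampling step are all assumptions made for this example, not part of pandas. It also assumes the dictionary values share the dtype of the series and that neither contains NaN.

```python
import pandas as pd


def replace_via_dict(s, d, threshold=0.05, sample_size=10_000, seed=0):
    """Hypothetical helper: choose map or replace from estimated coverage."""
    # Estimate the share of series values that appear among the dictionary keys.
    if len(s) > sample_size:
        sample = s.sample(sample_size, random_state=seed)
    else:
        sample = s
    coverage = sample.isin(list(d)).mean()

    if coverage < threshold:
        # Few values are affected: replace did well in this regime.
        return s.replace(d)

    mapped = s.map(d)
    if not mapped.isna().any():
        # Every value was found in the dictionary keys.
        return mapped

    # Unmapped values came back as NaN; restore them from the original
    # series and cast back (assumes d's values fit the dtype of s).
    return mapped.fillna(s).astype(s.dtype)
```

The 5% default simply mirrors the crossover seen in the benchmark, so it is worth re-timing on your own data before relying on it.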

Benchmarking

    import pandas as pd, numpy as np

    df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
    lst = df['A'].values.tolist()

    ##### TEST 1 - Full Map #####

    d = {i: i+1 for i in range(1000)}

    %timeit df['A'].replace(d)                          # 1.98s
    %timeit df['A'].map(d)                              # 84.3ms
    %timeit [d[i] for i in lst]                         # 134ms

    ##### TEST 2 - Partial Map #####

    d = {i: i+1 for i in range(10)}

    %timeit df['A'].replace(d)                          # 20.1ms
    %timeit df['A'].map(d).fillna(df['A']).astype(int)  # 111ms
    %timeit [d.get(i, i) for i in lst]                  # 243ms

Explanation

The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case.

This is an excerpt from replace() in pandas\generic.py.

    items = list(compat.iteritems(to_replace))
    keys, values = zip(*items)
    are_mappings = [is_dict_like(v) for v in values]

    if any(are_mappings):
        # handling of nested dictionaries
    else:
        to_replace, value = keys, values

    return self.replace(to_replace, value, inplace=inplace,
                        limit=limit, regex=regex)

There appear to be many steps involved:

Converting dictionary to a list.
Iterating through list and checking for nested dictionaries.
Feeding an iterator of keys and values into a replace function.
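
The steps listed above can be mimicked from user code to see what the dictionary form reduces to. This is only a sketch using the public API: the variable names are made up for the example, and isinstance(v, dict) stands in for the internal is_dict_like check.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
d = {1: 10, 3: 30}

# Step 1: convert the dictionary to a list of (key, value) pairs.
items = list(d.items())
keys, values = zip(*items)

# Step 2: scan the values for nested dictionaries.
has_nested = any(isinstance(v, dict) for v in values)

# Step 3: with no nesting, the call is forwarded as separate
# to_replace / value sequences.
via_dict = s.replace(d)
via_lists = s.replace(list(keys), list(values))
```

Both calls should give the same result here, which suggests the dictionary form adds preprocessing on top of the list-based replace rather than taking a faster lookup path.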

This can be compared to much leaner code from map() in pandas\series.py:

    if isinstance(arg, (dict, Series)):
        if isinstance(arg, dict):
            arg = self._constructor(arg, index=arg.keys())

        indexer = arg.index.get_indexer(values)
        new_values = algos.take_1d(arg._values, indexer)
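
The same lookup idea can be sketched with public APIs only. The function name map_with_indexer is hypothetical, and NumPy indexing is swapped in for the internal algos.take_1d. Unlike Series.map, this version keeps unmapped values as they are instead of returning NaN, and it assumes the dictionary values fit the dtype of the input array.

```python
import numpy as np
import pandas as pd


def map_with_indexer(values, d):
    """Hypothetical sketch: dictionary lookup via Index.get_indexer."""
    keys = pd.Index(list(d.keys()))
    lookup = np.array(list(d.values()))

    # Position of each value among the keys; -1 marks "not found".
    indexer = keys.get_indexer(values)
    found = indexer >= 0

    # Copy the input, then overwrite only the positions that were found.
    out = np.asarray(values).copy()
    out[found] = lookup[indexer[found]]
    return out
```

For a series s, this could be called as map_with_indexer(s.values, d) and wrapped back into a Series with the original index.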

answered Nov 03 '22 18:11

jpp
