Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is a SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
dictionary[x] = 'Some string ' + str(x)
for x in range(200):
orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is 1000's of times longer than expected to perform <1000 operations.
It looks like replace
has a bit of overhead, and explicitly telling the Series what to do via map
yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get
function. Any keys that are not present will return NaN
via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace
can do incomplete substring matches, while .map
requires complete values to be supplied in the dictionary (or it returns NaNs). The fast but generic solution (that can handle substring) should first use .replace
on a dict of all possible values (obtained e.g. with .value_counts().index
) and then go over all rows of the Series with this dict and .map
. This combo can handle for instance special national characters replacements (full substrings) on 1m-row columns in a quarter of a second, where .replace
alone would take 15.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With