I am looking to use the replace function efficiently in Python 3. My code achieves the task, but it is much too slow, as I am working with a large dataset. My priority is therefore efficiency over elegance whenever there is a tradeoff. Here is a toy example of what I would like to do:
import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])
1st 2nd
0 1 2
1 3 4
2 5 6
idxDict = dict()
idxDict[1] = 'a'
idxDict[3] = 'b'
idxDict[5] = 'c'
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
Which gives
1st 2nd
0 a 2
1 b 4
2 c 6
as I desire, but it takes way too long. What would be the fastest way?
Edit: this is a more focused and clean question than this one, for which the solution is similar.
Pandas Replace: The Faster and Better Approach to Change Values of a Column. Replacing values on a dataframe can sometimes be very tricky. Bulk replacement in a large dataset could be difficult and slow. Yet, Pandas is flexible enough to do it better.
Pandas DataFrame replace() method: the replace() method replaces the specified value with another specified value. It searches the entire DataFrame and replaces every occurrence of the specified value.
You can replace values in all or selected columns of a pandas DataFrame based on a condition by using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or a boolean array, and it can also be used to modify the values it selects.
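As a minimal sketch of conditional replacement with loc[], the example below reuses the toy frame from the question. The condition (`'2nd' > 3`) and the replacement value `0` are arbitrary choices made for illustration and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])

# Boolean mask selecting the rows to change (hypothetical condition).
mask = df['2nd'] > 3

# Assign a new value to column '1st' only on the rows where the mask is True.
df.loc[mask, '1st'] = 0

print(df)
```

This suits replacements driven by a condition. For a value-to-value lookup like the one in the question, a dict-based approach is a more direct fit.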
Use map to perform a lookup:
In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
1st 2nd
0 a 2
1 b 4
2 c 6
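Note that map returns NaN for any value that has no key in the dict. Passing na_action='ignore' does not change this: it only skips values that are already NaN instead of passing them to the mapping.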
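To illustrate the missing-key case, here is a small sketch. The series `s` and the unmapped value `7` are made up for the example. It shows that a dict passed to map turns unmatched values into NaN, and one possible way to keep the original value instead, by mapping with a function that falls back to the input.

```python
import pandas as pd

idxDict = {1: 'a', 3: 'b', 5: 'c'}
s = pd.Series([1, 3, 7], name='1st')  # 7 has no key in idxDict

# Dict lookup: the unmatched value becomes NaN.
mapped = s.map(idxDict)

# Fallback: dict.get returns the original value when the key is missing.
kept = s.map(lambda x: idxDict.get(x, x))

print(mapped)
print(kept)
```

The function-based fallback is called once per element, so it is likely slower than the plain dict lookup on large data. It is worth using only when unmatched values must survive.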
You can also use df['1st'].replace(idxDict), but to answer your question about efficiency, here are some timings:
In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)
1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop
In [70]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
100 loops, best of 3: 3.25 ms per loop
So using map is about 3x faster than the loop here (1.08 ms vs 3.25 ms).
On a larger dataset:
In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape
Out[3]:
(30000, 2)
In [4]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)
100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop
In [5]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
100 loops, best of 3: 18.2 ms per loop
For the 30K-row df, map is ~4x faster (4.31 ms vs about 18 ms), so it scales better than replace or looping.
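For anyone who wants to repeat the comparison outside IPython, the following is a rough sketch using the standard-library timeit module. The helper names (`with_map`, `with_replace`, `with_loop`) are made up for this example, and the absolute numbers will depend on your machine and pandas version, so none are shown.

```python
import timeit

import pandas as pd

idxDict = {1: 'a', 3: 'b', 5: 'c'}
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
df = pd.concat([df] * 10000, ignore_index=True)  # 30000 rows, as above


def with_map():
    # Single vectorized dict lookup.
    return df['1st'].map(idxDict)


def with_replace():
    # Single replace call with the whole dict.
    return df['1st'].replace(idxDict)


def with_loop():
    # One replace call per key, as in the question.
    col = df['1st']
    for k, v in idxDict.items():
        col = col.replace(k, v)
    return col


for fn in (with_map, with_replace, with_loop):
    # Best of 3 runs of 10 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Each helper returns a new Series rather than writing back to df, so every run starts from the same integer column.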