Using replace efficiently in pandas

I am looking to use the replace function efficiently in Python 3. The code I have achieves the task but is much too slow, as I am working with a large dataset. Thus, my priority is efficiency over elegance whenever there is a tradeoff. Here is a toy example of what I would like to do:

import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])

       1st  2nd
   0    1    2
   1    3    4
   2    5    6


idxDict = {1: 'a', 3: 'b', 5: 'c'}

for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)

Which gives

     1st  2nd
   0   a    2
   1   b    4
   2   c    6

as I desire, but it takes way too long. What would be the fastest way?

Edit: this is a more focused and clean question than this one, for which the solution is similar.

splinter asked Feb 02 '17 21:02




1 Answer

Use map to perform a lookup:

In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
  1st  2nd
0   a    2
1   b    4
2   c    6

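Note that map returns NaN for any value that has no matching key in the dict. Passing na_action='ignore' does not change this: it only stops NaN values already in the column from being passed to the mapping. To keep the original value where there is no valid key, you can fill the gaps afterwards, for example with df['1st'].map(idxDict).fillna(df['1st']).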
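As a minimal sketch of how map treats values with no matching key, the snippet below uses a hypothetical Series that contains one value (7) missing from the dict. The names s, mapped and kept are illustrative only, and the astype(object) call is a defensive assumption so that the mixed string/integer result is accepted regardless of the dtype map returns:

```python
import pandas as pd

# Hypothetical toy data: 7 has no entry in the lookup dict.
s = pd.Series([1, 3, 5, 7])
idxDict = {1: 'a', 3: 'b', 5: 'c'}

# Plain dict lookup: the unmatched value 7 becomes NaN.
mapped = s.map(idxDict)

# Fall back to the original value wherever the lookup produced NaN.
# astype(object) is an assumption made here so strings and ints can coexist.
kept = mapped.astype(object).fillna(s)

print(mapped.isna().tolist())
print(kept.tolist())
```

Whether the fallback is worth the extra pass depends on the data; if every value is guaranteed to be a key, the plain map call is enough.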

You can also use df['1st'].replace(idxDict), but to answer your question about efficiency:

timings

In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop

In [70]:    
%%timeit
for k,v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)

100 loops, best of 3: 3.25 ms per loop

So using map is about 3x faster than the loop here, and roughly 1.5x faster than passing the dict to replace.

On a larger dataset:

In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape

Out[3]:
(30000, 2)

In [4]:    
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop

In [5]:    
%%timeit
for k,v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)

100 loops, best of 3: 18.2 ms per loop

For a 30K-row df, map is ~4x faster than either replace or the loop, so it scales better.
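If you want to re-run the comparison outside IPython, a rough sketch with the standard timeit module could look like the following. The function names are hypothetical, the loop variant works on a local copy of the column rather than assigning back to df, and the absolute numbers will depend on your machine and pandas version:

```python
import timeit
import pandas as pd

idxDict = {1: 'a', 3: 'b', 5: 'c'}
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
df = pd.concat([df] * 10000, ignore_index=True)  # 30,000 rows, as above

def with_map():
    # Single dict lookup over the whole column
    return df['1st'].map(idxDict)

def with_replace():
    # One replace call with the full dict
    return df['1st'].replace(idxDict)

def with_loop():
    # One replace call per key, on a local column (df is not modified)
    col = df['1st']
    for k, v in idxDict.items():
        col = col.replace(k, v)
    return col

# All three approaches should produce the same column
assert with_map().tolist() == with_replace().tolist() == with_loop().tolist()

for fn in (with_map, with_replace, with_loop):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```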

EdChum answered Sep 22 '22 02:09
