Using replace efficiently in pandas

I am looking to use the replace function efficiently in Python 3. My code achieves the task but is much too slow, as I am working with a large dataset. My priority is therefore efficiency over elegance whenever there is a tradeoff. Here is a toy example of what I would like to do:

import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])

       1st  2nd
   0    1    2
   1    3    4
   2    5    6


idxDict = {1: 'a', 3: 'b', 5: 'c'}

for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)

Which gives

  1st  2nd
0   a    2
1   b    4
2   c    6

as I desire, but it takes way too long. What would be the fastest way?

Edit: this is a more focused and clean question than this one, for which the solution is similar.


asked Feb 02 '17 21:02

splinter

1 Answer

Use map to perform a lookup:

In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
  1st  2nd
0   a    2
1   b    4
2   c    6

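Note that map returns NaN for any value with no matching key in the dict. Passing na_action='ignore' does not change that; it only leaves existing NaN values alone instead of passing them to the mapping.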
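As a minimal sketch of how map treats values that have no key in the dict, the snippet below compares plain map, map with na_action='ignore', a fillna fallback, and replace. The extra value 7 and the names s, mapped, mapped_ignore, kept, and replaced are illustrative only and are not from the question.

```python
import pandas as pd

idxDict = {1: 'a', 3: 'b', 5: 'c'}
s = pd.Series([1, 3, 5, 7])  # assumption: 7 has no key in idxDict

# map looks every value up in the dict; values without a key become NaN.
mapped = s.map(idxDict)

# na_action='ignore' only skips values that are already NaN,
# so the unmatched 7 still comes back as NaN here.
mapped_ignore = s.map(idxDict, na_action='ignore')

# One possible fallback: fill the gaps from the original values.
# astype(object) lets strings and integers share one column.
kept = s.map(idxDict).astype(object).fillna(s)

# replace, by contrast, leaves values without a key untouched.
replaced = s.replace(idxDict)

print(mapped.tolist())
print(kept.tolist())
print(replaced.tolist())
```

If unmatched values can occur in the real data, the fillna step adds some cost, so it is worth timing map plus fillna against replace on a representative sample.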

You can also use df['1st'].replace(idxDict), but to answer your question about efficiency:

Timings

In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop

In [70]:    
%%timeit
for k,v in idxDict.items():
    df ['1st'] = df ['1st'].replace(k, v)

100 loops, best of 3: 3.25 ms per loop

So map is about 3x faster than the loop here, and roughly 1.5x faster than a single replace with the dict.

On a larger dataset:

In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape

Out[3]:
(30000, 2)

In [4]:    
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop

In [5]:    
%%timeit
for k,v in idxDict.items():
    df ['1st'] = df ['1st'].replace(k, v)

100 loops, best of 3: 18.2 ms per loop

For a 30K-row df, map is ~4x faster, so it scales better than replace or looping.
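The comparison above can be reproduced outside IPython with the standard-library timeit module. This is only a sketch: the helper names with_map, with_replace, and with_loop are made up for illustration, and absolute timings will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

idxDict = {1: 'a', 3: 'b', 5: 'c'}
base = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
df = pd.concat([base] * 10000, ignore_index=True)  # 30,000 rows


def with_map():
    # Single dict lookup per value.
    return df['1st'].map(idxDict)


def with_replace():
    # One replace call with the whole dict.
    return df['1st'].replace(idxDict)


def with_loop():
    # One replace call per key, as in the question (df itself is not modified).
    col = df['1st']
    for k, v in idxDict.items():
        col = col.replace(k, v)
    return col


for fn in (with_map, with_replace, with_loop):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

When every value has a key in the dict, as here, all three functions return the same values, so only the run time differs.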


answered Sep 22 '22 02:09

EdChum

Tags: python, indexing, pandas, dataframe, series
