import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]  # boolean-mask subset of df
# df = dft
dft2 = dft.copy()  # explicit deep copy of the subset
new_data = np.random.rand(5, 10)
# time 100 full-frame assignments into the subset and into the copy
print(timeit.timeit('dft.loc[:, :] = new_data', setup='from __main__ import dft, new_data', number=100))
print(timeit.timeit('dft2.loc[:, :] = new_data', setup='from __main__ import dft2, new_data', number=100))
On my laptop, setting values in dft (the original subset) is about 160 times slower than setting values in dft2 (a deep copy of dft).
Why is this the case?
Edit: Removed speculation about proxy objects.
As c. leather suggests, this is likely because of a different codepath when setting values on a subset taken from another DataFrame (dft) vs. a standalone DataFrame (dft2).
Bonus question: removing the reference to the original DataFrame df (by uncommenting the df = dft line) cuts the speed factor to roughly 2 on my laptop. Any idea why this is the case?
This is not exactly a new question on SO. This and this are related posts. This is the link to the current docs that explain it.
The comments from @c.leather are on the right track. The problem is that dft is a view, not a copy, of the dataframe df, as explained in the linked articles. But pandas cannot know for certain whether it really is a copy, nor whether the operation is safe, so a lot of checks run to ensure the assignment can be performed safely; all of that can be avoided by simply making a copy.
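As a rough illustration of those checks (a minimal sketch, assuming pandas 1.x/2.x without copy-on-write enabled; _is_copy is a private, version-dependent attribute), the boolean-mask subset keeps a weak reference back to its parent frame, while a deep copy does not, and it is the frame carrying that reference that goes through the extra bookkeeping on assignment:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]  # subset that pandas flags as a possible copy of df
dft2 = dft.copy()            # deep copy that owns its data
# Private attribute: a weakref to the parent frame, or None.
# When it is set, .loc assignment runs extra "is this really a copy?" checks.
print(dft._is_copy)   # typically a <weakref ...> pointing at df -> slow path
print(dft2._is_copy)  # typically None -> fast path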
This is a pertinent issue and there is a whole discussion about it on GitHub. I've seen a lot of suggestions; the one I like the most is that the docs should encourage the df[[True, False] * 5].copy() idiom, which one might call the slice & copy idiom.
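A minimal sketch of that idiom (the variable names here are just illustrative): select and copy in one step, so later assignments always target a frame that owns its data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 10))
subset = df[[True, False] * 5].copy()     # slice & copy in one step
subset.loc[:, :] = np.random.rand(5, 10)  # assignment on an independent frame, no copy/view ambiguity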
I could not find the exact checks, and in the GitHub issue this performance nuance is only mentioned in a few tweets that developers posted noting the behavior. Maybe someone more involved in pandas development can add more input.