import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]  # boolean-mask subset of df
# df = dft
dft2 = dft.copy()  # explicit deep copy of the subset
new_data = np.random.rand(5, 10)
# time 100 full-frame assignments into the subset and into the copy
print(timeit.timeit('dft.loc[:, :] = new_data', setup='from __main__ import dft, new_data', number=100))
print(timeit.timeit('dft2.loc[:, :] = new_data', setup='from __main__ import dft2, new_data', number=100))
On my laptop, setting values in dft (the original subset) is about 160 times slower than setting values in dft2 (a deep copy of dft).
Why is this the case?
Edit: Removed speculation about proxy objects.
As c. leather suggests, this is likely because of a different codepath when setting values on a subset taken from another DataFrame (dft) vs. a standalone DataFrame (dft2).
Bonus question: removing the reference to the original DataFrame df (by uncommenting the df = dft line) cuts the speed factor to roughly 2 on my laptop. Any idea why this is the case?
This is not exactly a new question on SO. This and this are related posts. This is the link to the current docs that explain it.
The comments from @c.leather are on the right track. The problem is that dft is a view, not a copy, of the dataframe df, as explained in the linked articles. But pandas cannot know for certain whether it really is a copy, nor whether the operation is safe, so a lot of checks run to ensure the assignment can be performed safely; all of that can be avoided by simply making a copy.
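As a rough illustration of those checks (a minimal sketch, assuming pandas 1.x/2.x without copy-on-write enabled; _is_copy is a private, version-dependent attribute), the boolean-mask subset keeps a weak reference back to its parent frame, while a deep copy does not, and it is the frame carrying that reference that goes through the extra bookkeeping on assignment:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]  # subset that pandas flags as a possible copy of df
dft2 = dft.copy()            # deep copy that owns its data
# Private attribute: a weakref to the parent frame, or None.
# When it is set, .loc assignment runs extra "is this really a copy?" checks.
print(dft._is_copy)   # typically a <weakref ...> pointing at df -> slow path
print(dft2._is_copy)  # typically None -> fast path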
This is a pertinent issue and there is a whole discussion about it on GitHub. I've seen a lot of suggestions; the one I like the most is that the docs should encourage the df[[True, False] * 5].copy() idiom, which one might call the slice & copy idiom.
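A minimal sketch of that idiom (the variable names here are just illustrative): select and copy in one step, so later assignments always target a frame that owns its data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 10))
subset = df[[True, False] * 5].copy()     # slice & copy in one step
subset.loc[:, :] = np.random.rand(5, 10)  # assignment on an independent frame, no copy/view ambiguity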
I could not find the exact checks, and in the GitHub issue this performance nuance is only mentioned in a few tweets that developers posted noting the behavior. Maybe someone more involved in pandas development can add more input.