
Setting values on Pandas DataFrame subset (copy) is slow

import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 10))

dft = df[[True, False] * 5]
# df = dft
dft2 = dft.copy()

new_data = np.random.rand(5, 10)

print(timeit.timeit('dft.loc[:, :] = new_data', setup='from __main__ import dft, new_data', number=100))
print(timeit.timeit('dft2.loc[:, :] = new_data', setup='from __main__ import dft2, new_data', number=100))

On my laptop, setting values in dft (the original subset) is about 160 times slower than setting values in dft2 (a deep copy of dft).

Why is this the case?

Edit: Removed speculation about proxy objects.

As c. leather suggests, this is likely because pandas takes a different code path when setting values on a subset of another DataFrame (dft) than on a standalone DataFrame (dft2).

Bonus question: removing the reference to the original DataFrame df (by uncommenting the df = dft line) cuts the slowdown factor to roughly 2 on my laptop. Any idea why this is the case?

Alex, asked Jul 01 '16 23:07




1 Answer

This is not exactly a new question on SO: there are related posts, and the current pandas docs explain the behavior.

The comments from @c.leather are on the right track. The problem is that dft is a subset derived from df, as explained in the linked articles, and pandas cannot know for certain whether it is a view or a copy, or whether the operation is safe. As a result, a lot of checks run to ensure that the assignment is safe to perform, and they can be avoided by simply making an explicit copy.

This is a pertinent issue, and there is a whole discussion about it on GitHub. I've seen a lot of suggestions; the one I like the most is that the docs should encourage the df[[True, False] * 5].copy() idiom, which one may call the slice & copy idiom.
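As a minimal sketch of the slice & copy idiom, reusing the shapes from the question (the names mask, subset, and original_before are only illustrative and do not come from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
mask = [True, False] * 5  # keep every other row, as in the question

# Slice and copy in one step, so the result is an independent DataFrame
# that pandas does not need to relate back to df on assignment.
subset = df[mask].copy()

new_data = np.random.rand(5, 10)
original_before = df.copy()  # snapshot used only to show df stays unchanged

# Overwrite every value in the subset.
subset.loc[:, :] = new_data

print(np.allclose(subset.to_numpy(), new_data))  # subset holds the new values
print(df.equals(original_before))                # df itself was not modified
```

How much time this saves depends on the pandas version, so it is worth re-running the timeit comparison from the question on your own setup.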

I could not find the exact checks, and in the GitHub issue this performance nuance is mentioned only through a few tweets that some developers posted about the behavior. Maybe someone more involved in pandas development can add more input.

rll, answered Sep 29 '22 05:09