
Why does pandas apply calculate twice?

I'm using the apply method on a pandas DataFrame object. When my DataFrame has a single column, it appears that the applied function is called twice. Why does this happen, and can I stop that behavior?

Code:

import pandas as pd

def mul2(x):
    print('hello')
    return 2*x

df = pd.DataFrame({'a': [1,2,0.67,1.34]})
df.apply(mul2)

Output:

hello
hello
0  2.00
1  4.00
2  1.34
3  2.68

I'm printing 'hello' from within the function being applied. I know it's being called twice because 'hello' is printed twice. What's more, if I had two columns, 'hello' would print 3 times. Stranger still, when I call apply on just the column, 'hello' prints 4 times.

Code:

df.a.apply(mul2) 

Output:

hello
hello
hello
hello
0    2.00
1    4.00
2    1.34
3    2.68
Name: a, dtype: float64
asked Feb 07 '14 19:02 by piRSquared



2 Answers

This behavior is intended, as an optimization.

See the docs:

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
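One way to observe this without relying on print is to record each call in a list. The sketch below does that; the `calls` list is a name introduced here for illustration and is not part of pandas. The number of recorded calls depends on the pandas version (I believe releases from 1.1 onward no longer make the extra call), so treat the count as something to inspect rather than a fixed value. The returned values are the same either way.

```python
import pandas as pd

# Illustrative bookkeeping, not a pandas feature: one entry per call to mul2.
calls = []

def mul2(x):
    # Record the name of the column we were handed (None if x has no name).
    calls.append(getattr(x, "name", None))
    return 2 * x

df = pd.DataFrame({'a': [1, 2, 0.67, 1.34]})
result = df.apply(mul2)

# How many times was mul2 actually invoked? This may differ between versions.
print(len(calls))
print(result)
```

If the function really needs a side effect, a safer pattern is probably to keep the applied function pure and perform the side effect once, outside of apply.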

answered Oct 05 '22 23:10 by MERose


Probably related to this issue. With groupby, the applied function is called one extra time to see if certain optimizations can be done. I'd guess something similar is going on here. It doesn't look like there's any way around it at the moment (although I could be wrong about the source of the behavior you're seeing). Is there a reason you need it not to make that extra call?

Also, calling it four times when you apply on the column is normal. When you select one column, you get a Series, not a DataFrame. apply on a Series applies the function to each element. Since your column has four elements, the function is called four times.
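To make the difference concrete, here is a minimal sketch that records what each kind of apply passes to the function. The `seen` list and the `record` helper are illustrative names, not pandas API.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 0.67, 1.34]})

seen = []

def record(x):
    # Remember the type of whatever apply hands us, then do the work.
    seen.append(type(x).__name__)
    return 2 * x

# Series.apply: the function receives one element at a time.
seen.clear()
col_result = df['a'].apply(record)
series_calls = list(seen)

# DataFrame.apply: the function receives a whole column as a Series.
seen.clear()
frame_result = df.apply(record)
frame_calls = list(seen)

print(len(series_calls))  # one call per element of the column
print(frame_calls)        # one entry per call made on the DataFrame
```

So the four calls on the column come from the four elements, not from any extra probing call.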

answered Oct 06 '22 00:10 by BrenBarn