I am trying to use apply to avoid an iterrows() iterator in a function.
However, that pandas method is poorly documented and I can't find examples of how to use it, except for the trivial .apply(np.sqrt) in the documentation. There is no example of how to pass arguments, etc.
Anyway, here is a toy example of what I am trying to do.
In my understanding, apply will actually do the same as iterrows(), i.e. iterate (over the rows if axis=0). On each iteration, the input x of the function should be the row iterated over. However, the error messages I keep receiving seem to disprove that assumption.
grid = np.random.rand(5,2)
df = pd.DataFrame(grid)
def multiply(x):
    x[3]=x[0]*x[1]
df = df.apply(multiply, axis=0)
The example above returns an empty df. Can anyone shed some light on my misunderstanding?
The apply() method applies a function along one axis of the DataFrame. The default is axis=0, the index axis, which means the function receives each column in turn.
The apply() method acts like Python's map() function: it takes a function as input and applies it across the DataFrame. For tabular data, specify the axis the function should act along (axis=0 passes each column to the function; axis=1 passes each row).
DataFrame apply() function: apply() is used to apply a function along an axis of the DataFrame. The objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1).
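To make the axis behaviour concrete, here is a minimal sketch. The column names A and B and the values are made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical toy data, chosen so the sums are easy to check by hand.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# axis=0: the function receives each column as a Series,
# so the result has one entry per column.
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series,
# so the result has one entry per row.
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```

With axis=0 the result is indexed by the column labels; with axis=1 it is indexed by the row labels.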
Pandas: apply a function to a single column. Create a function add_3() that adds 3 to each value and pass it to apply(). To apply it to a single column, select that column with df["col_name"]; for example, df["B"].apply(add_3) applies the function to column B only.
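A short sketch of the single-column case described above. The add_3 name comes from the text; the DataFrame contents are assumed purely for illustration.

```python
import pandas as pd

# Hypothetical data; only column B is transformed.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def add_3(value):
    # Called once per element of the selected column.
    return value + 3

# Series.apply works element-wise, so no axis argument is needed here.
df["B"] = df["B"].apply(add_3)
print(df)
```

For something this simple, df["B"] + 3 gives the same values without calling a Python function per element.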
import pandas as pd
import numpy as np
grid = np.random.rand(5,2)
df = pd.DataFrame(grid)
def multiply(x):
    return x[0]*x[1]
df['multiply'] = df.apply(multiply, axis = 1)
print(df)
Results in:
          0         1  multiply
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162
Explanation:
The function you are applying needs to return a value. You also want to apply it to each row, not each column, so the axis parameter you passed was incorrect.
Finally, notice that I am assigning the result to the 'multiply' column outside of my function. You can easily change this to df[3] = ... like you have, and get a dataframe like this:
          0         1         3
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162
It should be noted that you can use lambda functions as well. See the pandas documentation for apply.
For your example, you can run:
df['multiply'] = df.apply(lambda row: row[0] * row[1], axis = 1)
which produces the same output as @Andy's answer.
This can be useful if your function takes the values as separate arguments, in the form of
def multiply(a,b):
    return a*b
df['multiply'] = df.apply(lambda row: multiply(row[0] ,row[1]), axis = 1)
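Since the question asks how to pass arguments: besides wrapping the call in a lambda, DataFrame.apply also accepts extra positional arguments via args and forwards extra keyword arguments to the function. A minimal sketch, where scaled_product, factor and offset are hypothetical names used only for illustration:

```python
import pandas as pd

# Hypothetical data with integer column labels 0 and 1, as in the question.
df = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})

def scaled_product(row, factor, offset=0.0):
    # row is the Series for one row; factor and offset are forwarded by apply.
    return row[0] * row[1] * factor + offset

# args supplies extra positional arguments; unknown keyword
# arguments (offset here) are passed through to the function.
df["scaled"] = df.apply(scaled_product, axis=1, args=(10,), offset=1.0)
print(df)
```

This avoids the lambda when the function already takes the row as its first parameter.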
There are more examples in the Enhancing Performance section of the pandas documentation.
When applying a function, that function needs to return the result of the operation over the column/row. You are getting None because multiply doesn't return anything. That is, the function passed to apply should return a result computed from the values it receives, not do the assignment itself.
You're also iterating over the wrong axis here. Your current code takes the first and second element of each column and multiplies them together.
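The missing return can be seen without apply at all. In this small sketch the names multiply_no_return and multiply_fixed and the sample row are made up for illustration:

```python
import pandas as pd

# A stand-in for one row of the DataFrame (hypothetical values).
row = pd.Series([2.0, 3.0])

def multiply_no_return(x):
    # Computes the product but discards it, so the call yields None.
    x[0] * x[1]

def multiply_fixed(x):
    # Returns the product so that apply has a value to collect.
    return x[0] * x[1]

print(multiply_no_return(row))
print(multiply_fixed(row))
```

Whatever the function returns for each row or column is what apply assembles into its result, so a function that returns None gives apply nothing useful to collect.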
A correct multiply function:
def multiply(x):
    return x[0]*x[1]
df[3] = df.apply(multiply, axis='columns')
That said, you can do much better than apply here, as it is not a vectorized operation. Just multiply the columns together directly.
df[3] = df[0]*df[1]
In general, you should avoid apply when possible, as it is not much more than a loop under the hood.
One of the rules of Pandas Zen says: always try to find a vectorized solution first.
.apply(..., axis=1) is not vectorized!
Consider alternatives:
In [164]: df.prod(axis=1)
Out[164]:
0 0.770675
1 0.539782
2 0.318027
3 0.597172
4 0.211643
dtype: float64
In [165]: df[0] * df[1]
Out[165]:
0 0.770675
1 0.539782
2 0.318027
3 0.597172
4 0.211643
dtype: float64
Timing against a 50,000-row DataFrame:
In [166]: df = pd.concat([df] * 10**4, ignore_index=True)
In [167]: df.shape
Out[167]: (50000, 2)
In [168]: %timeit df.apply(multiply, axis=1)
1 loop, best of 3: 6.12 s per loop
In [169]: %timeit df.prod(axis=1)
100 loops, best of 3: 6.23 ms per loop
In [170]: def multiply_vect(x1, x2):
     ...:     return x1*x2
     ...:
In [171]: %timeit multiply_vect(df[0], df[1])
1000 loops, best of 3: 604 µs per loop
Conclusion: use .apply() as a very last resort (i.e. when nothing else helps).
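The timings above use IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard timeit module. This is only an illustration: the 5,000-row size, the seed and the repeat count are assumptions chosen to keep the run short, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed size and seed; the session above used 50,000 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5_000, 2)))

def multiply(row):
    return row[0] * row[1]

# The three approaches compared in this answer.
candidates = {
    "apply(axis=1)": lambda: df.apply(multiply, axis=1),
    "prod(axis=1)": lambda: df.prod(axis=1),
    "df[0] * df[1]": lambda: df[0] * df[1],
}

timings = {}
for name, func in candidates.items():
    # Average wall-clock time over three calls.
    timings[name] = timeit.timeit(func, number=3) / 3
    print(f"{name:>15}: {timings[name] * 1e3:.3f} ms per call")

# Collect each result so the approaches can be compared for equality.
results = [func() for func in candidates.values()]
```

All three approaches compute the same products; only the time taken differs.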