
Python Pandas, apply function

Tags: python, pandas

I am trying to use apply to avoid an iterrows() iterator in a function.

However, that pandas method is poorly documented and I can't find an example of how to use it, apart from the unhelpful .apply(np.sqrt) in the documentation. There are no examples of how to pass arguments, etc.

Anyway, here is a toy example of what I am trying to do.

In my understanding, apply does the same thing as iterrows(), i.e. it iterates (over the rows if axis=0). On each iteration, the input x of the function should be the row being iterated over. However, the error messages I keep receiving seem to disprove that assumption.

grid = np.random.rand(5,2)
df = pd.DataFrame(grid)

def multiply(x):
    x[3]=x[0]*x[1]

df = df.apply(multiply, axis=0)

The example above returns an empty df. Can anyone shed some light on my misunderstanding?

jim jarnac asked Apr 18 '17 19:04




4 Answers

import pandas as pd
import numpy as np

grid = np.random.rand(5,2)
df = pd.DataFrame(grid)

def multiply(x):
    return x[0]*x[1]

df['multiply'] = df.apply(multiply, axis = 1)
print(df)

Results in:

          0         1  multiply
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162

Explanation:

The function you apply needs to return a value. You also want it to run on each row, not each column, so the axis parameter should be axis=1 rather than axis=0.
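To make the axis point concrete, here is a minimal sketch of what apply passes to the function for each axis value. The fixed 3x2 frame and the helper name describe are illustrative assumptions, not part of the question.

```python
import pandas as pd

# Small fixed frame so the shapes are easy to follow (values are illustrative).
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

def describe(x):
    # x is a Series; report how many elements it holds.
    return len(x)

# axis=0 (the default): the function is called once per column,
# and each Series holds that column's 3 values.
per_column = df.apply(describe, axis=0)

# axis=1: the function is called once per row,
# and each Series holds that row's 2 values.
per_row = df.apply(describe, axis=1)

print(per_column.tolist())  # [3, 3]
print(per_row.tolist())     # [2, 2, 2]

# Returning a value per row gives a Series aligned with the frame's index,
# which can then be assigned to a new column.
df["multiply"] = df.apply(lambda row: row[0] * row[1], axis=1)
print(df["multiply"].tolist())  # [2.0, 12.0, 30.0]
```

So with axis=0, x[0]*x[1] multiplies the first two entries of a column, while with axis=1 it multiplies the two columns of a row.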

Finally, notice that I assign the result to the 'multiply' column outside of the function. You can easily change this to df[3] = ..., as in your code, and get a dataframe like this:

          0         1         3
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162
Andy answered Oct 04 '22 05:10


Note that you can use lambda functions as well. See the pandas documentation for DataFrame.apply.

For your example, you can run:

df['multiply'] = df.apply(lambda row: row[0] * row[1], axis = 1)

which produces the same output as @Andy's answer.

This can be useful if your function takes separate arguments, for example:

def multiply(a,b):
    return a*b

df['multiply'] = df.apply(lambda row: multiply(row[0] ,row[1]), axis = 1)
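As an alternative to wrapping the call in a lambda, apply can forward extra arguments itself through its args tuple and through keyword arguments. The sketch below assumes a hypothetical helper scaled_product with made-up parameters factor and offset; only the args and keyword forwarding are pandas features.

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])

def scaled_product(row, factor, offset=0.0):
    # Hypothetical helper: the row arrives first, extra arguments follow.
    return row[0] * row[1] * factor + offset

# Positional extras go in `args` (a tuple); unknown keyword arguments
# such as `offset` are forwarded to the function unchanged.
df["scaled"] = df.apply(scaled_product, axis=1, args=(10,), offset=1.0)

print(df["scaled"].tolist())  # [21.0, 121.0]
```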

More examples are in the Enhancing Performance section of the pandas documentation.

Jon answered Oct 04 '22 05:10


When applying a function, you need that function to return the result of the operation for each column/row. You are getting None because multiply doesn't return anything. That is, the applied function should return a value computed from the row or column it receives, not do the assignment itself.

You're also iterating over the wrong axis here. Your current code takes the first and second element of each column and multiplies them together.

A correct multiply function:

def multiply(x):
    return x[0]*x[1]

df[3] = df.apply(multiply, axis='columns')

With that being said, you can do much better than apply here, as it is not a vectorized operation. Just multiply the columns together directly.

df[3] = df[0]*df[1]

In general, you should avoid apply when possible as it is not much more than a loop itself under the hood.
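The same idea often carries over to row-wise functions that contain a branch. As a rough sketch, assuming a made-up rule (keep the product only when the first column exceeds 0.5), the per-row function can be replaced by np.where over whole columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 2)))

def product_if_large(row):
    # Hypothetical rule, used only for illustration.
    return row[0] * row[1] if row[0] > 0.5 else 0.0

# Row-by-row version: one Python function call per row.
looped = df.apply(product_if_large, axis=1)

# Column-wise version: the condition and the product are evaluated
# on whole columns at once.
vectorized = pd.Series(
    np.where(df[0] > 0.5, df[0] * df[1], 0.0),
    index=df.index,
)

print(np.allclose(looped, vectorized))  # True
```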

miradulo answered Oct 04 '22 06:10


One of the rules of Pandas Zen says: always try to find a vectorized solution first.

.apply(..., axis=1) is not vectorized!

Consider alternatives:

In [164]: df.prod(axis=1)
Out[164]:
0    0.770675
1    0.539782
2    0.318027
3    0.597172
4    0.211643
dtype: float64

In [165]: df[0] * df[1]
Out[165]:
0    0.770675
1    0.539782
2    0.318027
3    0.597172
4    0.211643
dtype: float64

Timing against a 50,000-row DataFrame:

In [166]: df = pd.concat([df] * 10**4, ignore_index=True)

In [167]: df.shape
Out[167]: (50000, 2)

In [168]: %timeit df.apply(multiply, axis=1)
1 loop, best of 3: 6.12 s per loop

In [169]: %timeit df.prod(axis=1)
100 loops, best of 3: 6.23 ms per loop

In [170]: def multiply_vect(x1, x2):
     ...:     return x1*x2
     ...:

In [171]: %timeit multiply_vect(df[0], df[1])
1000 loops, best of 3: 604 µs per loop

Conclusion: use .apply() as a very last resort (i.e. when nothing else helps)
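For readers without IPython, the comparison above can be approximated with the standard timeit module. This is a sketch under assumptions: the 20,000-row size and the seed are arbitrary choices, and the absolute numbers will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20_000, 2)))  # size chosen arbitrarily

def multiply(row):
    return row[0] * row[1]

# The three approaches compared in this answer.
candidates = {
    "apply(axis=1)": lambda: df.apply(multiply, axis=1),
    "prod(axis=1)": lambda: df.prod(axis=1),
    "df[0] * df[1]": lambda: df[0] * df[1],
}

# Compute each result once so they can be checked against each other.
results = {name: fn() for name, fn in candidates.items()}

# Time each approach: best of 3 single runs.
for name, fn in candidates.items():
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name:>15}: {seconds * 1e3:.2f} ms")
```

All three produce the same values, so the choice between them is purely about speed and readability.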

MaxU - stop WAR against UA answered Oct 04 '22 05:10