
Run function exactly once for each row in a Pandas dataframe

If I have a function

def do_irreversible_thing(a, b):
    print(a, b)

And a dataframe, say

import pandas as pd

df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])

What's the best way to run the function exactly once for each row in a pandas dataframe? As pointed out in other questions, something like df.apply will call the function twice for the first row. Even using numpy

np.vectorize(do_irreversible_thing)(df.a, df.b)

causes the function to be called twice on the first row, as will df.T.apply() or df.apply(..., axis=1).

Is there a faster or cleaner way to call the function with every row than this explicit loop?

for idx, a, b in df.itertuples():
    do_irreversible_thing(a, b)
David Nehme asked Apr 13 '16 20:04




2 Answers

The way I do it (because I also don't like the idea of looping with df.itertuples) is:

df.apply(do_irreversible_thing, axis=1)

and then your function should take the whole row as a single argument:

def do_irreversible_thing(x):
    print(x.a, x.b)

This way your function runs over each row.

Or, if you can't modify your function, you could apply it like this:

df.apply(lambda x: do_irreversible_thing(x.iloc[0], x.iloc[1]), axis=1)
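If the double call on the first row is the real concern, an explicit loop is the easiest thing to reason about, since nothing runs the function behind your back. (As far as I know, apply stopped evaluating the first row twice in pandas 1.1, but I would check that against the version you have.) Below is a minimal sketch, assuming the function only needs the two column values; the calls list is a hypothetical stand-in for the irreversible side effect, added only so the number of invocations can be inspected.

```python
import pandas as pd

calls = []  # hypothetical log, used here only to count invocations


def do_irreversible_thing(a, b):
    # Stand-in for the real side effect: just record the arguments.
    calls.append((a, b))


df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])

# index=False leaves the index out of each tuple and name=None yields
# plain tuples, so each row unpacks directly into (a, b).
for a, b in df.itertuples(index=False, name=None):
    do_irreversible_thing(a, b)

# The same row pairs can be produced by zipping the columns directly.
pairs = list(zip(df['a'], df['b']))

print(len(calls))
```

Each row is visited once, so calls ends up with one entry per row. The zip form visits the same pairs and avoids building row tuples for columns you don't need.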
Rosa Alejandra answered Sep 28 '22 12:09


It's unclear what your function is doing, but to apply a function to each row you can pass axis=1 to apply so that it runs row-wise, and pass in the column elements of interest:

In [155]:
import pandas as pd

def foo(a, b):
    return a * b

df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])
df.apply(lambda x: foo(x['a'], x['b']), axis=1)

Out[155]:
0     0
1     6
2    20
dtype: int64

However, so long as your function does not depend on the df mutating on each row, you can just use a vectorised method to operate on the entire column:

In [156]:
df['a'] * df['b']

Out[156]:
0     0
1     6
2    20
dtype: int64

The reason to prefer this is that vectorised operations scale better, whereas apply is just syntactic sugar for iterating over your df, so it is essentially a for loop.
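To make the comparison concrete, here is a small sketch that computes the same product both ways and checks that the results agree. It assumes the function is a pure computation like foo above. A function with side effects, such as the one in the question, cannot be replaced by a column expression this way.

```python
import pandas as pd


def foo(a, b):
    # Pure computation: no side effects, so it is safe to vectorise.
    return a * b


df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])

# Row-wise: apply calls the lambda once per row in a Python-level loop.
row_wise = df.apply(lambda x: foo(x['a'], x['b']), axis=1)

# Vectorised: a single operation over whole columns.
vectorised = df['a'] * df['b']

print(row_wise.tolist())
print(vectorised.tolist())
```

Both forms give the same values. The difference only shows up as the frame grows, because the vectorised form avoids a Python function call per row.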

EdChum answered Sep 28 '22 14:09