If I have a function
def do_irreversible_thing(a, b):
    print(a, b)
And a dataframe, say
import pandas as pd
df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])
What's the best way to run the function exactly once for each row in a pandas dataframe? As pointed out in other questions, something like df.apply will call the function twice for the first row. Even using numpy
np.vectorize(do_irreversible_thing)(df.a, df.b)
causes the function to be called twice on the first row, as will df.T.apply() or df.apply(..., axis=1).
Is there a faster or cleaner way to call the function with every row than this explicit loop?
for idx, a, b in df.itertuples():
    do_irreversible_thing(a, b)
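For reference, here is a minimal sketch of that explicit loop which also checks the "exactly once" property. The `calls` list is a hypothetical stand-in for the irreversible side effect and is not part of the original function. Passing index=False and name=None to itertuples makes it yield plain tuples, so they unpack straight into the two arguments.

```python
import pandas as pd

calls = []  # hypothetical recorder standing in for the side effect

def do_irreversible_thing(a, b):
    calls.append((a, b))

df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])

# index=False drops the index column; name=None yields plain tuples
for a, b in df.itertuples(index=False, name=None):
    do_irreversible_thing(a, b)

# One recorded call per row, in row order
print(calls)
```

Because the loop is ordinary Python iteration, the function is called once per row regardless of how apply behaves on a given pandas version.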
Apply Function to Every Row of DataFrame
By using apply() you can call a function on every row of a pandas DataFrame. For example, an add() function passed to apply() is called on each row. To iterate row by row with apply(), pass axis=1.
The results show that apply massively outperforms iterrows. As mentioned previously, this is because apply is optimized for looping through dataframe rows much quicker than iterrows does. While slower than apply, itertuples is quicker than iterrows, so if looping is required, try implementing itertuples instead.
iterrows() iterates over the rows of a pandas DataFrame as (index, Series) pairs: for each row it yields a tuple of the row's index label and the row's contents as a Series.
iterrows() is slower than itertuples() largely because it builds a Series for every row, with all the type checks that involves, whereas itertuples() yields lightweight namedtuples.
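A small sketch of the difference between the two iterators, using the dataframe from the question. The variable names are just for illustration. It only shows what each iterator yields, not how fast it is.

```python
import pandas as pd

df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])

# iterrows yields (index, Series) pairs; a Series is built for every row
rows_from_iterrows = [(idx, row['a'], row['b']) for idx, row in df.iterrows()]

# itertuples yields namedtuples whose first field, Index, is the row index
rows_from_itertuples = [(t.Index, t.a, t.b) for t in df.itertuples()]

# Both approaches visit the same rows in the same order
print(rows_from_iterrows == rows_from_itertuples)
```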
The way I do it (because I also don't like the idea of looping with df.itertuples) is:
df.apply(do_irreversible_thing, axis=1)
and then your function should be like:
def do_irreversible_thing(x):
    print(x.a, x.b)
This way you should be able to run your function over each row.
Or, if you can't modify your function, you could apply it like this:
df.apply(lambda x: do_irreversible_thing(x['a'], x['b']), axis=1)
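Since the question is about side effects, it may be worth checking how many times apply actually calls the function on your pandas version. The sketch below uses a hypothetical `calls` list as a stand-in for the side effect. Older pandas releases evaluated the first row twice; as far as I know this changed in pandas 1.1, so treat the count as something to verify rather than rely on.

```python
import pandas as pd

calls = []  # hypothetical recorder standing in for the side effect

def do_irreversible_thing(a, b):
    calls.append((int(a), int(b)))

df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])

# Row-wise apply; each row arrives as a Series indexed by column name
df.apply(lambda row: do_irreversible_thing(row['a'], row['b']), axis=1)

# Compare the number of calls with the number of rows
print(len(calls), len(df))
```

If the two numbers differ, the function ran more than once for some row and the explicit loop is the safer choice.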
It's unclear what your function is doing, but to apply a function to each row you can pass axis=1 to apply your function row-wise and pass the column elements of interest:
In [155]:
def foo(a,b):
    return a*b
df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])
df.apply(lambda x: foo(x['a'], x['b']), axis=1)
Out[155]:
0 0
1 6
2 20
dtype: int64
However, so long as your function does not depend on the df mutating on each row, then you can just use a vectorised method to operate on the entire column:
In [156]:
df['a'] * df['b']
Out[156]:
0 0
1 6
2 20
dtype: int64
The reason is that vectorised operations work on whole columns at once and so scale better, whilst apply with axis=1 is essentially just syntactic sugar for a for loop over your df.
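As a rough illustration, when the function only uses arithmetic (as foo does), you can often pass the whole columns to it directly instead of going through apply. This is a sketch under that assumption. It will not work for functions with per-row side effects or Python-level branching on the values.

```python
import pandas as pd

def foo(a, b):
    return a * b

df = pd.DataFrame([(0, 1), (2, 3), (4, 5)], columns=['a', 'b'])

# Row-wise: foo is called once per row with scalar arguments
row_wise = df.apply(lambda x: foo(x['a'], x['b']), axis=1)

# Vectorised: foo is called once with two whole columns (Series)
direct = foo(df['a'], df['b'])

# Both give the same values
print(row_wise.tolist() == direct.tolist())
```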