
Why apply sometimes isn't faster than for-loop in pandas dataframe?

Tags: python, pandas

It seems that apply can speed up operations on a DataFrame in most cases, but when I use apply I don't see any speedup. Here is my example. I have a DataFrame with two columns:

>>> df
index col1 col2
1 10 20
2 20 30
3 30 40

For each row in the DataFrame, I want to apply a function R(x) to the value in col1 and divide the result by the value in col2. For example, the result for the first row should be R(10)/20.

This is my function which will be called in apply:

def _f(row):
    # row is one DataFrame row, passed in as a Series by apply(axis=1)
    return R(row['col1'])/row['col2']

Then I call _f in apply: df.apply(_f, axis=1)

But in this case I find that apply is much slower than a for loop such as:

for i in df.index:
    new_df.loc[i] = R(df.loc[i,'col1'])/df.loc[i,'col2']

Can anyone explain the reason?

asked Aug 14 '16 01:08 by Vision



1 Answer

It is my understanding that .apply is not generally faster than iterating over the axis yourself. I believe that under the hood it is merely a loop over the axis, and in this case you also pay the overhead of a Python function call for every row.
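One way to check this on your own data is to time both approaches side by side. The sketch below is only an illustration: R is a hypothetical stand-in (the question never defines it), the helper names with_apply and with_loop are made up, and the frame size is arbitrary. Timings depend on the pandas version and the machine, so none are quoted here.

```python
import timeit

import pandas as pd


def R(x):
    # Hypothetical stand-in for the question's R(x); the real one is not shown.
    return x * x + 1


df = pd.DataFrame({"col1": range(1, 1001), "col2": range(2, 1002)})


def with_apply(frame):
    # apply(axis=1) calls the lambda once per row, passing the row as a Series.
    return frame.apply(lambda row: R(row["col1"]) / row["col2"], axis=1)


def with_loop(frame):
    # Explicit loop over the index, collecting the results in a dict.
    out = {}
    for i in frame.index:
        out[i] = R(frame.loc[i, "col1"]) / frame.loc[i, "col2"]
    return pd.Series(out)


apply_time = timeit.timeit(lambda: with_apply(df), number=3)
loop_time = timeit.timeit(lambda: with_loop(df), number=3)
print(f"apply: {apply_time:.4f}s  loop: {loop_time:.4f}s")
```

Both helpers do per-row Python work, so neither should be expected to beat the other by a wide margin in every case.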

If we look at the pandas source code, we can see that it essentially iterates over the indicated axis and applies the function, storing each individual result in a dictionary. It then finally calls the DataFrame constructor on that dictionary and returns a new DataFrame (or builds a Series when the function returns scalars):

    if axis == 0:
        series_gen = (self._ixs(i, axis=1)
                      for i in range(len(self.columns)))
        res_index = self.columns
        res_columns = self.index
    elif axis == 1:
        res_index = self.index
        res_columns = self.columns
        values = self.values
        series_gen = (Series.from_array(arr, index=res_columns, name=name,
                                        dtype=dtype)
                      for i, (arr, name) in enumerate(zip(values,
                                                          res_index)))
    else:  # pragma : no cover
        raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))

    i = None
    keys = []
    results = {}
    if ignore_failures:
        successes = []
        for i, v in enumerate(series_gen):
            try:
                results[i] = func(v)
                keys.append(v.name)
                successes.append(i)
            except Exception:
                pass
        # so will work with MultiIndex
        if len(successes) < len(res_index):
            res_index = res_index.take(successes)
    else:
        try:
            for i, v in enumerate(series_gen):
                results[i] = func(v)
                keys.append(v.name)
        except Exception as e:
            if hasattr(e, 'args'):
                # make sure i is defined
                if i is not None:
                    k = res_index[i]
                    e.args = e.args + ('occurred at index %s' %
                                       pprint_thing(k), )
            raise

    if len(results) > 0 and is_sequence(results[0]):
        if not isinstance(results[0], Series):
            index = res_columns
        else:
            index = None

        result = self._constructor(data=results, index=index)
        result.columns = res_index

        if axis == 1:
            result = result.T
        result = result._convert(datetime=True, timedelta=True, copy=False)

    else:

        result = Series(results)
        result.index = res_index

    return result

Specifically:

    for i, v in enumerate(series_gen):
        results[i] = func(v)
        keys.append(v.name)

Where series_gen was constructed based on the requested axis.
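Stripped of its error handling, that loop amounts to something like the following. This is a simplified paraphrase of the axis=1 case for functions that return scalars, not the pandas implementation, and apply_rows_sketch is a made-up name.

```python
import pandas as pd


def apply_rows_sketch(frame, func):
    """Simplified paraphrase of row-wise apply for scalar-returning functions."""
    results = {}
    # frame.values is a 2-D array; every row is wrapped in a fresh Series
    # before func sees it, which adds to the per-row cost.
    for i, (arr, name) in enumerate(zip(frame.values, frame.index)):
        row = pd.Series(arr, index=frame.columns, name=name)
        results[i] = func(row)
    # Scalar results are collected into a Series carrying the original index.
    result = pd.Series(results)
    result.index = frame.index
    return result


df = pd.DataFrame({"col1": [10, 20, 30], "col2": [20, 30, 40]}, index=[1, 2, 3])


def ratio(row):
    # Plain col1/col2 ratio, used only to compare the two code paths.
    return row["col1"] / row["col2"]


sketch = apply_rows_sketch(df, ratio)
builtin = df.apply(ratio, axis=1)
```

On this small frame, sketch and builtin hold the same values, which suggests that apply does nothing cleverer here than a Python-level loop.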

To get more performance out of a function, you can follow the advice in the pandas documentation on enhancing performance.

Essentially, your options are:

  1. Write a C extension
  2. Use numba (a JIT compiler)
  3. Use pandas.eval to squeeze performance out of large DataFrames
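
As a rough illustration of the third option, the calculation can be evaluated over whole columns instead of row by row. This assumes R can be written as arithmetic on a column. The R below (x squared plus one) is a hypothetical stand-in, since the question does not define it. DataFrame.eval is the method counterpart of pandas.eval.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30], "col2": [20, 30, 40]}, index=[1, 2, 3])


def R(x):
    # Hypothetical stand-in; works on scalars and on whole columns alike.
    return x ** 2 + 1


# Row-by-row version from the question, for comparison.
row_wise = df.apply(lambda row: R(row["col1"]) / row["col2"], axis=1)

# The same arithmetic expressed once over entire columns.
column_wise = R(df["col1"]) / df["col2"]

# The same expression handed to eval as a string.
evaluated = df.eval("(col1 ** 2 + 1) / col2")
```

All three produce the same values. Any gain from eval tends to show up only on large frames, so it is worth timing on realistic data before switching.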
answered Sep 20 '22 08:09 by juanpa.arrivillaga
