Speeding up Pandas apply function

For a relatively big Pandas DataFrame (a few hundred thousand rows), I'd like to create a series that is the result of an apply function. The problem is that the function is not very fast, and I was hoping it could be sped up somehow.

import math
import numpy as np
import pandas as pd

df = pd.DataFrame({
 'value-1': [1, 2, 3, 4, 5],
 'value-2': [0.1, 0.2, 0.3, 0.4, 0.5],
 'value-3': somenumbers...,
 'value-4': more numbers...,
 'choice-index': [1, 1, np.nan, 2, 1]
})

def func(row):
  i = row['choice-index']
  return np.nan if math.isnan(i) else row['value-%d' % i]

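# Note: `reduce` was accepted by pandas when this was asked; it was removed in pandas 1.0, so drop it on newer versions.
df['value'] = df.apply(func, axis=1, reduce=True)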

# expected value = [1, 2, np.nan, 0.4, 5]
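
For anyone who wants to run the snippet above directly, here is a self-contained sketch. The numbers in value-3 and value-4 are made up for illustration (the originals are elided), and `reduce=True` is left out on the assumption of a recent pandas, where that argument no longer exists.

```python
import math

import numpy as np
import pandas as pd

# value-3 and value-4 are placeholder numbers, not the original data.
df = pd.DataFrame({
    'value-1': [1, 2, 3, 4, 5],
    'value-2': [0.1, 0.2, 0.3, 0.4, 0.5],
    'value-3': [10, 20, 30, 40, 50],
    'value-4': [100, 200, 300, 400, 500],
    'choice-index': [1, 1, np.nan, 2, 1],
})


def func(row):
    # choice-index is a float column because it contains NaN.
    i = row['choice-index']
    return np.nan if math.isnan(i) else row['value-%d' % i]


# Row-wise apply: one Python-level call per row, which is why it is slow.
df['value'] = df.apply(func, axis=1)
print(df['value'].tolist())
```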

Any suggestions are welcome.

Update

A very small speedup (~1.1x) can be achieved by pre-caching the column names. func would change to:

cached_columns = [None, 'value-1', 'value-2', 'value-3', 'value-4']
def func(row):
  i = row['choice-index']
  # i is a float (the column contains NaN), so cast it before indexing the list
  return np.nan if math.isnan(i) else row[cached_columns[int(i)]]

But I was hoping for greater speedups...

orange asked Jul 12 '15 02:07



1 Answer

I think I got a good solution (speedup ~150x).

The trick is to avoid apply altogether and instead select rows with a boolean mask, once per possible choice index, so that each assignment is vectorized.

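choice_indices = [1, 2, 3, 4]
for idx in choice_indices:
  # Boolean mask of the rows that picked this index; NaN rows never match.
  mask = df['choice-index'] == idx
  result_column = 'value-%d' % idx
  # Copy the matching rows from the chosen column into 'value'.
  df.loc[mask, 'value'] = df.loc[mask, result_column]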
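
To check the mask-based loop against the row-wise apply on a small frame, something like the following sketch can be used. The value-3/value-4 numbers and the helper names `select_with_apply`, `select_with_masks`, and `select_with_numpy` are made up for illustration. The last helper is not the approach from this answer but a different, loop-free variant using NumPy integer indexing, and it assumes every non-NaN choice-index points at an existing value-N column.

```python
import math

import numpy as np
import pandas as pd

# Placeholder data; value-3 and value-4 are invented for this sketch.
df = pd.DataFrame({
    'value-1': [1, 2, 3, 4, 5],
    'value-2': [0.1, 0.2, 0.3, 0.4, 0.5],
    'value-3': [10, 20, 30, 40, 50],
    'value-4': [100, 200, 300, 400, 500],
    'choice-index': [1, 1, np.nan, 2, 1],
})
choice_indices = [1, 2, 3, 4]


def select_with_apply(frame):
    # Baseline: one Python call per row.
    def func(row):
        i = row['choice-index']
        return np.nan if math.isnan(i) else row['value-%d' % i]
    return frame.apply(func, axis=1)


def select_with_masks(frame):
    # One boolean mask per possible index; NaN rows never match and stay NaN.
    result = pd.Series(np.nan, index=frame.index)
    for idx in choice_indices:
        mask = frame['choice-index'] == idx
        result.loc[mask] = frame.loc[mask, 'value-%d' % idx]
    return result


def select_with_numpy(frame):
    # Alternative (not from the answer): pick one element per row by position.
    columns = ['value-%d' % i for i in choice_indices]
    values = frame[columns].to_numpy(dtype=float)
    choice = frame['choice-index'].to_numpy()
    valid = ~np.isnan(choice)
    # Column position is choice - 1; NaN rows get a dummy 0 and are masked below.
    cols = np.where(valid, choice - 1, 0).astype(int)
    picked = values[np.arange(len(frame)), cols]
    return pd.Series(np.where(valid, picked, np.nan), index=frame.index)


expected = select_with_apply(df)
assert select_with_masks(df).equals(expected)
assert select_with_numpy(df).equals(expected)
```

How much faster either variant is will depend on the number of rows and of distinct choice indices, so it is worth timing them on the real data.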
orange answered Oct 13 '22 23:10