
Python pandas: why is map faster?

Tags: python, pandas

In the pandas manual, there is this example about indexing:

In [653]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [654]: df2[criterion]

Then Wes wrote:

# equivalent but slower
In [655]: df2[[x.startswith('t') for x in df2['a']]]

Can anyone explain why the map approach is faster? Is this a Python feature or a pandas feature?

asked Sep 21 '13 11:09 by James Bond


1 Answer

Arguments about why a certain way of doing things in Python "should be" faster can't be taken too seriously, because you're often measuring implementation details which may behave differently in certain situations. As a result, when people guess what should be faster, they're often (usually?) wrong. For example, I find that map can actually be slower. Using this setup code:

import numpy as np, pandas as pd
import random, string

def make_test(num, width):
    # Build `num` random strings of `width` distinct lowercase letters (width <= 26).
    s = [''.join(random.sample(string.ascii_lowercase, width)) for _ in range(num)]
    return pd.DataFrame({"a": s})

Let's compare two things: the time it takes to make the indexing object (whether a Series or a list) and the time it takes to use that object to index into the DataFrame. It could be, for example, that making a list is fast, but that the list has to be converted internally to a Series or an ndarray before it can be used as an index, which would add time at the indexing step.
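Outside IPython, the same split can be measured with the standard-library timeit module. This is only a minimal sketch, not how the numbers in this answer were produced: the helper time_build_and_index, the builders dict, and the frame size of 1000 are hypothetical choices made for illustration.

```python
import random
import string
import timeit

import pandas as pd


def make_test(num, width):
    # Same setup as above: one column "a" of random lowercase strings.
    s = [''.join(random.sample(string.ascii_lowercase, width)) for _ in range(num)]
    return pd.DataFrame({"a": s})


def time_build_and_index(df, build_mask, number=20, repeat=3):
    """Return (build, index) in seconds per call (hypothetical helper).

    build: time to create the indexing object.
    index: time to index df with an already-built indexing object.
    """
    mask = build_mask(df)
    build = min(timeit.repeat(lambda: build_mask(df), number=number, repeat=repeat)) / number
    index = min(timeit.repeat(lambda: df[mask], number=number, repeat=repeat)) / number
    return build, index


df = make_test(1000, 10)  # assumed size, chosen only to keep the run short

# The three ways of building the boolean mask compared in this answer.
builders = {
    "map": lambda d: d['a'].map(lambda x: x.startswith('t')),
    "listcomp": lambda d: [x.startswith('t') for x in d['a']],
    "str": lambda d: d['a'].str.startswith('t'),
}

results = {name: time_build_and_index(df, fn) for name, fn in builders.items()}
for name, (build, index) in results.items():
    print(f"{name:9s} build {build * 1e6:9.1f} us   index {index * 1e6:9.1f} us")
```

Reporting the two costs separately shows whether a cheap-to-build list becomes expensive at the indexing step. The absolute numbers will depend on the machine and on the Python and pandas versions.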

First, for a small frame:

>>> df = make_test(10, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10000 loops, best of 3: 85.8 µs per loop
>>> %timeit [x.startswith('t') for x in df['a']]
100000 loops, best of 3: 15.6 µs per loop
>>> %timeit df['a'].str.startswith("t")
10000 loops, best of 3: 118 µs per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 304 µs per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10000 loops, best of 3: 194 µs per loop
>>> %timeit df[df['a'].str.startswith("t")]
1000 loops, best of 3: 348 µs per loop

In this case the listcomp is fastest. That doesn't surprise me too much, to be honest, because going via a lambda is likely to be slower than calling str.startswith directly, but it's really hard to guess. With only 10 rows we're probably still measuring fixed overhead such as the setup cost of a Series. What happens in a larger frame?

>>> df = make_test(10**5, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 46.6 ms per loop
>>> %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 27.8 ms per loop
>>> %timeit df['a'].str.startswith("t")
10 loops, best of 3: 48.5 ms per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
10 loops, best of 3: 47.1 ms per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10 loops, best of 3: 52.8 ms per loop
>>> %timeit df[df['a'].str.startswith("t")]
10 loops, best of 3: 49.6 ms per loop

And now it seems like the map is winning when used as an index, although the difference is marginal. But not so fast: what if we manually turn the listcomp into an array or a Series?

>>> %timeit df[np.array([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 40.7 ms per loop
>>> %timeit df[pd.Series([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 37.5 ms per loop

And now the listcomp wins again!

Conclusion: who knows? But never believe anything without timeit results, and even then you have to ask whether you're testing what you think you are.
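One part of "testing what you think you are" is confirming that the variants being timed select the same rows in the first place. Here is a minimal sketch of such a check, using a small made-up frame (the animal names are illustrative data, not from the benchmarks above):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen so that three values start with 't'.
df = pd.DataFrame({"a": ["tiger", "lion", "tapir", "otter", "toad"]})

by_map = df[df['a'].map(lambda x: x.startswith('t'))]
by_listcomp = df[[x.startswith('t') for x in df['a']]]
by_str = df[df['a'].str.startswith('t')]
by_array = df[np.array([x.startswith('t') for x in df['a']])]

# Every variant should return exactly the same rows before timings are compared.
assert by_map.equals(by_listcomp)
assert by_map.equals(by_str)
assert by_map.equals(by_array)

print(by_map['a'].tolist())  # ['tiger', 'tapir', 'toad']
```

If any of these assertions failed, the timings would be comparing different operations and would not be meaningful.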

answered Oct 02 '22 04:10 by DSM