
Pandas mask / where methods versus NumPy np.where

I often use Pandas mask and where methods for cleaner logic when updating values in a series conditionally. However, for relatively performance-critical code I notice a significant performance drop relative to numpy.where.

While I'm happy to accept this for specific cases, I'm interested to know:

  1. Do Pandas mask / where methods offer any additional functionality, apart from the inplace / errors / try_cast parameters? I understand these three parameters but rarely use them. For example, I have no idea what the level parameter refers to. (A small sketch of the behavioural differences I do rely on follows this list.)
  2. Is there any non-trivial counter-example where mask / where outperforms numpy.where? If such an example exists, it could influence how I choose appropriate methods going forwards.
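
To make the comparison concrete, here is the kind of behavioural difference I already rely on (a small sketch, nothing exhaustive): mask / where hand back a Series with the original index and default the replacement to NaN, while np.where wants both branches spelled out and returns a bare ndarray.

import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.7, 0.4], index=['a', 'b', 'c'])

# mask keeps the Series index and defaults the replacement value to NaN
print(s.mask(s > 0.5))
# a    0.2
# b    NaN
# c    0.4
# dtype: float64

# np.where needs both branches and returns a bare ndarray, so the index is lost
print(np.where(s > 0.5, np.nan, s))
# [0.2 nan 0.4]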

For reference, here's some benchmarking on Pandas 0.19.2 / Python 3.6.0:

import numpy as np
import pandas as pd

np.random.seed(0)

n = 10000000
df = pd.DataFrame(np.random.random(n))

assert (df[0].mask(df[0] > 0.5, 1).values == np.where(df[0] > 0.5, 1, df[0])).all()

%timeit df[0].mask(df[0] > 0.5, 1)       # 145 ms per loop
%timeit np.where(df[0] > 0.5, 1, df[0])  # 113 ms per loop

The performance appears to diverge further for non-scalar values:

%timeit df[0].mask(df[0] > 0.5, df[0]*2)       # 338 ms per loop
%timeit np.where(df[0] > 0.5, df[0]*2, df[0])  # 153 ms per loop
jpp asked Aug 23 '18

1 Answer

I'm using pandas 0.23.3 and Python 3.6, so I can see a real difference in running time only for your second example.

But let's investigate a slightly different version of your second example (so we get 2*df[0] out of the way). Here is our baseline on my machine:

twice = df[0]*2
mask = df[0] > 0.5

%timeit np.where(mask, twice, df[0])
# 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df[0].mask(mask, twice)
# 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy's version is about 2.3 times faster than pandas.

So let's profile both functions to see the difference. Profiling is a good way to get the big picture when one isn't very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what's going on just by reading the code.

I'm on Linux and use perf. For the numpy version we get (for the listing see Appendix A):

>>> perf record python np_where.py
>>> perf report

Overhead  Command  Shared Object                                Symbol
  68,50%  python   multiarray.cpython-36m-x86_64-linux-gnu.so   [.] PyArray_Where
   8,96%  python   [unknown]                                    [k] 0xffffffff8140290c
   1,57%  python   mtrand.cpython-36m-x86_64-linux-gnu.so       [.] rk_random

As we can see, the lion's share of the time is spent in PyArray_Where - about 69%. The unknown symbol is a kernel function (as a matter of fact, clear_page) - I ran without root privileges, so the symbol is not resolved.

And for pandas we get (see Appendix B for code):

>>> perf record python pd_mask.py
>>> perf report

Overhead  Command  Shared Object                                Symbol
  37,12%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
  23,36%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
  19,78%  python   [unknown]                                    [k] 0xffffffff8140290c
   3,32%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
   1,48%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not

Quite a different situation:

  • pandas doesn't use PyArray_Where under the hood - the most prominent time-consumer is vm_engine_iter_task, which is numexpr functionality (a quick cross-check of numexpr's involvement is sketched right after this list).
  • there is some heavy memory-copying going on - __memmove_ssse3_back uses about 25% of the time! Probably some of the kernel's functions are also connected to memory accesses.
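
As a side note (not something I profiled above, just a quick sketch): pandas exposes an option to toggle its numexpr usage, which makes it easy to see whether the numexpr path is really the one being timed. I'm assuming the option name compute.use_numexpr here, which is the one documented for recent pandas versions; if your version spells it differently, the before/after timing comparison is the part that matters.

import numpy as np
import pandas as pd

# same setup as in the timings above
np.random.seed(0)
df = pd.DataFrame(np.random.random(10000000))
mask = df[0] > 0.5
twice = df[0] * 2

# default: pandas uses numexpr where it can
pd.set_option('compute.use_numexpr', True)
%timeit df[0].mask(mask, twice)

# with numexpr switched off, mask/where should fall back to a plain numpy path,
# so a change in the timing (and in the perf profile) points at numexpr
pd.set_option('compute.use_numexpr', False)
%timeit df[0].mask(mask, twice)

# restore the default
pd.set_option('compute.use_numexpr', True)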

Actually, pandas 0.19 used PyArray_Where under the hood; for that older version the perf report would look like:

Overhead  Command  Shared Object      Symbol
  32,42%  python   multiarray.so      [.] PyArray_Where
  30,25%  python   libc-2.23.so       [.] __memmove_ssse3_back
  21,31%  python   [kernel.kallsyms]  [k] clear_page
   1,72%  python   [kernel.kallsyms]  [k] __schedule

So back then it basically used np.where under the hood plus some overhead (above all, data copying - see __memmove_ssse3_back).

I see no scenario where pandas could become faster than numpy in pandas version 0.19 - it just adds overhead to numpy's functionality. Pandas version 0.23.3 is an entirely different story - here the numexpr module is used, and it is very possible that there are scenarios for which pandas' version is (at least slightly) faster.

I'm not sure this memory copying is really called for/necessary - maybe one could even call it a performance bug - but I just don't know enough to be certain.

We could help pandas avoid copying by peeling away some indirections (passing np.array instead of pd.Series). For example:

%timeit df[0].mask(mask.values > 0.5, twice.values)
# 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now, pandas is only 25% slower. The perf says:

Overhead  Command  Shared Object                                Symbol
  50,81%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
  14,12%  python   [unknown]                                    [k] 0xffffffff8140290c
   9,93%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
   4,61%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
   2,01%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not

Much less data copying, but still more than in the numpy version - and this copying is mostly responsible for the remaining overhead.

My key take-aways from it:

  • pandas has the potential to be at least slightly faster than numpy (because, via numexpr, being faster is possible). However, pandas' somewhat opaque handling of data copying makes it hard to predict when this potential is overshadowed by (unnecessary) data copying.

  • when the performance of where/mask is the bottleneck, I would use numba/cython to improve the performance - see my rather naive attempts to use numba and cython further below.


The idea is to take

np.where(df[0] > 0.5, df[0]*2, df[0]) 

version and eliminate the need to create a temporary - i.e., df[0]*2.

As proposed by @max9111, using numba:

import numba as nb

@nb.njit
def nb_where(df):
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output

assert (np.where(df[0] > 0.5, twice, df[0]) == nb_where(df[0].values)).all()

%timeit np.where(df[0] > 0.5, df[0]*2, df[0])
# 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nb_where(df[0].values)
# 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Which is about a factor of 5 faster than the numpy version!

And here is my far less successful attempt to improve the performance with the help of Cython:

%%cython -a
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output

assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()

%timeit cy_where(df[0].values)
# 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

gives a 25% speed-up. Not sure why cython is so much slower than numba, though.


Listings:

A: np_where.py:

import pandas as pd
import numpy as np

np.random.seed(0)

n = 10000000
df = pd.DataFrame(np.random.random(n))

twice = df[0]*2
for _ in range(50):
    np.where(df[0] > 0.5, twice, df[0])

B: pd_mask.py:

import pandas as pd
import numpy as np

np.random.seed(0)

n = 10000000
df = pd.DataFrame(np.random.random(n))

twice = df[0]*2
mask = df[0] > 0.5
for _ in range(50):
    df[0].mask(mask, twice)
ead answered Oct 03 '22