
Why Python loops over slices of numpy arrays are faster than fully vectorized operations

Tags: python, numpy

I need to create a boolean mask by thresholding a 3D data array: the mask must be True at locations where the data are smaller than the lower acceptable limit or larger than the upper acceptable limit, and False elsewhere. Succinctly:

mask = (data < low) | (data > high)

I have two versions of the code for performing this operation: one works directly with the entire 3D array in numpy, while the other loops over slices of the array. Contrary to my expectations, the second method is faster than the first. Why?

In [1]: import numpy as np

In [2]: import sys

In [3]: print(sys.version)
3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

In [4]: print(np.__version__)
1.14.0

In [5]: arr = np.random.random((10, 1000, 1000))

In [6]: def method1(arr, low, high):
   ...:     """ Fully vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     np.greater_equal(arr, high, out)
   ...:     np.logical_or(out, arr < low, out)
   ...:     return out
   ...: 

In [7]: def method2(arr, low, high):
   ...:     """ Partially vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     for k in range(arr.shape[0]):
   ...:         a = arr[k]
   ...:         o = out[k]
   ...:         np.greater_equal(a, high, o)
   ...:         np.logical_or(o, a < low, o)
   ...:     return out
   ...: 

First of all, let's make sure that both methods produce identical results:

In [8]: np.all(method1(arr, 0.2, 0.8) == method2(arr, 0.2, 0.8))
Out[8]: True

And now some timing tests:

In [9]: %timeit method1(arr, 0.2, 0.8)
14.4 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit method2(arr, 0.2, 0.8)
11.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

What is going on here?


EDIT 1: Similar behavior is observed in an older environment:

In [3]: print(sys.version)
2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

In [4]: print(np.__version__)
1.11.3

In [9]: %timeit method1(arr, 0.2, 0.8)
100 loops, best of 3: 14.3 ms per loop

In [10]: %timeit method2(arr, 0.2, 0.8)
100 loops, best of 3: 13 ms per loop
Asked by AGN Gazer on Feb 20 '18

1 Answer

Outperforming both methods

In method1 you access the whole array twice. If it doesn't fit in the cache, the data will be read from RAM twice, which lowers performance. Additionally, temporary arrays may be created, as mentioned in the comments: the expression arr < low allocates a full-size boolean temporary on every call.
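
To illustrate the point about temporaries, here is a sketch (my addition, not from the original answer; the helper name and buffer-reuse scheme are assumptions) that writes the comparison into a preallocated buffer instead:

import numpy as np

def method1_no_temp(arr, low, high, tmp=None):
    # Hypothetical variant of method1: reuse `tmp` across calls so that
    # `arr < low` does not allocate a fresh boolean temporary each time.
    out = np.empty(arr.shape, dtype=bool)
    if tmp is None:
        tmp = np.empty(arr.shape, dtype=bool)
    np.greater_equal(arr, high, out)   # out = arr >= high
    np.less(arr, low, tmp)             # tmp = arr < low, written in place
    np.logical_or(out, tmp, out)       # out |= tmp
    return out

This removes the hidden allocation, but note that it still makes two full passes over the data, so the cache argument below still applies.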

method2 is more cache friendly, since each iteration accesses a smaller part of the array twice, and that part is likely to fit in cache. The downsides are the slow Python loop and the extra function calls, which are also quite slow.
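
To make the cache idea explicit, here is a sketch (my addition, not from the answer) that generalizes method2 to an arbitrary chunk size over the flattened array; the default chunk size of 256 * 1024 elements is an assumption to tune, not a recommendation:

import numpy as np

def method2_chunked(arr, low, high, chunk=256 * 1024):
    # Hypothetical generalization of method2: process flat, cache-sized
    # chunks. reshape(-1) is a view here, assuming arr is C-contiguous.
    flat = arr.reshape(-1)
    out = np.empty(flat.size, dtype=bool)
    for start in range(0, flat.size, chunk):
        a = flat[start:start + chunk]
        o = out[start:start + chunk]
        np.greater_equal(a, high, o)   # o = a >= high
        np.logical_or(o, a < low, o)   # o |= a < low
    return out.reshape(arr.shape)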

To get good performance here, it is recommended to compile the code, which can be done with Cython or Numba. Since the Cython version takes more work (type annotations and a separate compiler), I will show how to do this with Numba.

import numpy as np
import numba as nb

@nb.njit(fastmath=True, cache=True)
def method3(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                # single pass over the data; >= high matches the
                # np.greater_equal used in method1 and method2
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] >= high
    return out
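
A usage note (my addition): the first call triggers JIT compilation, so warm the function up once before timing it, for example in IPython:

method3(arr, 0.2, 0.8)            # first call compiles; exclude it from timing
%timeit method3(arr, 0.2, 0.8)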

Using arr = np.random.random((10, 1000, 1000)), this outperforms your method1 by a factor of two and your method2 by 50 percent on my PC (Core i7-4771, Python 3.5, Windows).

This is only a simple example. In more complex code, where you can also make use of SIMD and parallel processing (which is very easy to use here; see the sketch below), the performance gain can be much bigger. In non-compiled code, vectorization is usually, but not always (as shown), the best you can do. However, it leads to poor cache behavior, which can cause suboptimal performance if the chunks of data you are accessing don't fit at least in the L3 cache; for some problems there is also a performance hit if the data can't fit in the much smaller L1 or L2 cache. Another advantage of Numba is automatic inlining of small njit-ed functions into an njit-ed function that calls them.
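
As a sketch of the parallel variant alluded to above (my addition, not from the original answer; the function name is hypothetical), Numba splits a loop written with nb.prange across threads when parallel=True is set:

import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def method3_parallel(arr, low, high):
    # Hypothetical parallel version of method3.
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in nb.prange(arr.shape[0]):    # outer loop runs across threads
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] >= high
    return out

With only 10 independent slices along the outer axis, the thread count that can be used effectively here is limited, so the benefit grows with larger first dimensions.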

Answered by max9111 on Nov 02 '22