Why python broadcasting in the example below is slower than a simple loop?

Tags:

numpy

I have an array of vectors and compute the norm of their diffs vs the first one. When using python broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?

import numpy as np

def norm_loop(M, v):
  n = M.shape[0]
  d = np.zeros(n)
  for i in range(n):
    d[i] = np.sum((M[i] - v)**2)
  return d

def norm_bcast(M, v):
  n = M.shape[0]
  d = np.zeros(n)
  d = np.sum((M - v)**2, axis=1)
  return d

M = np.random.random_sample((1000, 10000))
v = M[0]

%timeit norm_loop(M, v) 
25.9 ms

%timeit norm_bcast(M, v)
38.5 ms

I have Python 3.6.3 and Numpy 1.14.2

To run the example in google colab: https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing

857

asked Apr 03 '18 14:04

2 Answers

Memory access.

First off, the broadcast version can be simplified to

def norm_bcast(M, v):
     return np.sum((M - v)**2, axis=1)

This still runs slightly slower than the looped version. Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?

As I said, it comes down to memory access.

In the broadcast version every element of M is subtracted from v. By the time the last row of M is processed the results of processing the first row have been evicted from cache, so for the second step these differences are again loaded into cache memory and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared on each step to acomodate all of the data.

In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.

Lastly, it is possible to avoid this with some array operations by using einsum. This function allows mixing matrix multiplications and summations. First, I'll point out it's a function that has rather unintuitive syntax compared to the rest of numpy, and potential improvements often aren't worth the extra effort to understand it. The answer may also be slightly different due to rounding errors. In this case it can be written as

def norm_einsum(M, v):
    tmp = M-v
    return np.einsum('ij,ij->i', tmp, tmp)

This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation. This gives a slight improvement:

%timeit norm_bcast(M, v)
30.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit norm_loop(M, v)
25.1 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit norm_einsum(M, v)
21.7 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

186

answered Oct 19 '22 04:10

user2699

Squeezing out maximum performance

On vectorized operations you clearly have a bad cache behaviour. But the calculation itsef is also slow due to not exploiting modern SIMD instructions (AVX2,FMA). Fortunately it isn't really complicated to overcome this issues.

Example

import numpy as np
import numba as nb
@nb.njit(fastmath=True,parallel=True)
def norm_loop_improved(M, v):
  n = M.shape[0]
  d = np.empty(n,dtype=M.dtype)

  #enables SIMD-vectorization 
  #if the arrays are not aligned
  M=np.ascontiguousarray(M)
  v=np.ascontiguousarray(v)

  for i in nb.prange(n):
    dT=0.
    for j in range(v.shape[0]):
      dT+=(M[i,j]-v[j])*(M[i,j]-v[j])
    d[i]=dT
  return d

Performance

M = np.random.random_sample((1000, 1000))
norm_loop_improved: 0.11 ms**, 0.28ms
norm_loop: 6.56 ms 
norm_einsum: 3.84 ms

M = np.random.random_sample((10000, 10000))
norm_loop_improved:34 ms
norm_loop: 223 ms
norm_einsum: 379 ms

** Be careful when measuring performance

The first result (0.11ms) comes from calling the function repeadedly with the same data. This would need 77 GB/s reading-throuput from RAM, which is far more than my DDR3 Dualchannel-RAM is capable of. Due to the fact that calling a function with the same input parameters successively isn't realistic at all, we have to modify the measurement.

To avoid this issue we have to call the same function with different data at least twice (8MB L3-cache, 8MB data) and than divide the result by two to clear all the caches.

The relative performance of this methods also differ on array sizes (have a look at the einsum results).

answered Oct 19 '22 05:10

max9111

Related questions
                            
                                How to create and then write on a xlsx file
                            
                                How to make predictions with tf.estimator.Estimator from checkpoint?
                            
                                Intersect multiple 2D np arrays for determining zones
                            
                                How to deal with DAG lib in airflow?
                            
                                Shutdown dask workers from client or scheduler
                            
                                Cmd Windows "python" command works, but "python3" doesn't although my python version is 3.6
                            
                                React Flask Heroku App is not displaying frontend
                            
                                Modifying class __dict__ when shadowed by a property
                            
                                How to download this video using Selenium
                            
                                How do you recursively get all submodules in a python package?
                            
                                In Python 3.6, why does a negative number to the power of a fraction return nan when in a numpy array?
                            
                                Slice pandas dataframe json column into columns
                            
                                Is there a way to get the error in fitting parameters from scipy.stats.norm.fit?
                            
                                Save jaw only as image with dlib facial landmark detection and the rest to be transparent
                            
                                Django - Form across multiple views with progress saving
                            
                                How does the `my_input_fn` in the getting started with TensorFlow allow enumeration over the data?
                            
                                Google colaboratory run code locally
                            
                                Change training dataset every N epochs in Keras
                            
                                Activating python virtual environment does not switch to local versions of pip and python commands
                            
                                Is it possible to run multiple instances of one selenium test at once?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why python broadcasting in the example below is slower than a simple loop?

Tags:

python

numpy

Vladi Raikhlin

People also ask

2 Answers

user2699

Squeezing out maximum performance

max9111

Recent Activity

Donate For Us