Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to vectorize recursive calculation of a NumPy array where each element depends on the previous one?

T(i) = Tm(i) + (T(i-1)-Tm(i))**(-tau(i)) 

Tm and tau are NumPy vectors of the same length that have been previously calculated, and the desire is to create a new vector T. The i is included only to indicate the element index for what is desired.

Is a for loop necessary for this case?

like image 644
tnt Avatar asked Dec 10 '10 10:12

tnt


People also ask

What does vectorize do in NumPy?

The concept of vectorized operations on NumPy allows the use of more optimal and pre-compiled functions and mathematical operations on NumPy array objects and data sequences. The Output and Operations will speed up when compared to simple non-vectorized operations. Example 1: Using vectorized sum method on NumPy array.

Why is NumPy vectorize fast?

Numpy arrays tout a performance (speed) feature called vectorization. The generally held impression among the scientific computing community is that vectorization is fast because it replaces the loop (running each item one by one) with something else that runs the operation on several items in parallel.

Are NumPy arrays passed by reference or value?

Python does not pass-by-reference nor pass-by-value. Some call it call-by-object.

What is vectorize in NumPy?

The NumPy vectorize accepts the hierarchical order of the numpy array or different objects as an input to the system and generates a single numpy array or multiple numpy arrays. After successive multiple arrays of input, the NumPy vectorize evaluates pyfunc like a python map function, and it helps to define numpy rules.

What is the best way to accelerate recursive functions in NumPy?

I performed some benchmarks and in 2019 using Numbais the first option people should try to accelerate recursive functions in Numpy (adjusted proposal of Aronstef). Numba is already preinstalled in the Anaconda package and has one of the fastest times (about 20 times faster than any Python).

Is it possible to vectorize operations on vectors in Numba?

1 @Ziofil New Numba broke the code. I adjusted the code so that numba can compile in nopython mode. Now it is even faster. – keiv.fly Dec 5 '19 at 15:15 Add a comment | 5 Update: 21-10-2018I have corrected my answer based on comments. It is possible to vectorize operations on vectors as long as the calculation is not recursive.

Is it possible to do recursion on RHS in NumPy?

but it doesn't: you can't actually do recursion in numpy this way (since numpy calculates the whole RHS and then assigns it to the LHS). So unless you can come up with a non-recursive version of this formula, you're stuck with an explicit loop:


2 Answers

You might think this would work:

import numpy as np n = len(Tm) t = np.empty(n)  t[0] = 0  # or whatever the initial condition is  t[1:] = Tm[1:] + (t[0:n-1] - Tm[1:])**(-tau[1:]) 

but it doesn't: you can't actually do recursion in numpy this way (since numpy calculates the whole RHS and then assigns it to the LHS).

So unless you can come up with a non-recursive version of this formula, you're stuck with an explicit loop:

tt = np.empty(n) tt[0] = 0. for i in range(1,n):     tt[i] = Tm[i] + (tt[i-1] - Tm[i])**(-tau[i]) 
like image 153
Andrew Jaffe Avatar answered Sep 18 '22 19:09

Andrew Jaffe


2019 Update. The Numba code broke with the new version of numba. Changing dtype="float32" to dtype=np.float32 solved it.

I performed some benchmarks and in 2019 using Numba is the first option people should try to accelerate recursive functions in Numpy (adjusted proposal of Aronstef). Numba is already preinstalled in the Anaconda package and has one of the fastest times (about 20 times faster than any Python). In 2019 Python supports @numba annotations without additional steps (at least versions 3.6, 3.7, and 3.8). Here are three benchmarks: performed on 2019-12-05, 2018-10-20 and 2016-05-18.

And, as mentioned by Jaffe, in 2018 it is still not possible to vectorize recursive functions. I checked the vectorization by Aronstef and it does NOT work.

Benchmarks sorted by execution time:

------------------------------------------- |Variant        |2019-12 |2018-10 |2016-05 | ------------------------------------------- |Pure C         |   na   |   na   | 2.75 ms| |C extension    |   na   |   na   | 6.22 ms| |Cython float32 | 0.55 ms| 1.01 ms|   na   | |Cython float64 | 0.54 ms| 1.05 ms| 6.26 ms| |Fortran f2py   | 4.65 ms|   na   | 6.78 ms| |Numba float32  |73.0  ms| 2.81 ms|   na   | |(Aronstef)     |        |        |        | |Numba float32v2| 1.82 ms| 2.81 ms|   na   | |Numba float64  |78.9  ms| 5.28 ms|   na   | |Numba float64v2| 4.49 ms| 5.28 ms|   na   | |Append to list |73.3  ms|48.2  ms|91.0  ms| |Using a.item() |36.9  ms|58.3  ms|74.4  ms| |np.fromiter()  |60.8  ms|60.0  ms|78.1  ms| |Loop over Numpy|71.3  ms|71.9  ms|87.9  ms| |(Jaffe)        |        |        |        | |Loop over Numpy|74.6  ms|74.4  ms|   na   | |(Aronstef)     |        |        |        | ------------------------------------------- 

Corresponding code is provided at the end of the answer.

It seems that with time Numba and Cython times get better. Now both of them are faster than Fortran f2py. Cython is faster 8.6 times now and Numba 32bit is faster 2.5 times. Fortran was very hard to debug and compile in 2016. So now there is no reason to use Fortran at all.

I did not check Pure C and C extension in 2019 and 2018, because it is not easy to compile them in Jupyter notebooks.

I had the following setup in 2019:

Processor: Intel i5-9600K 3.70GHz Versions: Python:  3.8.0 Numba:  0.46.0 Cython: 0.29.14 Numpy:  1.17.4 

I had the following setup in 2018:

Processor: Intel i7-7500U 2.7GHz Versions: Python:  3.7.0 Numba:  0.39.0 Cython: 0.28.5 Numpy:  1.15.1 

The recommended Numba code using float32 (adjusted Aronstef):

@numba.jit("float32[:](float32[:], float32[:])", nopython=True, nogil=True) def calc_py_jit32v2(Tm_, tau_):     tt = np.empty(len(Tm_),dtype=np.float32)     tt[0] = Tm_[0]     for i in range(1, len(Tm_)):         tt[i] = Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i])     return tt[1:] 

All the other code:

Data creation (like Aronstef + Mike T comment):

np.random.seed(0) n = 100000 Tm = np.cumsum(np.random.uniform(0.1, 1, size=n).astype('float64')) tau = np.random.uniform(-1, 0, size=n).astype('float64') ar = np.column_stack([Tm,tau]) Tm32 = Tm.astype('float32') tau32 = tau.astype('float32') Tm_l = list(Tm) tau_l = list(tau) 

The code in 2016 was slightly different as I used abs() function to prevent nans and not the variant of Mike T. In 2018 the function is exactly the same as OP (Original Poster) wrote.

Cython float32 using Jupyter %% magic. The function can be used directly in Python. Cython needs a C++ compiler in which Python was compiled. Installation of the right version of Visual C++ compiler (for Windows) could be problematic:

%%cython  import cython import numpy as np cimport numpy as np from numpy cimport ndarray  cdef extern from "math.h":     np.float32_t exp(np.float32_t m)  @cython.boundscheck(False) @cython.wraparound(False) @cython.infer_types(True) @cython.initializedcheck(False)  def cy_loop32(np.float32_t[:] Tm,np.float32_t[:] tau,int alen):     cdef np.float32_t[:] T=np.empty(alen, dtype=np.float32)     cdef int i     T[0]=0.0     for i in range(1,alen):         T[i] = Tm[i] + (T[i-1] - Tm[i])**(-tau[i])     return T 

Cython float64 using Jupyter %% magic. The function can be used directly in Python:

%%cython  cdef extern from "math.h":     double exp(double m) import cython import numpy as np cimport numpy as np from numpy cimport ndarray  @cython.boundscheck(False) @cython.wraparound(False) @cython.infer_types(True) @cython.initializedcheck(False)  def cy_loop(double[:] Tm,double[:] tau,int alen):     cdef double[:] T=np.empty(alen)     cdef int i     T[0]=0.0     for i in range(1,alen):         T[i] = Tm[i] + (T[i-1] - Tm[i])**(-tau[i])     return T 

Numba float64:

@numba.jit("float64[:](float64[:], float64[:])", nopython=False, nogil=True) def calc_py_jitv2(Tm_, tau_):     tt = np.empty(len(Tm_),dtype=np.float64)     tt[0] = Tm_[0]     for i in range(1, len(Tm_)):         tt[i] = Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i])     return tt[1:] 

Append to list. Fastest non-compiled solution:

def rec_py_loop(Tm,tau,alen):      T = [Tm[0]]      for i in range(1,alen):         T.append(Tm[i] - (T[i-1] + Tm[i])**(-tau[i]))      return np.array(T) 

Using a.item():

def rec_numpy_loop_item(Tm_,tau_):     n_ = len(Tm_)     tt=np.empty(n_)     Ti=tt.item     Tis=tt.itemset     Tmi=Tm_.item     taui=tau_.item     Tis(0,Tm_[0])     for i in range(1,n_):         Tis(i,Tmi(i) - (Ti(i-1) + Tmi(i))**(-taui(i)))     return tt[1:] 

np.fromiter():

def it(Tm,tau):     T=Tm[0]     i=0     while True:         yield T         i+=1         T=Tm[i] - (T + Tm[i])**(-tau[i])  def rec_numpy_iter(Tm,tau,alen):     return np.fromiter(it(Tm,tau), np.float64, alen)[1:] 

Loop over Numpy (based on the Jaffe's idea):

def rec_numpy_loop(Tm,tau,alen):     tt=np.empty(alen)     tt[0]=Tm[0]     for i in range(1,alen):         tt[i] = Tm[i] - (tt[i-1] + Tm[i])**(-tau[i])     return tt[1:] 

Loop over Numpy (Aronstef's code). On my computer float64 is the default type for np.empty.

def calc_py(Tm_, tau_):     tt = np.empty(len(Tm_),dtype="float64")     tt[0] = Tm_[0]     for i in range(1, len(Tm_)):         tt[i] = (Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i]))     return tt[1:] 

Pure C without using Python at all. Version from year 2016 (with fabs() function):

#include <stdio.h> #include <math.h> #include <stdlib.h> #include <windows.h> #include <sys\timeb.h>   double randn() {     double u = rand();     if (u > 0.5) {         return sqrt(-1.57079632679*log(1.0 - pow(2.0 * u - 1, 2)));     }     else {         return -sqrt(-1.57079632679*log(1.0 - pow(1 - 2.0 * u,2)));     } } void rec_pure_c(double *Tm, double *tau, int alen, double *T) {      for (int i = 1; i < alen; i++)     {         T[i] = Tm[i] + pow(fabs(T[i - 1] - Tm[i]), (-tau[i]));     } }  int main() {     int N = 100000;     double *Tm= calloc(N, sizeof *Tm);     double *tau = calloc(N, sizeof *tau);     double *T = calloc(N, sizeof *T);     double time = 0;     double sumtime = 0;     for (int i = 0; i < N; i++)     {         Tm[i] = randn();         tau[i] = randn();     }      LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;     LARGE_INTEGER Frequency;     for (int j = 0; j < 1000; j++)     {         for (int i = 0; i < 3; i++)         {             QueryPerformanceFrequency(&Frequency);             QueryPerformanceCounter(&StartingTime);              rec_pure_c(Tm, tau, N, T);              QueryPerformanceCounter(&EndingTime);             ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;             ElapsedMicroseconds.QuadPart *= 1000000;             ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;             if (i == 0)                 time = (double)ElapsedMicroseconds.QuadPart / 1000;             else {                 if (time > (double)ElapsedMicroseconds.QuadPart / 1000)                     time = (double)ElapsedMicroseconds.QuadPart / 1000;             }         }         sumtime += time;     }     printf("1000 loops,best of 3: %.3f ms per loop\n",sumtime/1000);      free(Tm);     free(tau);     free(T); } 

Fortran f2py. Function can be used from Python. Version from year 2016 (with abs() function):

subroutine rec_fortran(tm,tau,alen,result)     integer*8, intent(in) :: alen     real*8, dimension(alen), intent(in) :: tm     real*8, dimension(alen), intent(in) :: tau     real*8, dimension(alen) :: res     real*8, dimension(alen), intent(out) :: result      res(1)=0     do i=2,alen         res(i) = tm(i) + (abs(res(i-1) - tm(i)))**(-tau(i))     end do     result=res     end subroutine rec_fortran 
like image 36
keiv.fly Avatar answered Sep 20 '22 19:09

keiv.fly