
What happens in numpy's log function? Are there ways to improve the performance?

I have a computation project that makes heavy use of the log function (for integers), with billions of calls. I find that the performance of numpy's log is surprisingly slow.

The following code takes 15 to 17 secs to complete:

import numpy as np
import time

t1 = time.time()
for i in range(1,10000000): 
    np.log(i)
t2 = time.time()
print(t2 - t1)

However, the math.log function takes much less time: 3 to 4 seconds.

import math
import time

t1 = time.time()
for i in range(1,10000000): 
    math.log(i)
t2 = time.time()
print(t2 - t1)

I also tested MATLAB and C#, which take about 2 secs and just 0.3 secs respectively.

MATLAB

tic
for i = 1:10000000
    log(i);
end
toc

C#

var t = DateTime.Now;
for (int i = 1; i < 10000000; ++i)
     Math.Log(i);
Console.WriteLine((DateTime.Now - t).TotalSeconds);

Is there any way in Python to improve the performance of the log function?

Tony asked May 10 '17 12:05


2 Answers

NumPy's functions are designed for arrays, not for single values or scalars. They have a rather high overhead because they do several checks and conversions. These provide a speed benefit for big arrays but are costly for scalars.

The conversion is really obvious if you check the type of the return:

>>> import numpy as np
>>> import math

>>> type(np.log(2.))
numpy.float64
>>> type(math.log(2.))
float

The math module, on the other hand, is optimized for scalars, so it doesn't need that many checks (I think there are only two: convert to float and check if it's <= 0). That is why math.log is faster than numpy.log for scalars.
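To make the scalar behaviour concrete, here is a small sketch of how the two functions differ on single values. It only illustrates the visible differences (return type and handling of zero); it is not a description of NumPy's internals.

```python
import math

import numpy as np

# math.log works on plain Python numbers and returns a built-in float.
scalar_result = math.log(2.0)

# np.log wraps the result in a NumPy scalar type, even for a single value.
numpy_result = np.log(2.0)

# math.log rejects non-positive input with an exception...
try:
    math.log(0)
    math_raised = False
except ValueError:
    math_raised = True

# ...whereas np.log returns -inf (normally with a RuntimeWarning,
# which is silenced here so the example prints cleanly).
with np.errstate(divide="ignore"):
    numpy_zero = np.log(0.0)

print(type(scalar_result).__name__, type(numpy_result).__name__)
print(math_raised, numpy_zero)
```

The different error handling is worth keeping in mind when switching between the two: code that relies on a ValueError from math.log will silently get -inf or nan from np.log.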

But if you operate on arrays and want to take the logarithm of all elements in the array NumPy can be much faster. On my computer if I time the execution of np.log on an array compared to math.log of each item in a list then the timing looks different:

arr = np.arange(1, 10000000)
%timeit np.log(arr)
201 ms ± 959 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

lst = arr.tolist()
%timeit [math.log(item) for item in lst]
8.77 s ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So np.log is much faster on arrays (more than 40 times faster in this case)! And you don't need to write any loop yourself. As a ufunc, np.log also works correctly on multidimensional NumPy arrays and can perform the operation in place.
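A minimal sketch of the multidimensional and in-place use mentioned above. The array names are made up for illustration, and the in-place variant assumes a float array, because the results could not be stored in an integer buffer.

```python
import numpy as np

# A 2-D float array; np.log is applied element-wise regardless of shape.
grid = np.arange(1, 7, dtype=np.float64).reshape(2, 3)
logged = np.log(grid)

# In-place variant: write the results back into the input buffer
# through the ufunc's `out` argument instead of allocating a new array.
buffer = np.arange(1, 7, dtype=np.float64)
np.log(buffer, out=buffer)

print(logged.shape)
print(buffer)
```

Writing into an existing buffer avoids one allocation per call, which can matter when the same large array is processed repeatedly.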

As a rule of thumb: if you have an array with thousands of items, NumPy will be faster; if you have scalars or only a few dozen items, math plus an explicit loop will be faster.
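If you want to see where the crossover lies on your own machine, something like the following sketch could be used. The helper name `compare` and the chosen sizes are just assumptions for illustration, and the actual numbers will depend on hardware and library versions.

```python
import math
import timeit

import numpy as np


def compare(n, repeats=3, number=10):
    """Return (math_loop_seconds, numpy_seconds) for n input values."""
    values = list(range(1, n + 1))
    array = np.arange(1, n + 1)
    # Best of `repeats` runs, each executing the statement `number` times.
    loop_time = min(timeit.repeat(lambda: [math.log(v) for v in values],
                                  repeat=repeats, number=number))
    numpy_time = min(timeit.repeat(lambda: np.log(array),
                                   repeat=repeats, number=number))
    return loop_time, numpy_time


for n in (1, 10, 100, 1000, 10000):
    loop_time, numpy_time = compare(n)
    print(n, loop_time, numpy_time)
```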


Also don't use time for timing code. There are dedicated modules that give more accurate results, better statistics and disable garbage collection during the timings:

  • timeit (built-in)
  • perf (extension package)

I generally use %timeit, which is a convenient wrapper around the timeit functionality, but it requires IPython. It conveniently displays the mean and standard deviation of the result and does some (mostly) useful statistics, like displaying the "best of 7" or "best of 3" result.
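Outside IPython, the built-in timeit module can be used directly. This is a minimal sketch; the repeat and number values are arbitrary choices, not recommendations.

```python
import timeit

# Time 100000 calls of math.log and repeat the measurement 5 times.
# timeit disables garbage collection while the statement runs.
timings = timeit.repeat(
    stmt="math.log(12345)",
    setup="import math",
    repeat=5,
    number=100000,
)

# The minimum is usually the least noisy estimate of the per-call cost.
best_per_call = min(timings) / 100000
print("best: %.1f ns per call" % (best_per_call * 1e9))
```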


I recently analyzed the runtime behaviour of NumPy functions for another question; some of the points also apply here.

MSeifert answered Oct 14 '22 17:10


Interestingly, the issue of the Python standard library being slow doesn't replicate on my machine (Windows 10, running Python 2.7.11 and numpy 1.11.0).

>>> import math, time
>>> import numpy as np
>>> t1 = time.time()
>>> for i in range(1,10000000):
...     _ = np.log(i)
...
>>> t2 = time.time()
>>> print(t2 - t1)
9.86099982262
>>> t1 = time.time()
>>> for i in range(1,10000000):
...     _ = math.log(i)
...
>>> t2 = time.time()
>>> print(t2 - t1)
2.48300004005

That is similar to your performance in MATLAB. @Nils raises a good point though: numpy is designed to be efficient on arrays.

>>> t1 = time.time()
>>> for i in range(1,1000):
...     _ = np.log(np.arange(1,10000))
...
>>> t2 = time.time()
>>> print(t2 - t1)
0.146000146866
>>> t1 = time.time()
>>> for i in range(1,1000):
...     _ = [math.log(i) for i in range(1,10000)]
...
>>> t2 = time.time()
>>> print(t2 - t1)
2.3220000267

If you can vectorize your input, numpy will outperform the standard math library and even come close to C#.
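For billions of values, a single array may not fit in memory, so one option is to vectorize in chunks. The sketch below assumes, purely for illustration, that the logs are summed; the function name `log_sum_in_chunks` and the chunk size are hypothetical and should be adapted to whatever the real computation does with the results.

```python
import numpy as np


def log_sum_in_chunks(start, stop, chunk_size=1000000):
    """Sum log(i) for i in [start, stop) without building one huge array."""
    total = 0.0
    for lo in range(start, stop, chunk_size):
        hi = min(lo + chunk_size, stop)
        # One vectorized call handles a whole chunk of integers.
        total += float(np.log(np.arange(lo, hi)).sum())
    return total


result = log_sum_in_chunks(1, 10001, chunk_size=3000)
print(result)
```

The chunk size trades memory for call overhead: larger chunks mean fewer Python-level iterations, but more temporary memory per call.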

Scott Gigante answered Oct 14 '22 18:10