I finally found a performance bottleneck in my code, but I'm confused as to the reason. To work around it I changed all my calls of `numpy.zeros_like` to use `numpy.zeros` instead. But why is `zeros_like` so much slower?
For example (note the `e-05` on the `zeros` call):

```
>>> timeit.timeit('np.zeros((12488, 7588, 3), np.uint8)', 'import numpy as np', number=10)
5.2928924560546875e-05
>>> timeit.timeit('np.zeros_like(x)', 'import numpy as np; x = np.zeros((12488, 7588, 3), np.uint8)', number=10)
1.4402990341186523
```
But then, strangely, writing to an array created with `zeros` is noticeably slower than writing to an array created with `zeros_like`:

```
>>> timeit.timeit('x[100:-100, 100:-100] = 1', 'import numpy as np; x = np.zeros((12488, 7588, 3), np.uint8)', number=10)
0.4310588836669922
>>> timeit.timeit('x[100:-100, 100:-100] = 1', 'import numpy as np; x = np.zeros_like(np.zeros((12488, 7588, 3), np.uint8))', number=10)
0.33325695991516113
```
My guess is that `zeros` is using some trick and not actually writing zeros to the memory when allocating it; the zeroing happens on the fly when the array is first written to. But that still doesn't explain the massive discrepancy in array creation times.
I'm running Mac OS X Yosemite with the current numpy version:

```
>>> numpy.__version__
'1.9.1'
```
Modern operating systems allocate memory virtually, i.e., memory is given to a process only when it is first used. `zeros` obtains memory from the operating system, so the OS zeroes it when it is first used. `zeros_like`, on the other hand, fills the allocated memory with zeros by itself. Both ways require about the same amount of work; it's just that with `zeros_like` the zeroing is done up front, whereas `zeros` ends up doing it on the fly.

Technically, in C the difference is calling `calloc` vs. `malloc` + `memset`.
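You can make the lazy zeroing visible directly from Python. Here is a minimal sketch (assuming a demand-paged OS; absolute numbers will vary by machine): creating the array with `zeros` is nearly free, the first write pays the page-fault cost, and a second write is cheap because the pages are already resident.

```python
import numpy as np
import time

shape = (12488, 7588, 3)  # ~271 MiB of uint8

t0 = time.perf_counter()
x = np.zeros(shape, np.uint8)  # pages reserved, not yet faulted in
t1 = time.perf_counter()
x.fill(1)                      # first touch: the OS must actually supply the pages
t2 = time.perf_counter()
x.fill(1)                      # second touch: pages already resident
t3 = time.perf_counter()

print("zeros():     %.6f s" % (t1 - t0))
print("first fill:  %.6f s" % (t2 - t1))
print("second fill: %.6f s" % (t3 - t2))
```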
My timings in IPython (using its simpler `timeit` interface) are:

```
In [57]: timeit np.zeros_like(x)
1 loops, best of 3: 420 ms per loop

In [58]: timeit np.zeros((12488, 7588, 3), np.uint8)
100000 loops, best of 3: 15.1 µs per loop
```
When I look at the code with IPython (`np.zeros_like??`) I see:

```
res = empty_like(a, dtype=dtype, order=order, subok=subok)
multiarray.copyto(res, 0, casting='unsafe')
```

while `np.zeros` is a black box: pure compiled code.
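In other words, `zeros_like` allocates uninitialized memory and then fills it explicitly. Here is a rough user-level rendering of those two lines (the name `zeros_like_py` is mine, not numpy's); it should time close to `np.zeros_like` itself:

```python
import numpy as np

def zeros_like_py(a):
    # What np.zeros_like effectively does, per the source shown above:
    res = np.empty_like(a)               # uninitialized allocation: cheap
    np.copyto(res, 0, casting='unsafe')  # explicit zero-fill: touches every byte
    return res

x = np.zeros((12488, 7588, 3), np.uint8)
z = zeros_like_py(x)  # pays the same up-front zeroing cost as np.zeros_like(x)
```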
Timings for `empty` are:

```
In [63]: timeit np.empty_like(x)
100000 loops, best of 3: 13.6 µs per loop

In [64]: timeit np.empty((12488, 7588, 3), np.uint8)
100000 loops, best of 3: 14.9 µs per loop
```
So the extra time in `zeros_like` is spent in that `copyto`.
In my tests, the difference in assignment times (`x[...] = 1`) is negligible.
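Presumably the gap the questioner saw comes from first-touch page faults on the `zeros` array; once both arrays have been written to once, assignment times should converge. A hedged check:

```python
import numpy as np
import timeit

x = np.zeros((12488, 7588, 3), np.uint8)
z = np.zeros_like(x)
x[...] = 0  # fault in x's pages so both arrays start fully resident
z[...] = 0

def write(a):
    a[100:-100, 100:-100] = 1

for name, arr in (('zeros', x), ('zeros_like', z)):
    print(name, timeit.timeit(lambda: write(arr), number=10))
```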
My guess is that `zeros`, `ones`, and `empty` are all early compiled creations. `empty_like` was added as a convenience, just drawing shape and type info from its input. `zeros_like` was written with more of an eye toward easy programming maintenance (reusing `empty_like`) than for speed.
`np.ones` and `np.full` also use the `np.empty ... copyto` sequence, and show similar timings.
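If that's right, `ones` and `full` should land near `zeros_like`'s creation time rather than `zeros`'s. A quick comparison (timings will vary by machine):

```python
import timeit

setup = 'import numpy as np'
for stmt in ('np.zeros((12488, 7588, 3), np.uint8)',   # lazy: calloc-style
             'np.ones((12488, 7588, 3), np.uint8)',    # empty + copyto
             'np.full((12488, 7588, 3), 0, np.uint8)', # empty + copyto
             'np.zeros_like(np.empty((12488, 7588, 3), np.uint8))'):
    print(stmt, timeit.timeit(stmt, setup, number=10))
```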
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/array_assign_scalar.c appears to be the file that copies a scalar (such as `0`) to an array. I don't see a use of `memset` there.
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c has calls to `malloc` and `calloc`.
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c is the source for `zeros` and `empty`. Both call `PyArray_NewFromDescr_int`, but one ends up using `npy_alloc_cache_zero` and the other `npy_alloc_cache`.
`npy_alloc_cache` in `alloc.c` calls `alloc`. `npy_alloc_cache_zero` calls `npy_alloc_cache` followed by a `memset`. The code in `alloc.c` is further complicated by a THREAD option.
More on the `calloc` vs. `malloc` + `memset` difference at: Why malloc+memset is slower than calloc?

But with caching and garbage collection, I wonder whether the `calloc`/`memset` distinction applies.
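One way to see the `calloc` vs. `malloc`+`memset` gap without leaving Python is through `ctypes`. This is a POSIX-only sketch (it assumes `ctypes.CDLL(None)` exposes libc, as it does on Linux and OS X); for an allocation this size, `calloc` can hand back lazily zeroed pages while `memset` has to touch every one of them:

```python
import ctypes
import time

libc = ctypes.CDLL(None)  # POSIX: resolve malloc/calloc/free from libc
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

n = 12488 * 7588 * 3  # same byte count as the arrays above

t0 = time.perf_counter()
p = libc.calloc(n, 1)    # kernel may return lazily zeroed pages
t1 = time.perf_counter()
q = libc.malloc(n)
ctypes.memset(q, 0, n)   # explicit zeroing faults in every page now
t2 = time.perf_counter()

print("calloc:        %.6f s" % (t1 - t0))
print("malloc+memset: %.6f s" % (t2 - t1))
libc.free(p)
libc.free(q)
```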
This simple test with the `memory_profiler` package supports the claim that `zeros` and `empty` allocate memory on the fly, while `zeros_like` allocates everything up front:
```
N = (1000, 1000)
M = (slice(None, 500, None), slice(500, None, None))

Line #    Mem usage    Increment   Line Contents
================================================
     2   17.699 MiB    0.000 MiB   @profile
     3                             def test1(N, M):
     4   17.699 MiB    0.000 MiB       print(N, M)
     5   17.699 MiB    0.000 MiB       x = np.zeros(N)       # no memory jump
     6   17.699 MiB    0.000 MiB       y = np.empty(N)
     7   25.230 MiB    7.531 MiB       z = np.zeros_like(x)  # initial jump
     8   29.098 MiB    3.867 MiB       x[M] = 1              # jump on usage
     9   32.965 MiB    3.867 MiB       y[M] = 1
    10   32.965 MiB    0.000 MiB       z[M] = 1
    11   32.965 MiB    0.000 MiB       return x,y,z
```
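For reference, here is a self-contained version of the profiled script (reconstructed from the output above; the filename is my choice):

```python
# Save as profile_zeros.py and run:
#     python -m memory_profiler profile_zeros.py
import numpy as np
from memory_profiler import profile

@profile
def test1(N, M):
    print(N, M)
    x = np.zeros(N)       # no memory jump
    y = np.empty(N)
    z = np.zeros_like(x)  # initial jump
    x[M] = 1              # jump on usage
    y[M] = 1
    z[M] = 1
    return x, y, z

if __name__ == '__main__':
    N = (1000, 1000)
    M = (slice(None, 500, None), slice(500, None, None))
    test1(N, M)
```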