Reasons for differences in memory consumption and performance of np.zeros and np.full

Tags: performance, python, numpy

When measuring memory consumption of np.zeros:

import psutil
import numpy as np

process = psutil.Process()
N=10**8
start_rss = process.memory_info().rss
a = np.zeros(N, dtype=np.float64)
print("memory for a", process.memory_info().rss - start_rss)

the result is an unexpected 8192 bytes, i.e. almost 0, while 1e8 doubles would need 8e8 bytes.

When replacing np.zeros(N, dtype=np.float64) with np.full(N, 0.0, dtype=np.float64), the memory needed for a is 800002048 bytes.

There are similar discrepancies in running times:

import numpy as np
N=10**8
%timeit np.zeros(N, dtype=np.float64)
# 11.8 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.full(N, 0.0, dtype=np.float64)
# 419 ms ± 7.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is, np.zeros is up to 40 times faster for large sizes.

I'm not sure whether these differences occur on all architectures and operating systems, but I've observed them at least on x86-64 Windows and Linux.

Which differences between np.zeros and np.full can explain different memory consumption and different running times?


asked Mar 11 '20 16:03

ead

2 Answers

I don't trust psutil for these memory benchmarks, and RSS (Resident Set Size) may not be the right metric in the first place.

Using the stdlib tracemalloc module you get correct-looking numbers for memory allocation: the delta should be approximately 800000000 bytes for this N and float64 dtype:

>>> import numpy as np
>>> import tracemalloc
>>> N = 10**8
>>> tracemalloc.start()
>>> tracemalloc.get_traced_memory()  # current, peak
(159008, 1874350)
>>> a = np.zeros(N, dtype=np.float64)
>>> tracemalloc.get_traced_memory()
(800336637, 802014880)
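
For a side-by-side check, here is a minimal sketch that wraps the same tracemalloc calls in a helper and applies it to both np.zeros and np.full. The helper name traced_delta and the reduced N are my own choices for illustration, not anything from NumPy, and the exact byte counts will vary slightly with the interpreter and NumPy version.

```python
import tracemalloc

import numpy as np


def traced_delta(make_array):
    """Return the bytes tracemalloc attributes to building one array.

    make_array is any zero-argument callable returning an array
    (a hypothetical helper interface, chosen for this sketch).
    """
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        arr = make_array()  # keep a reference so the buffer stays alive
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


N = 10**6  # smaller than the question's 10**8 so the sketch runs quickly
zeros_delta = traced_delta(lambda: np.zeros(N, dtype=np.float64))
full_delta = traced_delta(lambda: np.full(N, 0.0, dtype=np.float64))

print("expected payload:", 8 * N, "bytes")
print("np.zeros delta:  ", zeros_delta, "bytes")
print("np.full delta:   ", full_delta, "bytes")
```

Both deltas should land close to 8 * N, because tracemalloc counts what was requested from the allocator, not what the OS has actually made resident.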

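For the timing differences between np.full and np.zeros, compare the man pages for malloc and calloc: np.zeros can go to an allocation routine which returns already-zeroed pages. On typical systems the OS backs such pages with physical memory only once they are first written to, which would also explain why RSS barely grows after np.zeros. See PyArray_Zeros, which calls PyArray_NewFromDescr_int passing in 1 for the zeroed argument; that function then has a special case for allocating zeros faster: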

if (zeroed || PyDataType_FLAGCHK(descr, NPY_NEEDS_INIT)) {
    data = npy_alloc_cache_zero(nbytes);
}
else {
    data = npy_alloc_cache(nbytes);
}
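
A rough way to probe the zeroed-pages explanation from Python is to write to one element per page of a freshly created np.zeros array and watch RSS before and after. This is only a sketch under assumptions: it assumes the OS hands out zeroed pages lazily, a 4096-byte page size, a 1-D contiguous array, and that psutil is installed (as in the question). The helper names are hypothetical.

```python
import numpy as np


def touch_every_page(a, page_bytes=4096):
    """Write one element per assumed page of a 1-D array.

    A write forces the OS to back that page with real memory,
    even though the value written is still zero.
    Returns the number of elements written.
    """
    step = max(1, page_bytes // a.itemsize)
    a[::step] = 0
    return len(range(0, a.size, step))


def rss_growth(n=10**8):
    """Return RSS growth (bytes) after allocation and after touching pages."""
    import psutil  # third-party; imported here so the helper above works without it

    proc = psutil.Process()
    start = proc.memory_info().rss
    a = np.zeros(n, dtype=np.float64)
    after_alloc = proc.memory_info().rss - start
    touch_every_page(a)
    after_touch = proc.memory_info().rss - start
    return after_alloc, after_touch
```

Calling rss_growth() and comparing the two numbers shows whether, on your platform, the memory only becomes resident once the pages are written.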

It looks like np.full does not have this fast path. There, the performance will be similar to first allocating an uninitialized array and then filling it in a separate O(n) pass:

a = np.empty(N, dtype=np.float64)
a[:] = np.float64(0.0)

NumPy devs could presumably have added a fast path to np.full for a zero fill value, but why bother adding another way to do the same thing? Users can just use np.zeros in the first place.
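
To reproduce the timing comparison outside IPython, here is a minimal sketch using the stdlib timeit module. The empty_then_fill helper and the reduced N are illustrative choices of mine; absolute numbers depend heavily on the machine, the allocator and the NumPy build, so treat the output as indicative only.

```python
import timeit

import numpy as np

N = 10**6  # reduced from the question's 10**8 to keep the run short


def empty_then_fill(n, value=0.0):
    """Allocate uninitialized memory, then write value into every element."""
    a = np.empty(n, dtype=np.float64)
    a[:] = value
    return a


candidates = {
    "np.zeros": lambda: np.zeros(N, dtype=np.float64),
    "np.full": lambda: np.full(N, 0.0, dtype=np.float64),
    "empty + fill": lambda: empty_then_fill(N),
}

for name, fn in candidates.items():
    # best of 5 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(fn, number=10, repeat=5)) / 10
    print(f"{name:>12}: {best * 1e3:.3f} ms per call")
```

Raising N back to 10**8 mirrors the question's setup more closely, at the cost of a longer run and roughly 800 MB of memory per array.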


answered Sep 20 '22 13:09

wim

The numpy.zeros function is implemented directly in the C layer of the NumPy library, while ones and full both work the same way: they create an array and then copy the desired value into it.

So zeros does not go through any Python-level code, whereas ones and full are Python functions that call into C routines.

Have a look at the source code to see for yourself: https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py
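
As a rough illustration of that allocate-then-copy pattern, here is a simplified stand-in written from my reading of the linked file. It is not the actual NumPy source: the name full_sketch is hypothetical, and the real np.full accepts further parameters that are left out here.

```python
import numpy as np


def full_sketch(shape, fill_value, dtype=None):
    """Simplified stand-in for np.full: allocate, then copy the value in."""
    if dtype is None:
        # infer the dtype from the fill value when none is given
        dtype = np.asarray(fill_value).dtype
    a = np.empty(shape, dtype=dtype)  # uninitialized memory
    np.copyto(a, fill_value, casting="unsafe")  # O(n) pass that writes every element
    return a


print(full_sketch(3, 0.0))
print(full_sketch((2, 2), 7, dtype=np.float64))
```

The point of the sketch is that every element is written explicitly, which is the extra pass that np.zeros can skip.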

answered Sep 18 '22 13:09

Laurent GRENIER
