Memory consumption of NumPy function for standard deviation

I'm currently using the Python bindings of GDAL to work on quite large raster data sets (> 4 GB). Since loading them into memory at once is not a feasible solution for me, I read them in smaller blocks and do the computations piece by piece. To avoid a new allocation for every block read, I'm using the buf_obj argument (here) to read the values into a preallocated NumPy array. At one point I have to compute the mean and standard deviation of the entire raster. Naturally I've used np.std for the computation. However, by profiling the memory consumption of my program I realized that with each invocation of np.std additional memory is allocated and released.
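
For context, the block reads look roughly like this; the path, band index and block size below are placeholders, not my actual values:

import numpy as np
from osgeo import gdal

ds = gdal.Open("raster.tif")                    # placeholder path
band = ds.GetRasterBand(1)                      # placeholder band
block = np.empty((512, 512), dtype=np.float32)  # preallocated once

for yoff in range(0, band.YSize, 512):
    for xoff in range(0, band.XSize, 512):
        # buf_obj makes ReadAsArray write into the preallocated buffer
        # instead of allocating a new array per block (edge handling omitted)
        band.ReadAsArray(xoff, yoff, 512, 512, buf_obj=block)
        # ... compute per-block statistics on `block` ...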

A minimal working example that demonstrates this behavior:

In [1]: import numpy as np
In [2]: a = np.random.rand(int(20e6))  # approx. 150 MiB of memory
In [3]: %memit np.mean(a)
peak memory: 187.30 MiB, increment: 0.48 MiB
In [4]: %memit np.std(a)
peak memory: 340.24 MiB, increment: 152.91 MiB

A search within the NumPy source tree on GitHub revealed that the np.std function internally invokes the _var function from _methods.py (here). At one point _var computes the deviations from the mean and sums them up, so a temporary copy of the input array is created. The function essentially computes the standard deviation as follows:

mu = sum(arr) / len(arr)
tmp = arr - mu
tmp = tmp * tmp
sd = np.sqrt(np.sum(tmp) / len(arr))

While this approach is OK for smaller input arrays, it's definitely not the way to go for larger ones. Since I'm using smaller blocks of memory as mentioned before, this additional copy is not a game-breaking issue from the memory point of view in my program. However, what bugs me is that for each block a new allocation is made and released before reading the next block.

Is there some other function within NumPy or SciPy which uses an approach with constant memory consumption, like the Welford algorithm (Wikipedia), for one-pass computation of the mean and standard deviation?

Another way to go would be to implement a custom version of the _var function with an optional out argument for a preallocated buffer (like NumPy ufuncs have); see the sketch below. With this approach the additional copy would not be eliminated, but at least the memory consumption would be constant and the runtime spent on allocations for each block would be saved.
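
A minimal sketch of what I have in mind; the function name and the scratch argument are my own invention, not NumPy API:

import numpy as np

def std_with_buffer(arr, scratch):
    # hypothetical helper: compute the standard deviation of arr using a
    # caller-provided scratch buffer of the same shape, so that no
    # temporary array has to be allocated per call
    mu = arr.mean()
    np.subtract(arr, mu, out=scratch)           # deviations written into scratch
    np.multiply(scratch, scratch, out=scratch)  # squared in place
    return np.sqrt(scratch.sum() / arr.size)

buf = np.empty(int(20e6))          # allocated once, reused for every block
block = np.random.rand(int(20e6))
print(std_with_buffer(block, buf))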

EDIT: Tested the Cython implementation of the Welford algorithm as suggested by kezzos.

Cython implementation (modified from kezzos):

cimport cython
cimport numpy as np
from libc.math cimport sqrt

@cython.boundscheck(False)
def iterative_approach(np.ndarray[np.float32_t, ndim=1] a):
    cdef long n = 0
    cdef float mean = 0
    cdef float M2 = 0
    cdef long i
    cdef float delta
    cdef float a_min = 10000000   # should be initialized to +Inf for real data
    cdef float a_max = -10000000  # should be initialized to -Inf for real data
    for i in range(len(a)):
        n += 1
        delta = a[i] - mean
        mean += delta / n             # running mean (Welford update)
        M2 += delta * (a[i] - mean)   # running sum of squared deviations
        if a[i] < a_min:
            a_min = a[i]
        if a[i] > a_max:
            a_max = a[i]
    return a_min, a_max, mean, sqrt(M2 / (n - 1))

NumPy implementation (mean and std can possibly be computed in one function):

def vector_approach(a):
    return np.min(a), np.max(a), np.mean(a), np.std(a, ddof=1)
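
As an aside, the mean could be reused for the standard deviation so that both come out of one helper; a quick sketch (not used in the timings below, and it still allocates a temporary for the squared deviations):

def vector_approach_single_mean(a):
    # compute the mean once and reuse it for the sample standard deviation
    mu = a.mean()
    sd = np.sqrt(np.mean((a - mu) ** 2) * len(a) / (len(a) - 1.0))
    return a.min(), a.max(), mu, sd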

Test results using a random data set (times in milliseconds, best of 25):

----------------------------------
| Size |  Iterative |     Vector |
----------------------------------
|  1e2 |    0.00529 |    0.17149 |
|  1e3 |    0.02027 |    0.16856 |
|  1e4 |    0.17850 |    0.23069 |
|  1e5 |    1.93980 |    0.77727 |
|  1e6 |   18.78207 |    8.83245 |
|  1e7 |  180.04069 |  101.14722 |
|  1e8 | 1789.60228 | 1086.66737 |
----------------------------------

It seems like the iterative Cython approach is faster for smaller data sets, while the NumPy vector approach (possibly SIMD accelerated) wins for larger data sets of roughly 100,000+ elements. All tests were conducted with Python 2.7.9 and NumPy 1.9.2.

Note that in the real case the above functions would be used to compute the statistics for a single block of the raster. The standard deviations and means of all blocks would then be combined with the methodology suggested on Wikipedia (here). This has the advantage that not all elements of the raster need to be summed up, which avoids the float overflow problem (at least to some extent).
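
For reference, a minimal sketch of that combination step for two blocks; the helper name is mine, and M2 denotes the sum of squared deviations as in the Cython code above:

def combine_stats(n_a, mean_a, M2_a, n_b, mean_b, M2_b):
    # parallel variance formula: merge the running statistics of two blocks
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    M2 = M2_a + M2_b + delta * delta * n_a * n_b / n
    return n, mean, M2

The sample standard deviation of the combined data is then sqrt(M2 / (n - 1)), applied block by block.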

Asked Aug 21 '15 by Stefan

1 Answer

I doubt you will find any such functions in numpy. The raison d'être of numpy is that it takes advantage of vector processor instruction sets, performing the same instruction on large amounts of data. Basically, numpy trades memory efficiency for speed. However, due to the memory-intensive nature of Python objects, numpy also achieves certain memory efficiencies by associating the data type with the array as a whole rather than with each individual element.

One way to improve the speed, while still paying some memory overhead, is to calculate the standard deviation in chunks, e.g.:

import numpy as np

def std(arr, blocksize=1000000):
    """Written for py3, change range to xrange for py2.
    This implementation requires the entire array in memory, but it shows how you can
    calculate the standard deviation in a piecemeal way.
    """
    num_blocks, remainder = divmod(len(arr), blocksize)
    mean = arr.mean()
    tmp = np.empty(blocksize, dtype=float)
    total_squares = 0
    for start in range(0, blocksize*num_blocks, blocksize):
        # get a view of the data we want -- views do not "own" the data they point to
        # -- they have minimal memory overhead
        view = arr[start:start+blocksize]
        # in-place operations prevent a new array from being created
        np.subtract(view, mean, out=tmp)
        tmp *= tmp
        total_squares += tmp.sum()
    if remainder:
        # len(arr) % blocksize != 0, so the last part of the array still needs processing;
        # take views of the tails of arr and tmp; views add minimal memory overhead
        view = arr[-remainder:]
        tmp = tmp[-remainder:]
        np.subtract(view, mean, out=tmp)
        tmp *= tmp
        total_squares += tmp.sum()

    var = total_squares / len(arr)
    sd = var ** 0.5
    return sd

a = np.arange(20e6)
assert np.isclose(np.std(a), std(a))

Showing the speedup: the larger the blocksize, the larger the speedup, and the memory overhead is considerably lower (though I'm not entirely sure the memory measurements are 100% accurate).

In [70]: %timeit np.std(a)
10 loops, best of 3: 105 ms per loop

In [71]: %timeit std(a, blocksize=4096)
10 loops, best of 3: 160 ms per loop

In [72]: %timeit std(a, blocksize=1000000)
10 loops, best of 3: 105 ms per loop

In [73]: %memit std(a, blocksize=4096)
peak memory: 360.11 MiB, increment: 0.00 MiB

In [74]: %memit std(a, blocksize=1000000)
peak memory: 360.11 MiB, increment: 0.00 MiB

In [75]: %memit np.std(a)
peak memory: 512.70 MiB, increment: 152.59 MiB
Answered by Dunes