Fastest way to compute image dataset channel wise mean and standard deviation in Python

Tags: python, opencv, computer-vision

I have a huge image dataset that does not fit in memory. I want to compute the mean and standard deviation, loading images from disk.

I'm currently trying to use this algorithm (Welford's online algorithm), found on Wikipedia.

# for a new value newValue, compute the new count, new mean, the new M2.
# mean accumulates the mean of the entire dataset
# M2 aggregates the squared distance from the mean
# count aggregates the amount of samples seen so far
def update(existingAggregate, newValue):
    (count, mean, M2) = existingAggregate
    count = count + 1 
    delta = newValue - mean
    mean = mean + delta / count
    delta2 = newValue - mean
    M2 = M2 + delta * delta2

    return (count, mean, M2)

# retrieve the mean and variance from an aggregate
def finalize(existingAggregate):
    (count, mean, M2) = existingAggregate
    # Check the count first: with count == 1, M2/(count - 1) would divide by zero.
    if count < 2:
        return float('nan')
    else:
        return (mean, M2/(count - 1))

This is my current implementation (computing just for the red channel):

count = 0
mean = 0
delta = 0
delta2 = 0
M2 = 0
for file in tqdm(first):
    image = cv2.imread(file)  # OpenCV returns channels in BGR order
    for i in range(224):
        for j in range(224):
            b, g, r = image[i, j, :]
            newValue = r
            count = count + 1
            delta = newValue - mean
            mean = mean + delta / count
            delta2 = newValue - mean
            M2 = M2 + delta * delta2

print('first mean', mean)
print('first std', np.sqrt(M2 / (count - 1)))

This implementation gives results that are close enough on the subset of the dataset I tried.

The problem is that it is extremely slow and therefore nonviable.

Is there a standard way of doing this?
How can I adapt this to run faster, or otherwise compute the RGB mean and standard deviation over the whole dataset at a reasonable speed, without loading it all into memory at once?

asked Dec 16 '17 21:12

Bruno Klein

2 Answers

Since this is a numerically heavy task (many iterations over a matrix, or a tensor), I always suggest using a library that is good at this: NumPy.

NumPy runs array operations such as mean and std in compiled, vectorized code, and a properly installed NumPy can also use an optimized BLAS (Basic Linear Algebra Subprograms) library for linear algebra. Either way, it makes far better use of the memory hierarchy than looping over pixels in Python.

imread should already give you a NumPy array. Note that OpenCV stores channels in BGR order, so the red channel is index 2, not 0. You can get the red channel as a flattened 1-D array with

import numpy as np
val = np.reshape(image[:,:,2], -1)

its mean with

np.mean(val)

and its standard deviation with

np.std(val)

In this way, you can get rid of the two inner layers of Python loops:

count = 0
mean = 0
delta = 0
delta2 = 0
M2 = 0
for i, file in enumerate(tqdm(first)):
    image = cv2.imread(file)
    val = np.reshape(image[:,:,2], -1)  # red channel; OpenCV stores BGR
    img_mean = np.mean(val)
    img_std = np.std(val)
    ...

The rest of the incremental update should be straightforward.
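One way to finish the incremental update is to merge each image's statistics into the running aggregate with the batch form of Welford's update (the "parallel algorithm" attributed to Chan et al. in the same Wikipedia article). The sketch below is only an illustration: the names merge_batch and finalize_stats are made up here, all channels are handled at once, and random arrays stand in for the images that cv2.imread would return.

```python
import numpy as np

def merge_batch(agg, pixels):
    """Fold one image's pixels into a running (count, mean, M2) aggregate.

    pixels is an (n_pixels, n_channels) array; mean and M2 are per channel.
    """
    count_a, mean_a, m2_a = agg
    pixels = np.asarray(pixels, dtype=np.float64)
    count_b = pixels.shape[0]
    mean_b = pixels.mean(axis=0)
    m2_b = ((pixels - mean_b) ** 2).sum(axis=0)

    # Combine the two aggregates (batch form of Welford's update)
    count = count_a + count_b
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta ** 2 * count_a * count_b / count
    return count, mean, m2

def finalize_stats(agg):
    """Return per-channel mean and sample standard deviation."""
    count, mean, m2 = agg
    return mean, np.sqrt(m2 / (count - 1))

# Synthetic stand-ins for images loaded from disk (assumption: 224x224, 3 channels)
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8) for _ in range(5)]

agg = (0, np.zeros(3), np.zeros(3))
for image in images:
    agg = merge_batch(agg, image.reshape(-1, 3))

mean, std = finalize_stats(agg)
print('mean per channel', mean)
print('std per channel', std)
```

Only one image is held in memory at a time, and the Python-level loop now runs once per image rather than once per pixel. With real cv2.imread output the three columns would be in B, G, R order.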

Once you have done this, the bottleneck becomes image loading speed, which is limited by disk read performance. In that regard, based on my prior experience, I suspect that multithreading, as others suggested, will help a lot.
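A rough sketch of what threaded loading could look like, assuming a thread pool from the standard library and per-channel running sums and sums of squares. The names image_sums and dataset_mean_std are hypothetical, and the load argument is a placeholder for whatever reads an image (for example cv2.imread); here a dictionary of random arrays plays the role of the files. Whether threads actually help depends on the disk and on how much of the decoding releases the GIL, so it is worth measuring.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def image_sums(image):
    """Return (pixel count, per-channel sum, per-channel sum of squares)."""
    pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)
    return pixels.shape[0], pixels.sum(axis=0), (pixels ** 2).sum(axis=0)

def dataset_mean_std(paths, load, workers=4):
    """Per-channel mean and sample std over all images.

    load maps a path to an H x W x C array (e.g. cv2.imread).
    """
    count, total, total_sq = 0, 0.0, 0.0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker loads and reduces one image; only small sums come back
        for n, s, sq in pool.map(lambda p: image_sums(load(p)), paths):
            count += n
            total = total + s
            total_sq = total_sq + sq
    mean = total / count
    var = (total_sq - count * mean ** 2) / (count - 1)
    return mean, np.sqrt(var)

# Stand-in for files on disk: path -> array (assumption for the demo)
rng = np.random.default_rng(1)
fake_files = {f"img{i}.png": rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
              for i in range(6)}

mean, std = dataset_mean_std(list(fake_files), fake_files.__getitem__, workers=3)
print(mean, std)
```

The sum-of-squares formula is simpler to combine across threads than Welford's update. For 8-bit pixel values accumulated in float64 the rounding error should be small, but the Welford-style merge is the safer choice if precision matters.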

answered Sep 18 '22 18:09

Yo Hsiao

You can also use OpenCV's cv2.meanStdDev function, which computes the per-channel mean and standard deviation of an image:

cv2.meanStdDev(src[, mean[, stddev[, mask]]]) → mean, stddev
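cv2.meanStdDev works on one image at a time, so the per-image results still have to be combined into dataset-wide values. A minimal sketch of one way to do that, assuming the function's usual output (per-channel arrays of shape (C, 1), with the standard deviation computed over N rather than N - 1). The helper names add_image_stats and finish_stats are hypothetical, and the per-image inputs are simulated with NumPy so the example runs without OpenCV.

```python
import numpy as np

def add_image_stats(totals, mean, stddev, n_pixels):
    """Fold one image's per-channel mean/stddev into running sums."""
    mean = np.asarray(mean, dtype=np.float64).ravel()
    stddev = np.asarray(stddev, dtype=np.float64).ravel()
    count, total, total_sq = totals
    # sum(x) = n * mean and sum(x^2) = n * (std^2 + mean^2)
    return (count + n_pixels,
            total + n_pixels * mean,
            total_sq + n_pixels * (stddev ** 2 + mean ** 2))

def finish_stats(totals):
    """Return per-channel mean and population standard deviation."""
    count, total, total_sq = totals
    mean = total / count
    var = np.maximum(total_sq / count - mean ** 2, 0.0)  # guard against rounding
    return mean, np.sqrt(var)

# With OpenCV installed, the per-image inputs would come from something like:
#     image = cv2.imread(path)
#     mean, stddev = cv2.meanStdDev(image)
#     n_pixels = image.shape[0] * image.shape[1]
# Here they are simulated with NumPy on random arrays.
rng = np.random.default_rng(2)
images = [rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8) for _ in range(4)]

totals = (0, np.zeros(3), np.zeros(3))
for image in images:
    pixels = image.reshape(-1, 3).astype(np.float64)
    img_mean = pixels.mean(axis=0).reshape(3, 1)   # same shape as meanStdDev output
    img_std = pixels.std(axis=0).reshape(3, 1)     # population std, like meanStdDev
    totals = add_image_stats(totals, img_mean, img_std, pixels.shape[0])

mean, std = finish_stats(totals)
print(mean, std)
```

Note that simply averaging the per-image standard deviations would not give the dataset standard deviation, which is why the sums are rebuilt from each image's mean and stddev.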

answered Sep 20 '22 18:09

Andrey Smorodov
