Calculating stats across 1000 arrays

I am writing a Python module that needs to calculate the mean and standard deviation of pixel values across 1000+ arrays with identical dimensions.

I am looking for the fastest way to do this.

Currently I am looping through the arrays, using numpy.dstack to stack the 1000 arrays into one rather large 3-D array, and then calculating the mean across the third dimension. Each array has shape (5000, 4000).
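
For reference, that approach looks roughly like this (a sketch; read_image stands in for whatever loader is being used):

import numpy as np

# dstack stacks along a new third axis, producing a (5000, 4000, 1000) array:
# 20 billion values, roughly 160 GB if stored as float64
stack = np.dstack([read_image(f) for f in filenames])
mean = stack.mean(axis=2)  # mean across the stacking (third) axis
std = stack.std(axis=2)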

This approach is taking quite a long time!

Would anyone be able to advise on a more efficient method of approaching this problem?

Asked Oct 22 '22 by Becky


1 Answer

Maybe you could calculate the mean and standard deviation in a cumulative way, something like this (untested):

import numpy as np

im_size = (5000, 4000)

cum_sum = np.zeros(im_size)             # running sum of pixel values
cum_sum_of_squares = np.zeros(im_size)  # running sum of squared pixel values
n = 0

for filename in filenames:
    # cast to float so image**2 cannot overflow for integer pixel dtypes
    image = read_your_image(filename).astype(np.float64)
    cum_sum += image
    cum_sum_of_squares += image ** 2
    n += 1

mean_image = cum_sum / n
# std via the identity Var[x] = E[x**2] - E[x]**2
std_image = np.sqrt(cum_sum_of_squares / n - mean_image ** 2)

This is probably limited by how fast you can read images from disk; it is not limited by memory, since you only hold one image in memory at a time. Calculating std this way can suffer from numerical problems, because you may end up subtracting two large, nearly equal numbers. If that is a problem, you have to loop over the files twice: first to calculate the mean, then to accumulate (image - mean_image)**2 in a second pass.
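
That second pass could look something like this (a sketch reusing im_size, filenames, mean_image and read_your_image from the loop above):

# second pass: accumulate squared deviations from the already-computed mean
sum_sq_dev = np.zeros(im_size)
n = 0
for filename in filenames:
    image = read_your_image(filename).astype(np.float64)
    sum_sq_dev += (image - mean_image) ** 2
    n += 1

std_image = np.sqrt(sum_sq_dev / n)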

Answered Oct 24 '22 by Bas Swinckels