I'm currently trying to calculate the sum of the sums of all subsquares in a 10,000 x 10,000 array of values. As an example, if my array was:
1 1 1
2 2 2
3 3 3
I want the result to be:
1+1+1+2+2+2+3+3+3 [sum of squares of size 1]
+(1+1+2+2)+(1+1+2+2)+(2+2+3+3)+(2+2+3+3) [sum of squares of size 2]
+(1+1+1+2+2+2+3+3+3) [sum of squares of size 3]
________________________________________
68
So, as a first try, I wrote a very simple Python script to do that. As it was O(k^2 * n^2) (n being the size of the big array and k the size of the subsquares we are summing), the processing was awfully long.
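The brute force looked roughly like this (a minimal sketch):

def getSumNaive(tab, size):
    # Sum every size x size square directly from its top-left corner
    n = len(tab)
    total = 0
    for i in range(0, n - size + 1):
        for j in range(0, n - size + 1):
            for di in range(size):
                for dj in range(size):
                    total += tab[i + di][j + dj]
    return total

To speed it up, I wrote another algorithm that is O(n^2) per subsquare size: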
import numpy

def getSum(tab, size):
    n = len(tab)
    tmp = numpy.zeros((n, n))
    # First pass: vertical sliding sums of height `size`, one per column
    for i in range(0, n):
        sum = 0
        for j in range(0, size):
            sum += tab[j][i]
        tmp[0][i] = sum
        for j in range(1, n - size + 1):
            sum += (tab[j + size - 1][i] - tab[j - 1][i])
            tmp[j][i] = sum
    # Second pass: horizontal sliding sums of width `size` over the vertical sums
    finalsum = 0
    for i in range(0, n - size + 1):
        sum = 0
        for j in range(0, size):
            sum += tmp[i][j]
        finalsum += sum
        for j in range(1, n - size + 1):
            finalsum += (tmp[i][j + size - 1] - tmp[i][j - 1])
    return finalsum
So this code works fine. Given an array and a size of subsquares, it returns the sum of the values in all those subsquares. I basically iterate over the size of subsquares to get all the possible values.
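The driver loop then simply looks like this:

total = 0
for size in range(1, len(tab) + 1):
    total += getSum(tab, size)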
The problem is that this is again way too long for big arrays (over 20 days for a 10,000 x 10,000 array). I googled it and learned that I could vectorize the iterations over arrays with numpy. However, I couldn't figure out how to do that in my case...
If someone can help me speed my algorithm up, or point me to good documentation on the subject, I'll be glad!
Thank you!
Those sliding summations are best suited to be calculated as 2D convolutions, and those could be efficiently calculated with scipy's convolve2d. Thus, for a specific size, you could get the summations like so -
import numpy as np
from scipy.signal import convolve2d

def getSum(tab, size):
    # Define an all-ones kernel and convolve to get the sliding windowed summations
    kernel = np.ones((size, size), dtype=tab.dtype)
    return convolve2d(tab, kernel, mode='valid').sum()
To get the summations across all sizes, I think the best way, both in terms of memory and performance efficiency, would be to use a loop over all possible sizes. Thus, to get the final summation, you would have -
def getAllSums(tab):
    finalSum = 0
    for i in range(tab.shape[0]):
        finalSum += getSum(tab, i + 1)
    return finalSum
Sample run -
In [51]: tab
Out[51]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
In [52]: getSum(tab,1) # sum of squares of size 1
Out[52]: 18
In [53]: getSum(tab,2) # sum of squares of size 2
Out[53]: 32
In [54]: getSum(tab,3) # sum of squares of size 3
Out[54]: 18
In [55]: getAllSums(tab) # sum of squares of all sizes
Out[55]: 68
Following the excellent idea of @Divakar, I would suggest using integral images to avoid the repeated convolutions. If the matrix is very big, you have to convolve it several times (once for each kernel size). Several convolutions (or rather, evaluations of sums inside a square) can be computed very efficiently using integral images (aka summed area tables).
Once an integral image M is computed, the sum of all values inside a region (x0, y0) - (x1, y1) can be computed with just 4 arithmetic operations, regardless of the size of the window (see the summed-area-table picture on Wikipedia):
M[x1, y1] - M[x1, y0] - M[x0, y1] + M[x0, y0]
This can be very easily vectorized in numpy. An integral image can be calculated with cumsum. Following the example:
tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
M = tab.cumsum(0).cumsum(1) # Create integral images
M = np.pad(M, ((1,0), (1,0)), mode='constant') # pad it with a row and column of zeros
M is padded with a row and a column of zeros to handle the first row and column (where x0 = 0 or y0 = 0).
Then, given a window size W, the sum of EVERY window of size W can be computed efficiently and fully vectorized with numpy as:
all_sums = M[W:, W:] - M[:-W, W:] - M[W:, :-W] + M[:-W, :-W]
Note that the vectorized operation above calculates the sum of every window, i.e. it evaluates the 4-term corner formula at every valid window position at once. The sum of all windows is then calculated as
total = all_sums.sum()
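For the 3x3 example above with W = 2, this is easy to check by hand:

>>> M
array([[ 0,  0,  0,  0],
       [ 0,  1,  2,  3],
       [ 0,  3,  6,  9],
       [ 0,  6, 12, 18]])
>>> M[2:, 2:] - M[:-2, 2:] - M[2:, :-2] + M[:-2, :-2]
array([[ 6,  6],
       [10, 10]])

These are exactly the four size-2 window sums from the question (1+1+2+2 = 6 and 2+2+3+3 = 10), and they add up to 32.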
Note that, unlike with convolutions, for N different sizes the integral image has to be computed only once. Thus, the code can be written very efficiently as:
def get_all_sums(A):
    # Integral image, padded with a leading row and column of zeros
    M = A.cumsum(0).cumsum(1)
    M = np.pad(M, ((1, 0), (1, 0)), mode='constant')
    total = 0
    for W in range(1, A.shape[0] + 1):
        tmp = M[W:, W:] + M[:-W, :-W] - M[:-W, W:] - M[W:, :-W]
        total += tmp.sum()
    return total
The output for the example:
>>> get_all_sums(tab)
68
Some timings comparing convolutions to integral images with different size matrices. getAllSums refers to Divakar's convolution-based method, while get_all_sums refers to the integral image based method described above:
>>> R1 = np.random.randn(10, 10)
>>> R2 = np.random.randn(100, 100)
1) With R1, a 10x10 matrix:
>>> %time getAllSums(R1)
CPU times: user 353 µs, sys: 9 µs, total: 362 µs
Wall time: 335 µs
2393.5912717342017
>>> %time get_all_sums(R1)
CPU times: user 243 µs, sys: 0 ns, total: 243 µs
Wall time: 248 µs
2393.5912717342012
2) With R2, a 100x100 matrix:
>>> %time getAllSums(R2)
CPU times: user 698 ms, sys: 0 ns, total: 698 ms
Wall time: 701 ms
176299803.29826894
>>> %time get_all_sums(R2)
CPU times: user 2.51 ms, sys: 0 ns, total: 2.51 ms
Wall time: 2.47 ms
176299803.29826882
Note that using integral images is about 300 times faster than convolutions for large enough matrices.
Based on the idea of counting how many times each number is counted, I came up with this simple code:
def get_sum(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                # k is the square size; count is how many k x k squares contain matrix[i][j]
                count = min(k, n - k + 1, i + 1, n - i) * min(k, n - k + 1, j + 1, n - j)
                ret += count * matrix[i][j]
    return ret
a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(get_sum(a, 3))  # 68
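To see why the count formula is right, take the center element of the 3x3 example (the 2 at i = j = 1): for k = 1 it is covered by min(1, 3, 2, 2)^2 = 1 square, for k = 2 by min(2, 2, 2, 2)^2 = 4 squares, and for k = 3 by min(3, 1, 2, 2)^2 = 1 square; 6 squares in total, which matches the expansion in the question.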
Divakar's solution is fantastic; however, I think mine could be more efficient, at least in asymptotic time complexity (O(n^3), compared with the O(n^3 log n) of Divakar's approach when the convolutions are FFT-based).
I have an O(n^2) solution now...
Basically, we can use the fact that min(k, n - k + 1) equals k for k <= (n + 1) // 2 and n + 1 - k afterwards, and split the loop over k accordingly:
def get_sum2(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            half = (n + 1) // 2
            # k <= half: min(k, n - k + 1) == k
            for k in range(1, half + 1):
                count = min(k, x) * min(k, y)
                ret += count * matrix[i][j]
            # k > half: min(k, n - k + 1) == n + 1 - k
            for k in range(half + 1, n + 1):
                count = min(n + 1 - k, x) * min(n + 1 - k, y)
                ret += count * matrix[i][j]
    return ret
You can see that sum(min(k, x) * min(k, y)) over a contiguous range of k (such as 1 <= k <= n/2 above) can be calculated in O(1), because the summand takes only three forms depending on where k falls relative to x and y.
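Concretely, with m = min(x, y) and M = max(x, y), and assuming the upper limit K >= M (the boundary cases collapse the same way):

sum_{k=1..K} min(k, x) * min(k, y)
    = sum_{k=1..m} k^2              (k below both x and y)
    + m * sum_{k=m+1..M} k          (k between the two)
    + x * y * (K - M)               (k above both)
    = m(m+1)(2m+1)/6 + m * (M-m)(m+M+1)/2 + x*y*(K-M)

These three pieces are exactly get_square_sum, get_linear_sum and the last term of get_count below.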
So we arrive at this O(n^2) code:
def get_square_sum(n):
    # sum of k^2 for k = 1..n
    return n * (n + 1) * (2 * n + 1) // 6

def get_linear_sum(a, b):
    # sum of k for k = a..b
    return (b - a + 1) * (a + b) // 2

def get_count(x, y, k_end):
    # sum of min(k, x) * min(k, y) for k = 1..k_end
    # k <= min(x, y): the term is k * k
    sum1 = get_square_sum(min(x, y))
    # min(x, y) < k <= max(x, y): the term is k * min(x, y)
    sum2 = get_linear_sum(min(x, y) + 1, max(x, y)) * min(x, y)
    # k > max(x, y): the term is x * y
    sum3 = x * y * (k_end - max(x, y))
    return sum1 + sum2 + sum3

def get_sum3(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            half = n // 2
            # k <= half: count is min(k, x) * min(k, y)
            ret += get_count(x, y, half) * matrix[i][j]
            # k > half: substituting k' = n + 1 - k, the count is
            # min(k', x) * min(k', y) with k' running over 1..n - half = half + n % 2
            ret += get_count(x, y, half + n % 2) * matrix[i][j]
    return ret
Test:
a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
n = 1000
b = [[1] * n] * n
print(get_sum3(a, 3))  # 68
print(get_sum3(b, n))  # 33500333666800
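As a quick sanity check, here is a tiny brute force (brute_force and the test matrix c are just for illustration) that recomputes the total directly; the two methods agree on small random inputs:

import random

def brute_force(matrix, n):
    # Sum every k x k square directly, for every k
    total = 0
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                total += sum(matrix[i + di][j + dj]
                             for di in range(k) for dj in range(k))
    return total

m = 7
c = [[random.randint(0, 9) for _ in range(m)] for _ in range(m)]
assert get_sum3(c, m) == brute_force(c, m)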
You can rewrite my O(n^2) Python code in C, and I believe it will result in a very efficient solution...