I'm currently trying to calculate the sum of the sums of all subsquares in a 10,000 x 10,000 array of values. As an example, if my array was:
1 1 1
2 2 2
3 3 3
I want the result to be:
1+1+1+2+2+2+3+3+3 [sum of squares of size 1]
+(1+1+2+2)+(1+1+2+2)+(2+2+3+3)+(2+2+3+3) [sum of squares of size 2]
+(1+1+1+2+2+2+3+3+3) [sum of squares of size 3]
________________________________________
68
So, as a first try, I wrote a very simple Python script to do that. As it was O(k^2 * n^2) (n being the size of the big array and k the size of the subsquares we are summing), the processing was awfully long.
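The brute force looked roughly like this (a minimal sketch):

def getSumNaive(tab, size):
    # Sum every size x size square directly from its top-left corner
    n = len(tab)
    total = 0
    for i in range(0, n - size + 1):
        for j in range(0, n - size + 1):
            for di in range(size):
                for dj in range(size):
                    total += tab[i + di][j + dj]
    return total

To speed it up, I wrote another algorithm that is O(n^2) per subsquare size: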
import numpy

def getSum(tab, size):
    n = len(tab)
    tmp = numpy.zeros((n, n))
    # First pass: vertical sliding sums of height `size`, one per column
    for i in range(0, n):
        sum = 0
        for j in range(0, size):
            sum += tab[j][i]
        tmp[0][i] = sum
        for j in range(1, n - size + 1):
            sum += (tab[j + size - 1][i] - tab[j - 1][i])
            tmp[j][i] = sum
    # Second pass: horizontal sliding sums of width `size` over the vertical sums
    finalsum = 0
    for i in range(0, n - size + 1):
        sum = 0
        for j in range(0, size):
            sum += tmp[i][j]
        finalsum += sum
        for j in range(1, n - size + 1):
            finalsum += (tmp[i][j + size - 1] - tmp[i][j - 1])
    return finalsum
So this code works fine. Given an array and a size of subsquares, it returns the sum of the values in all those subsquares. I basically iterate over the size of subsquares to get all the possible values.
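The driver loop then simply looks like this:

total = 0
for size in range(1, len(tab) + 1):
    total += getSum(tab, size)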
The problem is that this is again way too long for big arrays (over 20 days for a 10,000 x 10,000 array). I googled it and learned that I could vectorize the iterations over arrays with numpy. However, I couldn't figure out how to do that in my case...
If someone can help me speed my algorithm up, or point me to good documentation on the subject, I'll be glad!
Thank you!
Those sliding summations are best suited to be calculated as 2D convolutions, and those could be efficiently calculated with scipy's convolve2d. Thus, for a specific size, you could get the summations like so -
import numpy as np
from scipy.signal import convolve2d

def getSum(tab, size):
    # Define an all-ones kernel and convolve to get the sliding windowed summations
    kernel = np.ones((size, size), dtype=tab.dtype)
    return convolve2d(tab, kernel, mode='valid').sum()
To get the summations across all sizes, I think the best way, both in terms of memory and performance efficiency, would be to use a loop over all possible sizes. Thus, to get the final summation, you would have -
def getAllSums(tab):
    finalSum = 0
    for i in range(tab.shape[0]):
        finalSum += getSum(tab, i + 1)
    return finalSum
Sample run -
In [51]: tab
Out[51]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
In [52]: getSum(tab,1) # sum of squares of size 1
Out[52]: 18
In [53]: getSum(tab,2) # sum of squares of size 2
Out[53]: 32
In [54]: getSum(tab,3) # sum of squares of size 3
Out[54]: 18
In [55]: getAllSums(tab) # sum of squares of all sizes
Out[55]: 68
Following the excellent idea of @Divakar, I would suggest using integral images to avoid the repeated convolutions. If the matrix is very big, you have to convolve it several times (once for each kernel size). Several convolutions (or rather, evaluations of sums inside a square) can be computed very efficiently using integral images (aka summed area tables).
Once an integral image M is computed, the sum of all values inside a region (x0, y0) - (x1, y1) can be computed with just 4 arithmetic operations, regardless of the size of the window (see the summed-area-table picture on Wikipedia):
M[x1, y1] - M[x1, y0] - M[x0, y1] + M[x0, y0]
This can be very easily vectorized in numpy. An integral image can be calculated with cumsum. Following the example:
tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
M = tab.cumsum(0).cumsum(1) # Create integral images
M = np.pad(M, ((1,0), (1,0)), mode='constant') # pad it with a row and column of zeros
M is padded with a row and a column of zeros to handle the first row and column (where x0 = 0 or y0 = 0).
Then, given a window size W, the sum of EVERY window of size W can be computed efficiently and fully vectorized with numpy as:
all_sums = M[W:, W:] - M[:-W, W:] - M[W:, :-W] + M[:-W, :-W]
Note that the vectorized operation above calculates the sum of every window, i.e. it evaluates the 4-term corner formula at every valid window position at once. The sum of all windows is then calculated as
total = all_sums.sum()
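For the 3x3 example above with W = 2, this is easy to check by hand:

>>> M
array([[ 0,  0,  0,  0],
       [ 0,  1,  2,  3],
       [ 0,  3,  6,  9],
       [ 0,  6, 12, 18]])
>>> M[2:, 2:] - M[:-2, 2:] - M[2:, :-2] + M[:-2, :-2]
array([[ 6,  6],
       [10, 10]])

These are exactly the four size-2 window sums from the question (1+1+2+2 = 6 and 2+2+3+3 = 10), and they add up to 32.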
Note that, unlike with convolutions, for N different sizes the integral image has to be computed only once. Thus, the code can be written very efficiently as:
def get_all_sums(A):
    # Integral image, padded with a leading row and column of zeros
    M = A.cumsum(0).cumsum(1)
    M = np.pad(M, ((1, 0), (1, 0)), mode='constant')
    total = 0
    for W in range(1, A.shape[0] + 1):
        tmp = M[W:, W:] + M[:-W, :-W] - M[:-W, W:] - M[W:, :-W]
        total += tmp.sum()
    return total
The output for the example:
>>> get_all_sums(tab)
68
Some timings comparing convolutions to integral images with different size matrices. getAllSums refers to Divakar's convolution-based method, while get_all_sums refers to the integral image based method described above:
>>> R1 = np.random.randn(10, 10)
>>> R2 = np.random.randn(100, 100)
1) With R1, a 10x10 matrix:
>>> %time getAllSums(R1)
CPU times: user 353 µs, sys: 9 µs, total: 362 µs
Wall time: 335 µs
2393.5912717342017
>>> %time get_all_sums(R1)
CPU times: user 243 µs, sys: 0 ns, total: 243 µs
Wall time: 248 µs
2393.5912717342012
2) With R2, a 100x100 matrix:
>>> %time getAllSums(R2)
CPU times: user 698 ms, sys: 0 ns, total: 698 ms
Wall time: 701 ms
176299803.29826894
>>> %time get_all_sums(R2)
CPU times: user 2.51 ms, sys: 0 ns, total: 2.51 ms
Wall time: 2.47 ms
176299803.29826882
Note that using integral images is about 300 times faster than convolutions for large enough matrices.
Based on the idea of counting how many times each number is counted, I came up with this simple code:
def get_sum(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                # k is the square size; count is how many k x k squares contain matrix[i][j]
                count = min(k, n - k + 1, i + 1, n - i) * min(k, n - k + 1, j + 1, n - j)
                ret += count * matrix[i][j]
    return ret
a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(get_sum(a, 3))  # 68
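To see why the count formula is right, take the center element of the 3x3 example (the 2 at i = j = 1): for k = 1 it is covered by min(1, 3, 2, 2)^2 = 1 square, for k = 2 by min(2, 2, 2, 2)^2 = 4 squares, and for k = 3 by min(3, 1, 2, 2)^2 = 1 square; 6 squares in total, which matches the expansion in the question.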
Divakar's solution is fantastic; however, I think mine could be more efficient, at least in asymptotic time complexity (O(n^3), compared with the O(n^3 log n) of Divakar's approach when the convolutions are FFT-based).
I have an O(n^2) solution now...
Basically, we can use the fact that min(k, n - k + 1) equals k for k <= (n + 1) // 2 and n + 1 - k afterwards, and split the loop over k accordingly:
def get_sum2(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            half = (n + 1) // 2
            # k <= half: min(k, n - k + 1) == k
            for k in range(1, half + 1):
                count = min(k, x) * min(k, y)
                ret += count * matrix[i][j]
            # k > half: min(k, n - k + 1) == n + 1 - k
            for k in range(half + 1, n + 1):
                count = min(n + 1 - k, x) * min(n + 1 - k, y)
                ret += count * matrix[i][j]
    return ret
You can see that sum(min(k, x) * min(k, y)) over a contiguous range of k (such as 1 <= k <= n/2 above) can be calculated in O(1), because the summand takes only three forms depending on where k falls relative to x and y.
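Concretely, with m = min(x, y) and M = max(x, y), and assuming the upper limit K >= M (the boundary cases collapse the same way):

sum_{k=1..K} min(k, x) * min(k, y)
    = sum_{k=1..m} k^2              (k below both x and y)
    + m * sum_{k=m+1..M} k          (k between the two)
    + x * y * (K - M)               (k above both)
    = m(m+1)(2m+1)/6 + m * (M-m)(m+M+1)/2 + x*y*(K-M)

These three pieces are exactly get_square_sum, get_linear_sum and the last term of get_count below.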
So we arrive at this O(n^2) code:
def get_square_sum(n):
    # sum of k^2 for k = 1..n
    return n * (n + 1) * (2 * n + 1) // 6

def get_linear_sum(a, b):
    # sum of k for k = a..b
    return (b - a + 1) * (a + b) // 2

def get_count(x, y, k_end):
    # sum of min(k, x) * min(k, y) for k = 1..k_end
    # k <= min(x, y): the term is k * k
    sum1 = get_square_sum(min(x, y))
    # min(x, y) < k <= max(x, y): the term is k * min(x, y)
    sum2 = get_linear_sum(min(x, y) + 1, max(x, y)) * min(x, y)
    # k > max(x, y): the term is x * y
    sum3 = x * y * (k_end - max(x, y))
    return sum1 + sum2 + sum3

def get_sum3(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            half = n // 2
            # k <= half: count is min(k, x) * min(k, y)
            ret += get_count(x, y, half) * matrix[i][j]
            # k > half: substituting k' = n + 1 - k, the count is
            # min(k', x) * min(k', y) with k' running over 1..n - half = half + n % 2
            ret += get_count(x, y, half + n % 2) * matrix[i][j]
    return ret
Test:
a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
n = 1000
b = [[1] * n] * n
print(get_sum3(a, 3))  # 68
print(get_sum3(b, n))  # 33500333666800
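As a quick sanity check, here is a tiny brute force (brute_force and the test matrix c are just for illustration) that recomputes the total directly; the two methods agree on small random inputs:

import random

def brute_force(matrix, n):
    # Sum every k x k square directly, for every k
    total = 0
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                total += sum(matrix[i + di][j + dj]
                             for di in range(k) for dj in range(k))
    return total

m = 7
c = [[random.randint(0, 9) for _ in range(m)] for _ in range(m)]
assert get_sum3(c, m) == brute_force(c, m)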
You can rewrite my O(n^2) Python code in C, and I believe it will result in a very efficient solution...