
How can I vectorize and speed up this large array calculation?

I'm currently trying to calculate the sum of the sums of all subsquares in a 10,000 x 10,000 array of values. As an example, if my array were:

1 1 1 
2 2 2
3 3 3 

I want the result to be :

1+1+1+2+2+2+3+3+3                        [sum of squares of size 1]
+(1+1+2+2)+(1+1+2+2)+(2+2+3+3)+(2+2+3+3) [sum of squares of size 2]
+(1+1+1+2+2+2+3+3+3)                     [sum of squares of size 3]
________________________________________
68
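For reference, this definition can be checked with a direct brute-force sketch (the function name is mine, chosen for illustration); it simply slices out every k x k subsquare and adds up its values:

```python
import numpy as np

def brute_force_all_sums(tab):
    """Sum of the sums of every k x k subsquare, for all k from 1 to n."""
    n = tab.shape[0]
    total = 0
    for k in range(1, n + 1):            # subsquare size
        for i in range(n - k + 1):       # top-left row of the subsquare
            for j in range(n - k + 1):   # top-left column of the subsquare
                total += tab[i:i+k, j:j+k].sum()
    return total

tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(brute_force_all_sums(tab))  # -> 68
```

This is only usable for tiny inputs, but it gives a ground truth to test faster versions against.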

So, as a first try, I wrote a very simple Python code to do that. As it ran in O(k^2 * n^2) (n being the size of the big array and k the size of the subsquares), the processing was awfully long. I wrote another algorithm in O(n^2) per size to speed it up:

import numpy

def getSum(tab, size):
    n = len(tab)
    tmp = numpy.zeros((n, n))

    # First pass: for each column i, sliding sum of `size` consecutive rows
    for i in xrange(0, n):
        sum = 0
        for j in xrange(0, size):
            sum += tab[j][i]
        tmp[0][i] = sum

        for j in xrange(1, n - size + 1):
            sum += (tab[j + size - 1][i] - tab[j - 1][i])
            tmp[j][i] = sum

    # Second pass: sliding sums of `size` consecutive column sums
    finalsum = 0
    for i in xrange(0, n - size + 1):
        sum = 0
        for j in xrange(0, size):
            sum += tmp[i][j]
        finalsum += sum

        for j in xrange(1, n - size + 1):
            finalsum += (tmp[i][j + size - 1] - tmp[i][j - 1])

    return finalsum

So this code works fine. Given an array and a subsquare size, it returns the sum of the values over all subsquares of that size. I basically iterate over the subsquare sizes to get all the possible sums.

The problem is that this is again way too long for big arrays (over 20 days for a 10,000 x 10,000 array). I googled it and learned I could vectorize the iterations over arrays with NumPy. However, I couldn't figure out how to do so in my case...

If someone can help me speed my algorithm up, or point me to good documentation on the subject, I'll be glad!

Thank you!

madfrog, asked May 18 '16


3 Answers

Those sliding summations are best suited to be calculated as 2D convolutions, which can be computed efficiently with SciPy's convolve2d. Thus, for a specific size, you could get the summations like so -

import numpy as np
from scipy.signal import convolve2d

def getSum(tab, size):
    # A ones kernel with 'valid' mode yields every sliding windowed summation
    kernel = np.ones((size, size), dtype=tab.dtype)
    return convolve2d(tab, kernel, mode='valid').sum()

To get the summations across all sizes, I think the best way, both in terms of memory and performance, would be to loop over all possible sizes. Thus, to get the final summation, you would have -

def getAllSums(tab):
    finalSum = 0
    for i in range(tab.shape[0]):
        finalSum += getSum(tab,i+1)
    return finalSum

Sample run -

In [51]: tab
Out[51]: 
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

In [52]: getSum(tab,1) # sum of squares of size 1
Out[52]: 18

In [53]: getSum(tab,2) # sum of squares of size 2
Out[53]: 32

In [54]: getSum(tab,3) # sum of squares of size 3
Out[54]: 18

In [55]: getAllSums(tab) # sum of squares of all sizes
Out[55]: 68
Divakar, answered Nov 02 '22

Following the excellent idea of @Divakar, I would suggest using integral images to speed up the computation. With convolutions, if the matrix is very big, you have to convolve it several times (once for each kernel size). Those repeated evaluations of sums inside a square can instead be computed very efficiently using integral images (aka summed area tables).

Once an integral image M is computed, the sum of all values inside a region (x0, y0) - (x1, y1) can be obtained with just four arithmetic operations, regardless of the size of the window:

M[x1, y1] - M[x1, y0] - M[x0, y1] + M[x0, y0]


This can be very easily vectorized in NumPy. An integral image can be calculated with cumsum. Following the example:

import numpy as np

tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
M = tab.cumsum(0).cumsum(1)  # create the integral image
M = np.pad(M, ((1,0), (1,0)), mode='constant')  # pad it with a row and column of zeros

M is padded with a row and a column of zeros to handle the first row (where x0 = 0 or y0 = 0).
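As a quick sanity check, the four-term formula can be compared against a direct slice sum (the region coordinates here are arbitrary, chosen for illustration; note that the zero padding shifts every index by one, so M's indices act as exclusive upper bounds):

```python
import numpy as np

tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
M = np.pad(tab.cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode='constant')

# Sum of tab[x0:x1, y0:y1] recovered from four corners of the integral image
x0, y0, x1, y1 = 1, 0, 3, 2
via_integral = M[x1, y1] - M[x1, y0] - M[x0, y1] + M[x0, y0]
print(via_integral, tab[x0:x1, y0:y1].sum())  # both are 10
```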

Then, given a window size W, the sum of every window of size W can be computed efficiently, fully vectorized with NumPy, as:

all_sums = M[W:, W:] - M[:-W, W:] - M[W:, :-W] + M[:-W, :-W]

Note that the vectorized operation above calculates the sum of every W x W window of the matrix at once. The sum over all windows of that size is then calculated as

total = all_sums.sum()
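Continuing the 3 x 3 example from the question with W = 2, the slice expression reproduces the four 2 x 2 window sums directly:

```python
import numpy as np

tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
M = np.pad(tab.cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode='constant')

W = 2
# Each entry is the sum of one W x W window, computed from four corners of M
all_sums = M[W:, W:] - M[:-W, W:] - M[W:, :-W] + M[:-W, :-W]
print(all_sums)        # [[ 6  6] [10 10]]
print(all_sums.sum())  # 32, matching the size-2 term in the question
```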

Note that, unlike with convolutions, the integral image has to be computed only once for all N window sizes; thus, the code can be written very efficiently as:

def get_all_sums(A):
    M = A.cumsum(0).cumsum(1)
    M = np.pad(M, ((1,0), (1,0)), mode='constant')

    total = 0
    for W in range(1, A.shape[0] + 1):
        tmp = M[W:, W:] + M[:-W, :-W] - M[:-W, W:] - M[W:, :-W]
        total += tmp.sum()

    return total

The output for the example:

>>> get_all_sums(tab)
68

Some timings comparing convolutions to integral images with different-size matrices. getAllSums refers to Divakar's convolutional method, while get_all_sums refers to the integral-image-based method described above:

>>> R1 = np.random.randn(10, 10)
>>> R2 = np.random.randn(100, 100)

1) With R1 10x10 matrix:

>>> %time getAllSums(R1)
CPU times: user 353 µs, sys: 9 µs, total: 362 µs
Wall time: 335 µs
2393.5912717342017

>>> %time get_all_sums(R1)
CPU times: user 243 µs, sys: 0 ns, total: 243 µs
Wall time: 248 µs
2393.5912717342012

2) With R2 100x100 matrix:

>>> %time getAllSums(R2)
CPU times: user 698 ms, sys: 0 ns, total: 698 ms
Wall time: 701 ms
176299803.29826894

>>> %time get_all_sums(R2)
CPU times: user 2.51 ms, sys: 0 ns, total: 2.51 ms
Wall time: 2.47 ms
176299803.29826882

Note that using integral images is roughly 300 times faster than convolutions for large enough matrices.

Imanol Luengo, answered Nov 02 '22


Based on the idea of calculating how many times each number is counted, I came to this simple code:

def get_sum(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                # k is the square size; count is how many k x k squares contain cell (i, j)
                count = min(k, n - k + 1, i + 1, n - i) * min(k, n - k + 1, j + 1, n - j)
                ret += count * matrix[i][j]
    return ret

a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]

print get_sum(a, 3) # 68
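One way to convince yourself of the count formula is to compare it, for a small n, against an explicit enumeration of the windows containing each cell (this check is mine, not part of the original answer):

```python
def count_windows(i, j, k, n):
    """Number of k x k windows of an n x n grid containing cell (i, j),
    counted by enumerating every valid top-left corner."""
    total = 0
    for r in range(n - k + 1):
        for c in range(n - k + 1):
            if r <= i < r + k and c <= j < c + k:
                total += 1
    return total

n = 4
for i in range(n):
    for j in range(n):
        for k in range(1, n + 1):
            formula = (min(k, n - k + 1, i + 1, n - i)
                       * min(k, n - k + 1, j + 1, n - j))
            assert formula == count_windows(i, j, k, n)
print("formula matches enumeration for n =", n)
```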

Divakar's solution is fantastic; however, I think mine could be more efficient, at least in asymptotic time complexity (O(n^3) compared with Divakar's O(n^3 log n)).


I have an O(n^2) solution now...

Basically, we can get that:

def get_sum2(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            half = (n + 1) / 2
            # k <= half
            for k in range(1, half + 1):
                count = min(k, x) * min(k, y)
                ret += count * matrix[i][j]
            # k > half, using the symmetry count(k) = count(n + 1 - k)
            for k in range(half + 1, n + 1):
                count = min(n + 1 - k, x) * min(n + 1 - k, y)
                ret += count * matrix[i][j]
    return ret

You can see that the partial sum of min(k, x) * min(k, y) over a range of k can be calculated in O(1) with closed-form formulas.
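To see this concretely, here is a standalone sketch of the three-piece closed form (mirroring the get_count helper below; valid when k_end >= max(x, y)), checked against a direct loop:

```python
def closed_form(x, y, k_end):
    lo, hi = min(x, y), max(x, y)
    # k <= lo: term is k * k          -> sum of squares
    s1 = lo * (lo + 1) * (2 * lo + 1) // 6
    # lo < k <= hi: term is k * lo    -> arithmetic series times lo
    s2 = (hi - lo) * (lo + 1 + hi) // 2 * lo
    # k > hi: term is x * y, constant
    s3 = x * y * (k_end - hi)
    return s1 + s2 + s3

def direct(x, y, k_end):
    return sum(min(k, x) * min(k, y) for k in range(1, k_end + 1))

for x in range(1, 6):
    for y in range(1, 6):
        k_end = max(x, y) + 3
        assert closed_form(x, y, k_end) == direct(x, y, k_end)
print("closed form matches direct summation")
```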

So we come to this O(n^2) code:

def get_square_sum(n):
    return n * (n + 1) * (2 * n + 1) / 6


def get_linear_sum(a, b):
    return (b - a + 1) * (a + b) / 2


def get_count(x, y, k_end):
    # k <= min(x, y), count is k*k
    sum1 = get_square_sum(min(x, y))

    # k > min(x, y) and k <= max(x, y), count is k * min(x, y)
    sum2 = get_linear_sum(min(x, y) + 1, max(x, y)) * min(x, y)

    # k > max(x, y), count is x * y
    sum3 = x * y * (k_end - max(x, y))

    return sum1 + sum2 + sum3


def get_sum3(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            half = n / 2

            # k <= half
            ret += get_count(x, y, half) * matrix[i][j]
            # k > half, using the symmetry count(k) = count(n + 1 - k)
            ret += get_count(x, y, half + n % 2) * matrix[i][j]

    return ret 

Test:

a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
n = 1000
b = [[1] * n] * n
print get_sum3(a, 3) # 68
print get_sum3(b, n) # 33500333666800

You can rewrite my O(n^2) Python code in C, and I believe it will result in a very efficient solution...

Sayakiss, answered Nov 02 '22