I want to apply Arnold's cat map to my matrix. Here is my implementation:
```python
import numpy as np

def cat_mapping(matrix, MAX):
    width, height = matrix.shape
    transformed_matrix = np.empty([width, height]).astype(np.uint8)
    counter = 1
    x = np.repeat(np.arange(width).reshape(-1, 1), height, axis=-1).T
    y = np.repeat(np.arange(height).reshape(-1, 1), width, axis=-1)
    nx = (2 * x + y) % width
    ny = (x + y) % height
    while counter <= MAX:
        transformed_matrix[ny, nx] = matrix[y, x]
        matrix = transformed_matrix
        if counter != MAX:
            transformed_matrix = np.empty([width, height])
        counter = counter + 1
    return transformed_matrix
```
This works perfectly. But when the size of the array grows beyond 10000×10000 with a bigger iteration value MAX, the implementation becomes really slow. Even with numba, the result is not satisfactory.

I was thinking: could the transformation be broken into smaller parts and the results combined, the way Divide and Conquer does?

@JeromeRichard helped to make it faster using numba, which is nice. But would it become even faster if we somehow managed to apply the D&C paradigm? I tried it with some demo data like this:
```python
def split(matrix):
    row, col = matrix.shape
    row2, col2 = row // 2, col // 2
    return (matrix[:row2, :col2], matrix[:row2, col2:],
            matrix[row2:, :col2], matrix[row2:, col2:])

main = np.arange(1000 * 1000).reshape(1000, 1000)
a, b, c, d = split(main)
a = cat_mapping_fast(a, 100)
b = cat_mapping_fast(b, 100)
c = cat_mapping_fast(c, 100)
d = cat_mapping_fast(d, 100)
np.vstack((np.hstack((a, b)), np.hstack((c, d))))
```
But I couldn't take the recursion any deeper, because I don't know how to merge the sub-results.
Any solution or hint will be appreciated.
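To make the merge problem concrete, here is a small sketch (the helper name `cat_map_once` is mine, not from the post) showing that mapping the four quadrants independently and stitching them back together does not reproduce the global map: each quadrant map wraps with its own smaller modulus, so pixels that the global map would move across quadrant borders stay trapped inside their quadrant.

```python
import numpy as np

# Hypothetical helper: one cat-map step, vectorized, same index
# formulas as in the question (nx = (2x + y) % width, ny = (x + y) % height).
def cat_map_once(m):
    height, width = m.shape
    y, x = np.indices(m.shape)
    out = np.empty_like(m)
    out[(x + y) % height, (2 * x + y) % width] = m
    return out

m = np.arange(64, dtype=np.uint8).reshape(8, 8)

# Map the whole matrix in one go...
whole = cat_map_once(m)

# ...versus mapping each quadrant independently and stitching:
r, c = 4, 4
stitched = np.vstack((
    np.hstack((cat_map_once(m[:r, :c]), cat_map_once(m[:r, c:]))),
    np.hstack((cat_map_once(m[r:, :c]), cat_map_once(m[r:, c:]))),
))

# Both are permutations of the same values, but not the same permutation:
# the quadrant maps wrap modulo 4 instead of modulo 8.
print(np.array_equal(whole, stitched))  # False
```

So a correct merge step would have to undo the smaller moduli, which is why the naive stitch above cannot be the answer.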
The current code is quite slow because matrix[y, x] creates a new temporary array, and transformed_matrix[ny, nx] = matrix[y, x] is pretty slow since it needs to read nx and ny from the memory hierarchy, and the memory access pattern is not efficient. When the matrix is big, the code should be memory-bound, and the unneeded memory operations become pretty expensive. Note that the np.empty([width, height]) array contains double-precision floating-point numbers that take 8 times more space than np.uint8, so it is 8 times slower to fill in memory.

You can speed up the code a lot using Numba, basic loops and double buffering, so as to avoid creating many temporary arrays and reading big ones. The idea is to compute the indices (ny, nx) on the fly within the loops. Since the modulus is quite expensive and the memory access pattern causes the code to be rather latency-bound, multiple threads are used so as to better saturate the memory bandwidth. Here is the resulting code:
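A quick check of the dtype point: `np.empty` without an explicit dtype defaults to float64, so the rebuilt buffer silently costs 8 bytes per element instead of 1.

```python
import numpy as np

# Default np.empty dtype is float64: 8 bytes per element...
a = np.empty((4, 4))
# ...versus 1 byte per element for an explicit uint8 buffer.
b = np.empty((4, 4), dtype=np.uint8)

print(a.dtype, a.itemsize)  # float64 8
print(b.dtype, b.itemsize)  # uint8 1
```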
```python
import numpy as np
import numba as nb

@nb.njit('uint8[:,::1](uint8[:,::1], int_)', parallel=True)
def cat_mapping_fast(matrix, MAX):
    height, width = matrix.shape
    buff1 = np.empty((height, width), dtype=np.uint8)
    buff2 = np.empty((height, width), dtype=np.uint8)
    buff1[:, :] = matrix
    for counter in range(1, MAX + 1):
        for y in nb.prange(height):
            for x in range(width):
                nx = (2 * x + y) % width
                ny = (x + y) % height
                buff2[ny, nx] = buff1[y, x]
        # Double buffering: swap the roles of the two buffers
        # instead of allocating a new array each iteration.
        buff1, buff2 = buff2, buff1
    return buff1
```
This code is significantly faster than the initial one, especially when the two buffers fit in the CPU cache. When the input matrix is so huge that it does not fit in the cache, the inefficient memory access pattern makes the code a bit slower, but there is not much to be done, since the computation behaves like a big shuffle, which is not cache-friendly. Still, on a 4096x4096 matrix with MAX=20, the Numba code is 25 times faster on my 6-core machine (about 0.38 seconds compared to 10 seconds for the initial code).
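As a sanity check of the double-buffering logic, here is a pure-NumPy sketch (my own code, with Numba dropped so it runs anywhere) that replays the same per-element loop and compares it against a vectorized reference using the question's index formulas; the two must agree on any input.

```python
import numpy as np

def cat_mapping_loops(matrix, MAX):
    """Same loop/double-buffer logic as the Numba version, minus @njit."""
    height, width = matrix.shape
    buff1 = matrix.copy()
    buff2 = np.empty_like(matrix)
    for _ in range(MAX):
        for y in range(height):
            for x in range(width):
                buff2[(x + y) % height, (2 * x + y) % width] = buff1[y, x]
        buff1, buff2 = buff2, buff1  # swap buffers, no new allocation
    return buff1

def cat_mapping_vec(matrix, MAX):
    """Vectorized reference: indices computed once, reused each iteration."""
    height, width = matrix.shape
    y, x = np.indices(matrix.shape)
    ny, nx = (x + y) % height, (2 * x + y) % width
    out = matrix
    for _ in range(MAX):
        nxt = np.empty_like(out)
        nxt[ny, nx] = out
        out = nxt
    return out

m = np.arange(36, dtype=np.uint8).reshape(6, 6)
print(np.array_equal(cat_mapping_loops(m, 5), cat_mapping_vec(m, 5)))  # True
```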