Optimizing array transposing function

I'm working on a homework assignment, and I've been stuck for hours on my solution. The problem we've been given is to optimize the following code, so that it runs faster, regardless of how messy it becomes. We're supposed to use stuff like exploiting cache blocks and loop unrolling.

Problem:

//transpose a dim x dim matrix from src into dst by swapping all i,j with j,i
void transpose(int *dst, int *src, int dim) {
    int i, j;

    for(i = 0; i < dim; i++) {
        for(j = 0; j < dim; j++) {
            dst[j*dim + i] = src[i*dim + j];
        }
    }
}

What I have so far:

//attempt 1
void transpose(int *dst, int *src, int dim) {
    int i, j, id, jd;

    id = 0;
    for(i = 0; i < dim; i++, id+=dim) {
        jd = 0;
        for(j = 0; j < dim; j++, jd+=dim) {
            dst[jd + i] = src[id + j];
        }
    }
}

//attempt 2
void transpose(int *dst, int *src, int dim) {
    int i, j, id;
    int *pd, *ps;
    id = 0;
    for(i = 0; i < dim; i++, id+=dim) {
        pd = dst + i;
        ps = src + id;
        for(j = 0; j < dim; j++) {
            *pd = *ps++;
                pd += dim;
        }
    }
}

Some ideas, please correct me if I'm wrong:

I have thought about loop unrolling, but I don't think that would help, because we don't know whether the NxN matrix has prime dimensions or not. If I checked for that, it would add extra calculations, which would just slow down the function.
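For reference, the usual way around the "dimension might not divide evenly" concern is a cleanup loop rather than a divisibility check on every iteration. Here is a minimal sketch, assuming an unroll factor of 4; the name transpose_unrolled and the factor are illustrative choices, not part of the assignment, and whether it actually runs faster would need to be measured:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of transpose(): inner loop unrolled by 4.
 * The unroll factor is an assumption chosen for illustration. */
void transpose_unrolled(int *dst, const int *src, int dim) {
    int i, j;
    /* Largest multiple of 4 that is <= dim, computed once up front. */
    int limit = dim - (dim % 4);

    assert(dst != src);  /* out-of-place transpose: buffers must differ */

    for (i = 0; i < dim; i++) {
        const int *ps = src + (size_t)i * dim;  /* start of source row i */

        /* Main body: four elements per iteration. */
        for (j = 0; j < limit; j += 4) {
            dst[(size_t)(j + 0) * dim + i] = ps[j + 0];
            dst[(size_t)(j + 1) * dim + i] = ps[j + 1];
            dst[(size_t)(j + 2) * dim + i] = ps[j + 2];
            dst[(size_t)(j + 3) * dim + i] = ps[j + 3];
        }

        /* Cleanup loop: at most 3 leftover elements when dim % 4 != 0. */
        for (; j < dim; j++) {
            dst[(size_t)j * dim + i] = ps[j];
        }
    }
}
```

The remainder is computed once per call, so a prime dim only costs a few extra iterations of the cleanup loop per row.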

Cache blocks wouldn't be very useful, because no matter what, we will be accessing one array linearly (1,2,3,4) while the other we will be accessing in jumps of N. While we can get the function to abuse the cache and access the src block faster, it will still take a long time to place those into the dst matrix.

I have also tried using pointers instead of array accessors, but I don't think that actually speeds up the program in any way.

Any help would be greatly appreciated.

Thanks

798

asked May 30 '12 05:05

Glen Takahashi

1 Answer

Cache blocking can be useful. For example, let's say we have a cache line size of 64 bytes (which is what x86 uses these days). Since sizeof(int) == 4, 16 ints fit in a cache line, assuming the matrix is aligned on a cache line boundary. Now take a matrix that is larger than the cache. If we transpose it one 16x16 block at a time, each block requires us to load 32 cache lines from memory (16 from the source matrix, plus 16 from the destination matrix, which have to be loaded before we can dirty them) and store another 16 lines (even though the stores are not sequential).

In contrast, without cache blocking, transposing the equivalent 16*16 = 256 elements requires us to load only 16 cache lines from the source matrix, but 256 cache lines have to be loaded and then stored for the destination matrix. Consecutive writes are dim ints apart, so each one lands in a different cache line, and for a large matrix that line has been evicted before the neighbouring element in it gets written.
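To make the idea concrete, here is a minimal sketch of a blocked transpose under the assumptions above (64-byte cache lines, 4-byte ints, hence a block edge of 16). The name transpose_blocked and the BLOCK constant are illustrative, and the best block size on a given machine is something to find by timing:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block edge: 64-byte cache line / 4-byte int = 16 ints. */
#define BLOCK 16

/* Hypothetical blocked variant of transpose(): walks the matrix in
 * BLOCK x BLOCK tiles so that the destination cache lines touched by
 * one tile are reused while they are still cached. */
void transpose_blocked(int *dst, const int *src, int dim) {
    int ii, jj, i, j;

    assert(dst != src);  /* out-of-place transpose: buffers must differ */

    for (ii = 0; ii < dim; ii += BLOCK) {
        /* Clamp the tile edge so dims that are not multiples of BLOCK work. */
        int i_end = (dim - ii < BLOCK) ? dim : ii + BLOCK;

        for (jj = 0; jj < dim; jj += BLOCK) {
            int j_end = (dim - jj < BLOCK) ? dim : jj + BLOCK;

            /* Transpose one tile: rows ii..i_end-1, columns jj..j_end-1. */
            for (i = ii; i < i_end; i++) {
                for (j = jj; j < j_end; j++) {
                    dst[(size_t)j * dim + i] = src[(size_t)i * dim + j];
                }
            }
        }
    }
}
```

The clamped tile edges handle the partial tiles at the right and bottom borders, so no separate check on dim is needed. The inner loop can additionally be unrolled once the blocking itself is shown to help.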

103

answered Oct 16 '22 18:10

janneb

Tags: c, loops, optimization, caching, matrix
