A Cache Efficient Matrix Transpose Program?

So the obvious way to transpose a matrix is to use:

  for( int i = 0; i < n; i++ )
    for( int j = 0; j < n; j++ )
      destination[j+i*n] = source[i+j*n];

but I want something that will take advantage of locality and cache blocking. I looked it up and couldn't find code that does this, but I'm told it should be a very simple modification to the original. Any ideas?

Edit: I have a 2000x2000 matrix, and I want to know how I can change the code that uses two for loops so that it splits the matrix into blocks that I transpose individually, say 2x2 blocks or 40x40 blocks, and then see which block size is most efficient.

Edit2: The matrices are stored in column-major order; that is to say, the matrix

a1 a2
a3 a4

is stored as a1 a3 a2 a4.

asked Mar 04 '11 23:03

user635832

1 Answer

You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:

    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < n; j += blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*n];
                }
            }
        }
    }
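If the block size does not divide n, the inner loops above would run past the edge of the matrix. A minimal sketch of one way to handle that is to clip each block at n. The function name `transpose_blocked`, the helper `min_size`, and the `double` element type are assumptions made for illustration; they are not from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the smaller of two sizes. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked out-of-place transpose of an n x n matrix.
 * Same indexing as the loops above: dst[k + l*n] = src[l + k*n].
 * blocksize need not divide n; blocks at the edge are clipped to n.
 * Element type double is an assumption. */
void transpose_blocked(double *dst, const double *src, size_t n, size_t blocksize)
{
    assert(blocksize > 0); /* a zero block size would never advance */

    for (size_t i = 0; i < n; i += blocksize) {
        size_t i_end = min_size(i + blocksize, n);
        for (size_t j = 0; j < n; j += blocksize) {
            size_t j_end = min_size(j + blocksize, n);
            /* transpose the (possibly clipped) block beginning at [i,j] */
            for (size_t k = i; k < i_end; ++k) {
                for (size_t l = j; l < j_end; ++l) {
                    dst[k + l*n] = src[l + k*n];
                }
            }
        }
    }
}
```

To compare block sizes such as 2 and 40 on a 2000x2000 matrix, one option is to wrap calls to this function in `clock()` from `<time.h>` and repeat each measurement a few times, since a single run can be noisy.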

An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.

Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
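For what it's worth, here is a rough sketch of the recursive approach, not a tuned implementation. It halves the longer side of the current sub-block until the block is small, then copies it directly. The names `transpose_rec` and `transpose_oblivious`, the `double` element type, and the `BASE_CASE` cutoff of 16 are all assumptions; a strictly cache-oblivious version would recurse all the way down to single elements, and the cutoff is only there to limit function-call overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cutoff: sub-blocks this small are copied directly. */
#define BASE_CASE 16

/* Transpose the sub-block with k in [k0, k1) and l in [l0, l1),
 * using the same indexing as above: dst[k + l*n] = src[l + k*n]. */
static void transpose_rec(double *dst, const double *src, size_t n,
                          size_t k0, size_t k1, size_t l0, size_t l1)
{
    size_t klen = k1 - k0;
    size_t llen = l1 - l0;

    if (klen <= BASE_CASE && llen <= BASE_CASE) {
        /* small enough: plain transpose-copy of this sub-block */
        for (size_t k = k0; k < k1; ++k)
            for (size_t l = l0; l < l1; ++l)
                dst[k + l*n] = src[l + k*n];
    } else if (klen >= llen) {
        /* split the longer side (k range) in half */
        size_t mid = k0 + klen / 2;
        transpose_rec(dst, src, n, k0, mid, l0, l1);
        transpose_rec(dst, src, n, mid, k1, l0, l1);
    } else {
        /* split the l range in half */
        size_t mid = l0 + llen / 2;
        transpose_rec(dst, src, n, k0, k1, l0, mid);
        transpose_rec(dst, src, n, k0, k1, mid, l1);
    }
}

/* Out-of-place transpose of an n x n matrix; dst and src must not alias. */
void transpose_oblivious(double *dst, const double *src, size_t n)
{
    assert(dst != src);
    transpose_rec(dst, src, n, 0, n, 0, n);
}
```

Note that there is no block-size parameter to tune here, which is the point of the cache-oblivious formulation; the cutoff only affects the constant overhead of the recursion.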

answered Oct 04 '22 14:10

Steve Jessop

Tags: algorithm, caching, matrix
