
Improve C function performance with cache locality?

I have to find the diagonal difference in a matrix represented as a 2D array, and the function prototype is:

int diagonal_diff(int x[512][512])

I have to use a 2D array, and the data is 512x512. This is tested on a SPARC machine; my current timing is 6 ms, but I need to be under 2 ms.

Sample data:

[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]

The difference is:

|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16

In order to do that, I use the following algorithm:

int i, j, result = 0;
for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
        result += abs(array[i][j] - array[j][i]);  /* needs <stdlib.h> for abs() */

return result;

But this algorithm keeps alternating between row and column accesses, which makes inefficient use of the cache.

Is there a way to improve my function?

Christoper Hans asked Oct 12 '11


People also ask

Why can good locality help improve caching performance?

The idea of caching the useful data centers around a fundamental property of computer programs known as locality. Programs with good locality tend to access the same set of data items over and over again from the upper levels of the memory hierarchy (i.e. cache) and thus run faster.

Do arrays have better cache locality, and can that make a big difference in performance?

Also, better cache locality in arrays (due to contiguous memory allocation) can significantly improve performance. As a result, some operations (such as modifying a certain element) are faster in arrays, while others (such as inserting/deleting an element in the data) are faster in linked lists.
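As a rough illustration of that difference, here is a minimal sketch contrasting the two traversals (the node type is hypothetical, for illustration only):

#include <stddef.h>

/* Hypothetical list node, for illustration only. */
struct node { int value; struct node *next; };

/* Array sum: elements are contiguous, so each cache line fetched
   brings in several upcoming elements. */
int sum_array(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* List sum: nodes may be scattered across the heap, so each
   pointer hop can be a fresh cache miss. */
int sum_list(const struct node *head)
{
    int sum = 0;
    for (const struct node *p = head; p; p = p->next)
        sum += p->value;
    return sum;
}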

Why does an array exhibit good cache locality?

When you access memory, chunks of it are cached at various levels. Cache locality refers to the likelihood that successive operations hit the cache and are therefore faster. In an array, you maximize the chances of sequential element accesses being in the cache.
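For instance, with the 512x512 int matrix from the question, both loops below read the same elements, but the first walks memory sequentially while the second jumps a full row (2 KB with 4-byte ints) between consecutive accesses (a minimal sketch):

/* Row-major traversal: matches the memory layout, so consecutive
   accesses usually hit the same cache line. */
long sum_row_major(int m[512][512])
{
    long sum = 0;
    for (int i = 0; i < 512; i++)
        for (int j = 0; j < 512; j++)
            sum += m[i][j];
    return sum;
}

/* Column-major traversal: each access lands 512 * sizeof(int)
   bytes away from the last, so nearly every access can start a
   new cache line. */
long sum_col_major(int m[512][512])
{
    long sum = 0;
    for (int j = 0; j < 512; j++)
        for (int i = 0; i < 512; i++)
            sum += m[i][j];
    return sum;
}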

What is cache locality in data structure?

A cache is a simple example of exploiting temporal locality, because it is a specially designed, faster but smaller memory area, generally used to keep recently referenced data and data near recently referenced data, which can lead to potential performance increases.
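A small, hypothetical example of temporal locality is a byte histogram: the 256-entry table is touched on every iteration, so it stays cache-resident while the large input streams through once (a minimal sketch):

#include <stddef.h>

/* counts[] must be zeroed by the caller. The table is reused on
   every iteration (temporal locality), while data[] is read once,
   sequentially (spatial locality). */
void byte_histogram(const unsigned char *data, size_t n, int counts[256])
{
    for (size_t i = 0; i < n; i++)
        counts[data[i]]++;
}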


1 Answer

EDIT: Why is a block-oriented approach faster? We take advantage of the CPU's data cache by making sure each block fits entirely into the cache, so we can iterate over a block either by row or by column without evicting data we still need.

For example, if you have a 32-byte cache line and a 4-byte int, an 8x8 int matrix fits into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed not to thrash the cache. Another way to think about it: if your matrix fits in the cache, you can traverse it any way you want.
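That arithmetic can even be checked at compile time; the sketch below assumes the 32-byte line and 4-byte int from above, and the macro names are made up:

/* An 8x8 block of 4-byte ints is 8 * 8 * 4 = 256 bytes,
   i.e. exactly 8 cache lines of 32 bytes. The typedef fails to
   compile (negative array size) if the assumption doesn't hold. */
#define CACHE_LINE_BYTES 32   /* assumed line size */
#define BLOCK            8    /* 8x8 block */
typedef char block_is_8_lines
    [sizeof(int[BLOCK][BLOCK]) == 8 * CACHE_LINE_BYTES ? 1 : -1];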

If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal such that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of the layout of the matrix, you will almost always miss the cache on every element you visit.

A block-oriented approach ensures that you only take a cache miss for data you will eventually visit, before the CPU has to flush that cache line. In other words, a block-oriented approach tuned to the cache line size ensures you don't thrash the cache.

So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:

#include <stdlib.h>  /* abs() */

int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i, j, block_i, block_j, result = 0;

    /* Sum the blocks above the diagonal. Each block pair
       (block_i, block_j) touches both array[r][c] and array[c][r],
       so one pass over the upper blocks covers every off-diagonal
       pair once. */
    for (block_i = 0; block_i < 512; block_i += block_size)
        for (block_j = block_i + block_size; block_j < 512; block_j += block_size)
            for (i = 0; i < block_size; i++)
                for (j = 0; j < block_size; j++)
                    result += abs(array[block_i + i][block_j + j] - array[block_j + j][block_i + i]);

    /* Double the total so each off-diagonal pair is counted twice,
       matching the original loop over the full matrix. */
    result += result;

    /* Sum the blocks on the diagonal. Only the elements above each
       block's own diagonal are visited, so count each pair twice
       here as well. */
    for (int block_offset = 0; block_offset < 512; block_offset += block_size)
    {
        for (i = 0; i < block_size; ++i)
        {
            for (j = i + 1; j < block_size; ++j)
            {
                int value = abs(array[block_offset + i][block_offset + j] - array[block_offset + j][block_offset + i]);
                result += value + value;
            }
        }
    }

    return result;
}

You should experiment with various values for block_size. On my machine, 8 led to the biggest speedup (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). The block_size should ideally be cache_line_size_in_bytes/sizeof(int).
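A minimal driver to reproduce that experiment might look like the following (a sketch; clock() resolution and the actual timings are machine-dependent):

#include <stdio.h>
#include <time.h>

int sum_diagonal_difference(int array[512][512], int block_size);

int main(void)
{
    static int m[512][512];

    /* Deterministic fill so every run computes the same result. */
    for (int i = 0; i < 512; i++)
        for (int j = 0; j < 512; j++)
            m[i][j] = (i * 512 + j) % 97;

    /* Every power-of-two block_size up to 64 divides 512 evenly;
       the result must be identical for all of them. */
    for (int bs = 1; bs <= 64; bs *= 2) {
        clock_t start = clock();
        int result = sum_diagonal_difference(m, bs);
        double ms = 1000.0 * (clock() - start) / CLOCKS_PER_SEC;
        printf("block_size=%2d result=%d time=%.3f ms\n", bs, result, ms);
    }
    return 0;
}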

MSN answered Sep 22 '22