Optimizing C loops to get diagonal of array

Tags: c, loops, optimization, shark

The Great God Google has not been forthcoming with an explanation for some loop optimization issues. So, in sadness that I have insufficient Google-fu, I turn to you, StackOverflow.

I'm optimizing a C program for solving a specific system of differential equations. During the course of finding the numerical solution I call a function that sets up a linear system of equations and then a function that solves it.

The solution function originally had a bottleneck during the access of the elements on the diagonal of the array that defines the linear system. So I included a 1-D array that is set during the initialization of the system that holds the values along the diagonal of the array.

For fun I continued to play with the code that initialized the diagonal elements, measuring the time it took and trying to continually improve the code. The versions I tried led to a few questions:

Note: I put all the versions I tried into one function and profiled this function to see where the time was being spent. I'll report the execution time of each version as a percentage of the total time spent in that function. The function was evaluated several million times. Smaller numbers are better.

Relevant declarations of data used in code:

/* quick definitions of the relevant variables, data is a struct */

static const int sp_diag_ind[98] = {2,12,23,76,120,129,137,142,.../* long list */};

double *spJ = &(data->spJ[0]);
/* data has double spJ[908] that represents a sparse matrix stored in triplet
*  form, I grab the pointer because I've found it to be more 
*  efficient than referencing data->spJ[x] each time I need it
*/

int iter,jter;
double *diag_data = NV_DATA_S(data->J_diag);
/* data->J_diag has a content field that has an array double diag_data[150]
*  NV_DATA_S is a macro to return the pointer to the relevant data
*/

My original loop for initializing diag_data. Timing was 16.1% of evaluation (see note).

/* try 1 */
for (iter = 0; iter<3; iter++) {
    diag_data[iter] = 0; 
}
jter = 0;
for (iter = 3; iter<101; iter++) { // unaligned loop start
    diag_data[iter] = spJ[sp_diag_ind[jter]];
    jter++; // heavy line for loop
}

for (iter = 101; iter<150; iter++) {
    diag_data[iter] = 0; 
}

To summarize, we grab the pointer to the diagonal, set some components to zero (this is not optional given the algorithm I'm using), then grab the values that reside on the diagonal of the "array" represented in sparse form by spJ. Since spJ is a 1-D array of the 908 non-zeros of a (mostly zero) 150x150 array, we must use a lookup to find the positions of the diagonal elements in spJ. This lookup is defined by the 98-element array sp_diag_ind.
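The zero-then-gather pattern described above can be sketched on its own. This is only an illustration under assumptions: `fill_diag` and its parameter names are hypothetical and do not appear in the original program, and the sizes are passed in rather than fixed at 150, 908 and 98.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original program).
 * Zeroes all n_diag entries of diag, then copies the n_ind diagonal
 * values out of the sparse value array sp, using ind as the lookup
 * table, into diag[offset], diag[offset + 1], ...
 * The caller must ensure offset + n_ind <= n_diag. */
static void fill_diag(double *diag, size_t n_diag,
                      const double *sp, const int *ind,
                      size_t n_ind, size_t offset)
{
    size_t i;

    /* Entries that have no stored diagonal value must stay zero. */
    for (i = 0; i < n_diag; i++) {
        diag[i] = 0.0;
    }

    /* Gather: ind[i] is the position of the i-th diagonal value in sp. */
    for (i = 0; i < n_ind; i++) {
        diag[offset + i] = sp[ind[i]];
    }
}
```

For example, with `sp[i] = 10 * i` for twelve entries and a made-up lookup table `{2, 5, 7, 11}`, the call `fill_diag(diag, 8, sp, ind, 4, 3)` leaves `diag` holding 0, 0, 0, 20, 50, 70, 110, 0.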

I tried to remove the use of jter because the profile showed that incrementing it was not free. The middle loop of my second try:

for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_data[iter+3] = spJ[sp_diag_ind[iter]];
}

This improved things a bit. Timing was 15.6% for this version. But when I look at the Shark analysis of this code (Shark is a profiling tool that comes with Xcode on a Mac), it warns me that this is an unaligned loop.

The third attempt to improve was by removing the "zeroing" loops and using memset to zero diag_data:

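/* diag_data is a double *, so sizeof(diag_data) would only be the size of
 * the pointer; clear all 150 elements by element count instead */
memset(diag_data, 0, 150 * sizeof(*diag_data));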

for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_data[iter+3] = spJ[sp_diag_ind[iter]];
}

This was timed at 14.9%. Not being sure what an unaligned loop is, I continued to fiddle. I found an improved, fourth implementation, doing the alignment offset between diag_data and spJ[crazy index] with a pointer:

realtype * diag_mask = &diag_data[3];
for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_mask[iter] = spJ[sp_diag_ind[iter]];
}

Using diag_mask allowed for a small improvement in speed. It came in at 13.1%.

Edit: Turns out this section was sillier than I had originally thought. The use of iter is undefined. Props to @caf and @rlibby for catching it.

Lastly, I tried something I thought was silly:

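/* diag_data is a double *, so sizeof(diag_data) would only be the size of
 * the pointer; clear all 150 elements by element count instead */
memset(diag_data, 0, 150 * sizeof(*diag_data));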

for (iter = 0; iter<98;) {
    diag_mask[iter] = spJ[sp_diag_ind[iter++]];
}

This was timed at 10.9%. Also, Shark doesn't issue the unaligned loop warning when I look at the annotated source code. End of silly section.
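Since `diag_mask[iter] = spJ[sp_diag_ind[iter++]]` both reads and modifies iter with no sequence point in between, its behavior is undefined and its timing can't be trusted. A possible well-defined way to fold the increment into the copy is to step two separate pointers. This is a minimal sketch, not the code that was profiled: `fill_diag_defined` and its parameters are hypothetical names, and the memset is sized from an element count because the destination is a pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original program).
 * Clears n_diag entries of diag, then copies n_ind values from sp,
 * looked up through ind, into diag starting at diag[offset].
 * The caller must ensure offset + n_ind <= n_diag. */
static void fill_diag_defined(double *diag, size_t n_diag,
                              const double *sp, const int *ind,
                              size_t n_ind, size_t offset)
{
    double *dst = diag + offset;
    const int *cur = ind;
    const int *end = ind + n_ind;

    /* diag is a pointer, so sizeof diag would be the pointer size;
     * the byte count is computed from the element count instead.
     * All-bits-zero is 0.0 for IEEE 754 doubles. */
    memset(diag, 0, n_diag * sizeof *diag);

    /* dst and cur are different objects and each is modified exactly
     * once per statement, so the evaluation order does not matter. */
    while (cur != end) {
        *dst++ = sp[*cur++];
    }
}
```

Whether this is any faster than the plain indexed loop would have to be measured; an optimizing compiler may well generate the same code for both.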

So, my questions:

1. What's an unaligned loop?
2. Why is the fifth implementation aligned and the fourth not?
3. Is having an aligned loop responsible for the improvement in execution speed between my fourth and fifth implementations, or is embedding the increment step in the lookup of the value of sp_diag_ind responsible?
4. Do you see any other improvements I can make?

Thanks for the help.

--Andrew

asked Feb 26 '11 03:02

Sevenless

1 Answer

An unaligned loop is one where the first instruction doesn't start on a particular address boundary (typically a multiple of 16 or 32 bytes). There should be a compiler flag to align loops; it may or may not help performance. Whether a loop ends up aligned without the flag depends on exactly which instructions come before it, so it's not predictable.

One other optimization you could try is to mark diag_mask, spJ, and sp_diag_ind as restrict (a C99 feature). That indicates that they do not alias and might help the compiler optimize the loop better. A count of 98 will probably be too small to see any effect, though.
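As a rough sketch of the restrict suggestion, the gather loop could be moved into a function whose pointer parameters are restrict-qualified. The function name `gather_diag` and its signature are assumptions made for illustration, not part of the original code, and it needs a C99 (or later) compiler. On the alignment side, GCC's option for requesting loop alignment is `-falign-loops`; other compilers may use different names.

```c
#include <stddef.h>

/* Hypothetical function (not from the original program).
 * The restrict qualifiers are a promise from the caller that the memory
 * written through dst is not also reached through sp or ind inside this
 * function, so the compiler does not have to assume that a store to
 * dst[i] could change a later sp[...] or ind[...] load. */
static void gather_diag(double *restrict dst,
                        const double *restrict sp,
                        const int *restrict ind,
                        size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = sp[ind[i]];
    }
}
```

With the names from the question, the call would look like `gather_diag(diag_mask, spJ, sp_diag_ind, 98)`. Passing overlapping arrays would break the promise and make the behavior undefined, so this only applies when the three buffers really are distinct.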

answered Oct 23 '22 01:10

Jeremiah Willcock
