
Optimizing C loops to get diagonal of array

The Great God Google has not been forthcoming to me with an explanation for some loop optimization issues. So, in sadness that I have insufficient Google-fu, I turn to you StackOverflow.

I'm optimizing a C program for solving a specific system of differential equations. During the course of finding the numerical solution I call a function that sets up a linear system of equations and then a function that solves it.

The solution function originally had a bottleneck during the access of the elements on the diagonal of the array that defines the linear system. So I included a 1-D array that is set during the initialization of the system that holds the values along the diagonal of the array.

For fun I continued to play with the code that initialized the diagonal elements, measuring the time it took and trying to continually improve the code. The versions I tried led to a few questions:

Note: I put all the versions I tried into one function and profiled this function to see where the time was being spent. I'll report execution time for a version in percent of total time in function. The function was evaluated several million times. Smaller number is better.

Relevant declarations of data used in code:

/* quick definitions of the relevant variables, data is a struct */

static const int sp_diag_ind[98] = {2,12,23,76,120,129,137,142,.../* long list */};

double *spJ = &(data->spJ[0]);
/* data has double spJ[908] that represents a sparse matrix stored in triplet
*  form, I grab the pointer because I've found it to be more 
*  efficient than referencing data->spJ[x] each time I need it
*/

int iter,jter;
double *diag_data = NV_DATA_S(data->J_diag);
/* data->J_diag has a content field that has an array double diag_data[150]
*  NV_DATA_S is a macro to return the pointer to the relevant data
*/

My original loop for initializing diag_data. Timing was 16.1% of evaluation (see note).

/* try 1 */
for (iter = 0; iter<3; iter++) {
    diag_data[iter] = 0; 
}
jter = 0;
for (iter = 3; iter<101; iter++) { // unaligned loop start
    diag_data[iter] = spJ[sp_diag_ind[jter]];
    jter++; // heavy line for loop
}

for (iter = 101; iter<150; iter++) {
    diag_data[iter] = 0; 
}

To summarize, we grab the pointer to the diagonal, set some of the components to zero (this is not optional, given the algorithm I'm using), then grab the values that reside on the diagonal of the "array" represented in sparse form by spJ. Since spJ is a 1-D array of the 908 non-zeros of a (mostly zero) 150x150 array, we must use a lookup to find the positions of the diagonal elements in spJ. This lookup is defined by the 98-element array sp_diag_ind.
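The lookup described above can be sketched as a small standalone helper. This is only an illustration under stated assumptions: the function name gather_diagonal and its parameters are hypothetical and not part of the original program. In the question's terms, n would be 150, n_diag would be 98, and offset would be 3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original program): fill a dense
 * diagonal vector from the value array of a sparse matrix.
 *
 *   diag     : output vector of length n
 *   values   : stored non-zero values of the sparse matrix
 *   diag_ind : diag_ind[k] is the position in values[] of the k-th
 *              stored diagonal entry
 *   n_diag   : number of stored diagonal entries
 *   offset   : index in diag[] of the first stored diagonal entry
 */
void gather_diagonal(double *diag, size_t n,
                     const double *values,
                     const int *diag_ind, size_t n_diag,
                     size_t offset)
{
    assert(offset + n_diag <= n);

    /* Zero the whole vector first; all-bits-zero is 0.0 for IEEE 754 doubles. */
    memset(diag, 0, n * sizeof *diag);

    /* Indirect copy: one table lookup per stored diagonal entry. */
    for (size_t k = 0; k < n_diag; k++) {
        diag[offset + k] = values[diag_ind[k]];
    }
}
```

Calling gather_diagonal(diag_data, 150, spJ, sp_diag_ind, 98, 3) would correspond to the zero-then-copy pattern used in the attempts that follow.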

I tried to remove the use of jter because the profile showed that incrementing it was not free. The middle loop of my second try:

for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_data[iter+3] = spJ[sp_diag_ind[iter]];
}

This improved things a bit. Timing was 15.6% for this version. But when I look at the Shark analysis of this code (Shark is a profiling tool that comes with Xcode on a Mac), it warns me that this is an unaligned loop.

The third attempt to improve was by removing the "zeroing" loops and using memset to zero diag_data:

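/* diag_data is a pointer, so sizeof(diag_data) would be the size of the
 * pointer, not of the 150-element array; size the call by element count */
memset(diag_data, 0, 150 * sizeof *diag_data);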

for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_data[iter+3] = spJ[sp_diag_ind[iter]];
}

This was timed at 14.9%. Not being sure what an unaligned loop is, I continued to fiddle. I found an improved, fourth implementation, doing the alignment offset between diag_data and spJ[crazy index] with a pointer:

realtype * diag_mask = &diag_data[3];
for (iter = 0; iter<98; iter++) { // unaligned loop start
    diag_mask[iter] = spJ[sp_diag_ind[iter]];
}

Using diag_mask allowed for a small improvement in speed. It came in at 13.1%.

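Edit: Turns out this section was sillier than I had originally thought. The use of iter in the loop below is undefined behavior: it is read as an index on the left-hand side and modified by iter++ on the right-hand side, with nothing ordering the two. Props to @caf and @rlibby for catching it.

Lastly, I tried something I thought was silly: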

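/* diag_data is a pointer: size the memset by element count, not sizeof(diag_data) */
memset(diag_data, 0, 150 * sizeof *diag_data);

for (iter = 0; iter<98;) {
    /* undefined behavior: iter is read on the left and modified by iter++ on the right */
    diag_mask[iter] = spJ[sp_diag_ind[iter++]];
}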

This was timed at 10.9%. Also, Shark doesn't issue the unaligned loop warning when I look at the annotated source code. End silly section
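For comparison, here is a sketch of how the increment can be folded into the copy statement without undefined behavior. It walks two separate pointers, so each object is modified exactly once per statement. The function name gather_ptr and its parameters are hypothetical, and there is no claim that this version is faster than the fourth attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original program): indirect copy
 * dst[k] = src[idx[k]] for k in [0, n), written with pointer increments. */
void gather_ptr(double *dst, const double *src, const int *idx, size_t n)
{
    assert(dst != NULL && src != NULL && idx != NULL);

    const int *end = idx + n;
    while (idx != end) {
        /* Well defined: dst and idx are distinct objects, and each one
         * is modified only once in this statement. */
        *dst++ = src[*idx++];
    }
}
```

With the names from the question, the call would be gather_ptr(diag_mask, spJ, sp_diag_ind, 98).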

So, my questions:

  1. What's an unaligned loop?
  2. Why is the fifth implementation aligned and the fourth not?
  3. Is having an aligned loop responsible for the improvement in execution speed between my fourth and fifth implementations, or is embedding the increment step in the lookup of the value of sp_diag_ind responsible?
  4. Do you see any other improvements I can make?

Thanks for the help.

--Andrew

Sevenless asked Feb 26 '11 03:02




1 Answer

An unaligned loop is one whose first instruction does not start on a particular address boundary (typically a multiple of 16 or 32 bytes). There should be a compiler flag to align loops; it may or may not help performance. Without such a flag, whether a loop ends up aligned depends on exactly which instructions come before it, so it is not predictable.

One other optimization you could try is to mark diag_mask, spJ, and sp_diag_ind as restrict (a C99 feature). That tells the compiler they do not alias, which might help it optimize the loop better. A count of 98 will probably be too small to see any effect, though.
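A minimal sketch of the restrict suggestion, assuming a C99 compiler. Since restrict can only qualify pointers, the sketch passes the index table as a pointer parameter; the function name gather_restrict and its parameters are hypothetical. On the alignment side, GCC documents an -falign-loops option; whether the toolchain used in the question offers an equivalent is an assumption to check against its own documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original program).
 * The restrict qualifiers promise the compiler that dst, src and idx
 * refer to non-overlapping storage for the duration of the call.
 * Passing overlapping arrays would break that promise and make the
 * behavior undefined. */
void gather_restrict(double *restrict dst,
                     const double *restrict src,
                     const int *restrict idx,
                     size_t n)
{
    assert(dst != NULL && src != NULL && idx != NULL);

    for (size_t k = 0; k < n; k++) {
        dst[k] = src[idx[k]];
    }
}
```

In the question's setting this would be called as gather_restrict(diag_mask, spJ, sp_diag_ind, 98), which is only valid because diag_data, spJ and sp_diag_ind are separate arrays.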

Jeremiah Willcock answered Oct 23 '22 01:10