Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Running over an unrolled linked list takes around 40% of the code runtime - are there any obvious ways to optimise it?

I am the author of an open source scientific code called vampire ( http://github.com/richard-evans/vampire ), and being compute intensive means any improvement in code performance can significantly increase the amount of research that can be done. Typical runtimes of this code can be hundreds of core hours, so I am always looking for ways to improve the performance critical sections of the code. However, I have come a bit stuck with the following bit of relatively innocuous looking bit of code, which makes up around 40% of the runtime:

for (int atom = start_index; atom < end_index; atom++){
    register double Hx = 0.0;
    register double Hy = 0.0;
    register double Hz = 0.0;
    const int start = atoms::neighbour_list_start_index[atom];
    const int end = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++){
        const int natom = atoms::neighbour_list_array[nn];
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        Hx -= Jij * atoms::x_spin_array[natom];
        Hy -= Jij * atoms::y_spin_array[natom];
        Hz -= Jij * atoms::z_spin_array[natom];
    }
    atoms::x_total_spin_field_array[atom] += Hx;
    atoms::y_total_spin_field_array[atom] += Hy;
    atoms::z_total_spin_field_array[atom] += Hz;
}

The high level overview of the function and variables of this code is as follows: There is a 1D array of a physical vector ( split into three 1D arrays for each component x,y,z for memory caching purposes, atoms::x_spin_array, etc) called 'spin'. Each of these spins interact with some other spins, and all the interactions are stored as a 1D neighbour list (atoms::neighbour_list_array). The relevant list of interactions for each atom is determined by a start and end index of the neighbor listarray in two separate arrays. At the end of the calculation the each atomic spin has an effective field which is the vector sum of the interactions.

Given the small amount of code and the sizable fraction of the runtime it occupies I have done by best, but I feel there must be a way to optimize this further, but as a physicist rather than a computer scientist maybe I am missing something?

like image 785
rfle500 Avatar asked Jul 22 '13 18:07

rfle500


2 Answers

You've got a constant stream of multiply, subtract and adds on contiguous data. That's seems like an ideal use of SSE. If it's memory bandwidth limited, then OpenCL/CUDA instead.

Try using this library if you aren't familiar with all the low level instructions.

That inner loop could potentially then be restructured significantly maybe leading to speed ups.

like image 115
Delta_Fore Avatar answered Sep 18 '22 16:09

Delta_Fore


If the x, y, z components are indeed linked lists, doing x[i], y[i] and z[i] will cause the lists to be traversed multiple times, giving (n^2)/2 iterations. Using vectors will make this an O(1) operation.

You mention that the three coordinates are split out for memory caching purposes, but this will affect the Level 1 and Level 2 cache locality as you are accessing 3 different areas in memory. The linked list is also impacting your cache locality.

Using something like:

struct vector3d {
    double x;
    double y;
    double z;
};

std::vector<vector3d> spin;
std::vector<vector3d> total_spin;

This should improve the cache locality, as the x, y and z values are adjacent in memory and the spins occupy a linear block of memory.

like image 39
reece Avatar answered Sep 22 '22 16:09

reece