I am trying to speed up a sparse matrix-vector product using OpenMP. The code is as follows:
void zAx(double * z, double * data, long * colind, long * row_ptr, double * x, int M){
    long i, j, ckey;
    int chunk = 1000;
    //int * counts[8]={0};
    #pragma omp parallel num_threads(8)
    {
        #pragma omp for private(ckey,j,i) schedule(static,chunk)
        for (i=0; i<M; i++ ){
            z[i]=0;
            for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
                j = colind[ckey];
                z[i] += data[ckey]*x[j];
            }
        }
    }
}
Now, this code runs fine and produces the correct result, but it only gives me a speed-up of ~30%. I have checked that the threads are all getting about the same number of non-zero elements (they are), and the matrix is fairly large (300,000 x 300,000), so I'm hoping the overhead isn't the only issue. I've also tried running with different chunk sizes and thread counts, and I get similar performance.
Is there something else I could try to get a bit of extra speed out of this? Or something I'm obviously doing wrong?
Cheers.
Edit: just commented out '//int * counts[8]={0}' because it was leftover from counting the work allocation. Not needed.
Edit2 (more details):
Okay, so I timed a loop calling this 5000 times and got the average times:
The matrix is of size 303544x303544 and has 2122980 non-zero elements.
With a much smaller matrix (30000x30000) I get times that go more like
So it seems the large size may be my issue.
Welcome to the wonderful world of memory-bound problems. To relieve you of your pain, I would like to inform you that sparse matrix-vector multiplication is one of the many things that cannot be effectively parallelised or even vectorised on a single multi-core chip, unless all of the data fits in the last-level cache or the memory bus is really, really wide.
Why? Simply because the ratio of computation to memory access is extremely low. For each iteration of the inner loop you fetch the column index into j (8 bytes), the matrix element from data (8 bytes), the value of the vector element (8 bytes), and the previous value of the result (8 bytes, since compilers rarely optimise access to shared variables). Then you perform 2 very fast floating-point operations (FLOPs) and a store (although the += operator is translated to a single instruction, it is still a "fetch-modify-write" one). In total you load 32 bytes and perform 2 FLOPs over them. That makes 1/16 FLOPs per byte.
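To make that accounting concrete, here is the same inner loop again with the per-iteration traffic annotated (just a commented restatement of the code from the question, nothing new):

for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
    j = colind[ckey];          /* load colind[ckey]:  8 bytes (long)           */
    z[i] += data[ckey]*x[j];   /* load data[ckey]:    8 bytes (double)         */
                               /* load x[j]:          8 bytes (double)         */
                               /* re-load z[i]:       8 bytes (double)         */
                               /* 1 multiply + 1 add = 2 FLOPs, then the store */
}
/* 32 bytes of loads for 2 FLOPs => 1/16 FLOPs per byte */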
A modern SSE-capable CPU core can perform 4 double-precision FLOPs/cycle, which often works out to something like 8 GFLOPS per CPU core (assuming a base frequency of 2 GHz). With AVX the number is doubled, so you get up to 16 GFLOPS per core on a 2 GHz Intel Sandy/Ivy Bridge or AMD equivalent. In order to saturate this processing power with data, given the 1/16 FLOPs/byte ratio, you would need at least 8 GFLOPS x 16 bytes/FLOP = 128 GiB/s of memory bandwidth per core.
A high-end Nehalem-EX processor like the Xeon X7560 runs at 2.26 GHz (9.04 GFLOPS/core) and its shared L3 cache (L1 and L2 caches are per-core) delivers about 275 GiB/s. At 9.04 GFLOPS/core, you need 144.64 GiB/s per core in order to feed the inner loop of your zAx routine. This means that in the ideal case, the L3 cache of this CPU cannot feed more than 2 fully vectorised multiplication kernels.
Without SSE vectorisation the FLOP rate for double precision is half of that, so one could expect the problem to scale up to 4 threads. Things get extremely bad once your problem becomes larger than the L3 cache, as the memory bus delivers about ten times less bandwidth.
Try the following version of the inner loop just to see if the compiler is smart enough to follow the relaxed memory view of OpenMP:
#pragma omp for private(ckey,j) schedule(static,chunk)
for (i=0; i<M; i++){
    double zi = 0.0;
    for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
        j = colind[ckey];
        zi += data[ckey]*x[j];
    }
    z[i] = zi;
}
Unfortunately there is nothing more that you can do. Sparse matrix-vector multiplication scales with the number of CPU sockets, not with the number of CPU cores. You would need a multi-socket system with separate memory controllers, e.g. any system with more than one (post-)Nehalem or AMD64 processor.
Edit: optimisation hint. Do you really need long to store the column index and the row pointers? With 2122980 non-zero elements you would be pretty fine using int instead. It would save loading 4 bytes per element in the inner loop and another 4 bytes per row in the outer loop.
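For illustration, here is a sketch of what the routine might look like with 32-bit indices and the private accumulator combined (it assumes you can regenerate the CSR arrays as int, which holds as long as the non-zero count, 2122980 here, fits in an int):

void zAx_int(double *z, const double *data, const int *colind,
             const int *row_ptr, const double *x, int M)
{
    #pragma omp parallel for schedule(static,1000)
    for (int i = 0; i < M; i++) {
        double zi = 0.0;                            /* private accumulator, no repeated z[i] loads */
        for (int ckey = row_ptr[i]; ckey < row_ptr[i+1]; ckey++)
            zi += data[ckey] * x[colind[ckey]];     /* 4-byte index loads instead of 8-byte longs  */
        z[i] = zi;
    }
}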
I can't write this in a comment, so I'll post it as an answer. I think this is the problem, but I'm not 100% sure.
Shared variables across threads can cause issues. I don't think that is the problem here, but it might be. Usually it only matters when writing, but if there are no locks it will just result in corrupt data. I'm not sure whether OpenMP does any locking internally or not.
Chances are your threads are stalling due to locks, which is the only reason multi-threading would run this much slower in proportion to a single thread. That, or it is not your code at all. It is best to test it on a small data set that fits entirely in memory, with no potential for bottlenecks (so that all you're doing is processing the data, and you're timing only the zAx function).
0.3M^2 = 90 billion entries, which means you are definitely going to have issues with either paging or loading the files (and at 4 bytes per int, that's four times that figure in bytes).
A better approach might be to operate on an X-sized piece of the matrix while the disk loads the next Y-sized piece in parallel. By choosing X and Y correctly, you will not have much reduction in speed. If you load 8 GB, process it, then load 8 GB more, you have to wait each time for the data to load.
You can make the processing intelligent by choosing X and Y = (8GB - X) and by monitoring how much time the processing and the file loading spend sitting idle; the sketch below shows the basic overlap pattern.
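A minimal sketch of that overlap pattern (load_chunk and process_chunk are hypothetical placeholders for your own I/O and zAx-style computation; only the double-buffering structure is the point here):

#define NCHUNKS 16

extern void load_chunk(int idx, double *buf);           /* hypothetical: read chunk idx from disk */
extern void process_chunk(int idx, const double *buf);  /* hypothetical: run the SpMV work on it  */

void stream_and_process(double *bufA, double *bufB)
{
    double *cur = bufA, *next = bufB;
    load_chunk(0, cur);                                 /* prime the first buffer */
    for (int c = 0; c < NCHUNKS; c++) {
        #pragma omp parallel sections
        {
            #pragma omp section
            process_chunk(c, cur);                      /* compute on the chunk already in memory */
            #pragma omp section
            {
                if (c + 1 < NCHUNKS)
                    load_chunk(c + 1, next);            /* overlap the next read with the compute */
            }
        }
        double *tmp = cur; cur = next; next = tmp;      /* swap buffers for the next iteration */
    }
}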
To check whether disk access is the problem, try using a smaller dataset and time only zAx to see if that helps. If it does, it's the disk. If not, it's the code.