I have what seems to be a very simple parallel for loop, which just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop gets. I assumed this was due to cache thrashing, so I played around with schedules, chunk sizes, __restrict__, nesting the parallel for inside a parallel block, and flushes. Then I noticed that reading the array to do a reduction is also slower.
This should obviously be very simple and should speed up nearly linearly. What am I missing here?
Full code:
#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>

void tic(), toc();

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads() << std::endl;

    std::cout << "serial reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    toc();

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();

    std::cout << "serial memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }
    toc();

    std::cout << "parallel memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    toc();

    return 0;
}

static clock_t ClockCounter;

void tic()
{
    ClockCounter = std::clock();
}

void toc()
{
    ClockCounter = std::clock() - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}
Running this yields:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 1790000
parallel reduction
elapsed clock ticks: 19690000
serial memset
elapsed clock ticks: 3860000
parallel memset
elapsed clock ticks: 20800000
If I run with -O2, g++ is able to optimize the serial reduction away and I get zero time, hence -O1. Also, adding omp_set_num_threads(1); makes the times more similar, although there are still some differences:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 1
serial reduction
elapsed clock ticks: 1770000
parallel reduction
elapsed clock ticks: 7370000
serial memset
elapsed clock ticks: 2290000
parallel memset
elapsed clock ticks: 3550000
This should be fairly obvious; I feel like I'm missing something very elementary. My CPU is an Intel(R) Xeon(R) E5-2640 0 @ 2.50GHz with hyperthreading, but the same behavior is observed on a colleague's i5 with 4 cores and no hyperthreading. We're both running Linux.
EDIT
It seems that one error was in the timing. Running with:
static double ClockCounter;

void tic()
{
    ClockCounter = omp_get_wtime(); //std::clock();
}

void toc()
{
    ClockCounter = omp_get_wtime() /*std::clock()*/ - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}
yields more "reasonable" times:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 1.80974
parallel reduction
elapsed clock ticks: 2.07367
serial memset
elapsed clock ticks: 2.37713
parallel memset
elapsed clock ticks: 2.23609
But still, there is no speedup; it is just not slower anymore.
EDIT2:
As suggested by user8046, the code is heavily memory bound, and as suggested by Z boson, the serial code is easily optimized away, so it is not certain what is being measured here. So I made a small change: I moved the sum outside of the loop, so that it does not get zeroed at every iteration over c. I also replaced the reduction operation with sum += F(vec[i]) and the memset operation with vec[i] = F(i). Running as:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 23.9106
parallel reduction
elapsed clock ticks: 3.35519
serial memset
elapsed clock ticks: 43.7344
parallel memset
elapsed clock ticks: 6.50351
Calculating the square root adds more work to the threads and there is finally some reasonable speedup (it is about 7x, which makes sense, as the hyperthreaded cores share memory lanes).
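For reference, the modified loops look roughly like this (a sketch of the change described above, reusing ptr, sz_i, COUNT and tic/toc from the full code; F is whatever the -D macro on the command line defines):

double sum = 0;                           // declared outside the repeat loop so it is never reset
tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < sz_i; ++ i)
        sum += F(ptr[i]);                 // reduction now does real work per element
}
toc();
std::cout << "sum: " << sum << std::endl; // printing the result also keeps the compiler from removing the loops

tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for
    for(int i = 0; i < sz_i; ++ i)
        ptr[i] = F(i);                    // the "memset" now computes a value per element
}
toc();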
You spotted the timing error. There is still no speedup because both of your test cases are heavily memory bound. On typical consumer hardware all of your cores share one memory bus, so using more threads does not give you more bandwidth, and since bandwidth is the bottleneck, there is no speedup. This will probably change if you reduce your problem size so that it fits into the cache, or certainly if you increase the number of calculations per data element, for example if you were calculating the reduction of exp(vec[i]) or 1/vec[i].
For the memset: you can saturate the memory with one thread, so you will never see a speedup there (only if you get access to a second memory bus with more threads, as on some multi-socket architectures).
One remark regarding the reduction: it is most probably not implemented with a lock, which would be horribly inefficient, but with an addition tree, which has a not-so-bad logarithmic cost.
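To put rough numbers on that (a back-of-the-envelope check using the figures from the question, not a measurement): the array is 250000 * 200 = 50,000,000 ints = 200 MB, and COUNT = 100 passes stream about 20 GB through memory. With the corrected timer, the serial memset (2.38 s) and the parallel memset (2.24 s) both work out to roughly 20 GB / 2.3 s ≈ 8-9 GB/s, i.e. a single thread already saturates about the same bandwidth as twelve.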
Besides your error in using the clock function on Linux, the rest of your question can be answered by reading these questions/answers:
in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987
measuring-memory-bandwidth-from-the-dot-product-of-two-arrays
memset-in-parallel-with-threads-bound-to-each-physical-core
So you should see a significant benefit from multiple threads with memset and with a reduction, even on a single-socket system. I have written my own tool to measure bandwidth for this. You can find some of the results from my i5-4250U (Haswell) with 2 cores below (GCC 4.8, Linux 3.13, EGLIBC 2.19), running over 1 GB. vsum is your reduction. Notice that there is a significant improvement even on this two-core system.
one thread
C standard library
GB time(s) GB/s GFLOPS efficiency
memset: 0.50 0.80 6.68 0.00 inf %
memcpy: 1.00 1.35 7.93 0.00 inf %
Agner Fog's asmlib
GB time(s) GB/s GFLOPS efficiency
memset: 0.50 0.71 7.53 0.00 inf %
memcpy: 1.00 0.93 11.51 0.00 inf %
my_memset
0.50 0.71 7.53 0.00 inf %
FMA3 reduction tests
GB time(s) GB/s GFLOPS efficiency
vsum: 0.50 0.53 10.08 2.52 inf %
vmul: 0.50 0.68 7.93 1.98 inf %
vtriad: 0.50 0.70 7.71 3.85 inf %
dot 1.00 1.08 9.93 2.48 inf %
two threads
C standard library
GB time(s) GB/s GFLOPS efficiency
memset: 0.50 0.64 8.33 0.00 inf %
memcpy: 1.00 1.10 9.76 0.00 inf %
Agner Fog's asmlib
GB time(s) GB/s GFLOPS efficiency
memset: 0.50 0.36 14.98 0.00 inf %
memcpy: 1.00 0.66 16.30 0.00 inf %
my_memset
0.50 0.36 15.03 0.00 inf %
FMA3 tests
standard sum tests with OpenMP: 2 threads
GB time(s) GB/s GFLOPS efficiency
vsum: 0.50 0.41 13.03 3.26 inf %
vmul: 0.50 0.39 13.67 3.42 inf %
vtriad: 0.50 0.44 12.20 6.10 inf %
dot 1.00 0.97 11.11 2.78 inf %
Here is my custom memset function (I have several other tests like this).
void my_memset(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    #pragma omp parallel for
    for(i = 0; i < n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}
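A minimal usage sketch, combined with the function above (assumptions not stated above: the destination must be 16-byte aligned, as _mm_stream_si128 requires and as glibc's allocator provides for a buffer this large, and n should be a multiple of 4 since the loop writes 4 ints per iteration):

#include <emmintrin.h>   // __m128i, _mm_set1_epi32, _mm_stream_si128
#include <vector>
#include <cstddef>

void my_memset(int *s, int c, size_t n);     // the function above; compile with -fopenmp

int main() {
    std::vector<int> buf(1 << 24, 1);        // 16M ints: 16-byte aligned, size a multiple of 4
    my_memset(buf.data(), 0, buf.size());
    return buf[0];                           // read a value back so the effect is observable
}

The streaming store writes around the cache and avoids reading the destination lines first, which is part of why this version also beats the plain loop in the timings further down.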
Edit:
You should compile with -O3 and -ffast-math. Define the sum outside of the outer loop and then print it out so GCC does not optimize it away. GCC won't auto-vectorize a reduction because floating point arithmetic is not associative and vectorizing the loop could break IEEE floating point rules. Using -ffast-math allows floating point arithmetic to be treated as associative, which lets GCC vectorize the reduction. It should be pointed out that doing a reduction in OpenMP already assumes the floating point arithmetic is associative, so it already breaks IEEE floating point rules.
double sum = 0;
tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < sz_i; ++ i)
        sum += ptr[i];
}
toc();
printf("sum %f\n", sum);
Edit:
I tested your code and made some modifications. I get faster times with the reduction and the memset using multiple threads:
max threads: 4
serial reduction
dtime 1.86, sum 705032704
parallel reduction
dtime 1.39 s, sum 705032704
serial memset
dtime 2.95 s
parallel memset
dtime 2.44 s
serial my_memset
dtime 2.66 s
parallel my_memset
dtime 1.35 s
Here is the code I used (compiled with g++ foo.cpp -fopenmp -O3 -ffast-math):
#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>
#include <stdio.h>
#include <xmmintrin.h>
#include <emmintrin.h> // SSE2: __m128i, _mm_set1_epi32, _mm_stream_si128
void my_memset(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    for(i = 0; i < n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

void my_memset_omp(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    #pragma omp parallel for
    for(i = 0; i < n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads() << std::endl;

    std::cout << "serial reduction" << std::endl;
    double dtime;
    int sum;

    dtime = -omp_get_wtime();
    sum = 0;
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f, sum %d\n", dtime, sum);

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;
    dtime = -omp_get_wtime();
    sum = 0;
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s, sum %d\n", dtime, sum);

    std::cout << "serial memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "parallel memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "serial my_memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i);
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "parallel my_memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i);
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    return 0;
}
You are using std::clock, which reports CPU time used, not real (wall-clock) time. As such, each thread's CPU time is added up, so the total will always be higher than the single-threaded time (due to overhead).
http://en.cppreference.com/w/cpp/chrono/c/clock
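For illustration (an example added here, not from the original answer), timing the same parallel loop with both std::clock and a wall-clock timer shows the difference; the CPU-time figure comes out roughly the number of threads times larger than the wall time:

#include <chrono>
#include <cstdio>
#include <ctime>

int main() {
    std::clock_t c0 = std::clock();                 // CPU time, summed over all threads
    auto w0 = std::chrono::steady_clock::now();     // wall-clock time

    double sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 200000000; ++i)
        sum += 1e-9 * i;                            // CPU-bound busy work

    double cpu_s  = double(std::clock() - c0) / CLOCKS_PER_SEC;
    double wall_s = std::chrono::duration<double>(std::chrono::steady_clock::now() - w0).count();
    std::printf("sum %.1f  cpu %.2f s  wall %.2f s\n", sum, cpu_s, wall_s);
    return 0;
}

Build with g++ -std=c++11 -fopenmp -O2. omp_get_wtime(), as used in the question's edit, or std::chrono::steady_clock both report wall-clock time and are the right tools for measuring parallel speedup.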