I'm testing the performance speedup of some algorithms when using OpenMP, and one of them is not scaling. Am I doing something wrong?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(int argc, char **argv) {
    int test_size, i;
    double *vector, mean, stddeviation, start_time, duration;

    if (argc != 2) {
        printf("Usage: %s <test_size>\n", argv[0]);
        return 1;
    }

    srand((int) omp_get_wtime());
    test_size = atoi(argv[1]);
    printf("Test Size: %d\n", test_size);

    /* Fill the array with random values (not timed). */
    vector = (double *) malloc(test_size * sizeof(double));
    for (i = 0; i < test_size; i++) {
        vector[i] = rand();
    }

    start_time = omp_get_wtime();
    mean = 0;
    stddeviation = 0;
    #pragma omp parallel default(shared) private(i)
    {
        /* First pass: sum the elements, then divide once to get the mean. */
        #pragma omp for reduction(+:mean)
        for (i = 0; i < test_size; i++) {
            mean += vector[i];
        }
        #pragma omp single
        mean /= test_size;

        /* Second pass: accumulate the squared deviations from the mean. */
        #pragma omp for reduction(+:stddeviation)
        for (i = 0; i < test_size; i++) {
            stddeviation += (vector[i] - mean) * (vector[i] - mean);
        }
    }
    stddeviation = sqrt(stddeviation / test_size);
    duration = omp_get_wtime() - start_time;

    printf("Std. Deviation = %lf\n", stddeviation);
    printf("Duration: %fms\n", duration * 1000);

    free(vector);
    return 0;
}
gcc -c -o main.o main.c -fopenmp -lm -O3
gcc -o dp main.o -fopenmp -lm -O3
$ OMP_NUM_THREADS=1 ./dp 100000000
166.224199ms
$ OMP_NUM_THREADS=2 ./dp 100000000
157.924034ms
$ OMP_NUM_THREADS=4 ./dp 100000000
159.056189ms
I cannot reproduce your results with Ubuntu 14.04.2 LTS, gcc 4.8, and a 2.3 GHz Intel Core i7. Here are the results I get:
$ OMP_NUM_THREADS=1 ./so30627170 100000000
Test Size: 100000000
Std. Deviation = 619920018.463329
Duration: 206.301721ms
$ OMP_NUM_THREADS=2 ./so30627170 100000000
Test Size: 100000000
Std. Deviation = 619901821.463117
Duration: 110.381279ms
$ OMP_NUM_THREADS=4 ./so30627170 100000000
Test Size: 100000000
Std. Deviation = 619883614.594906
Duration: 78.241708ms
Because the output listed in the "Results" section of your question could not have come from the code as posted (it is missing the Test Size, Std. Deviation, and Duration: lines that the code prints), you may be running an old version of your code.
I thought about possibly using x86 intrinsics within the parallel for loops, but on examining the assembly output, gcc already emits SIMD instructions in this case. Without any -march option, I was seeing gcc use SSE2 instructions; compiling with -march=native or -mavx, gcc would use AVX instructions.
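For reference, here is a rough sketch of what a hand-vectorized sum over the array could look like with SSE2 intrinsics. simd_sum is a hypothetical helper, not part of the program above, and since gcc's auto-vectorizer already produces equivalent code at -O3, it is illustrative only:

#include <emmintrin.h>  /* SSE2 intrinsics for double-precision math */

/* Sum n doubles, two lanes per iteration. */
double simd_sum(const double *v, int n) {
    __m128d acc = _mm_setzero_pd();
    int i;
    for (i = 0; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(&v[i]));  /* unaligned 128-bit load */
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];  /* horizontal add of the two lanes */
    for (; i < n; i++)                 /* scalar tail for odd n */
        sum += v[i];
    return sum;
}

Each OpenMP thread would run this over its own chunk of the array; with -mavx the same idea extends to four doubles per 256-bit register.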
EDIT: Running the Go version of your program, I get:
$ ./tcc-go-desvio-padrao -w 1 -n 15 -t 100000000
2015/06/07 08:26:43 Workers: 1
2015/06/07 08:26:43 Tests: [100000000]
2015/06/07 08:26:43 # of executions of each test: 15
2015/06/07 08:26:43 Time to allocate memory: 584.477µs
2015/06/07 08:26:43 ===========================================
2015/06/07 08:26:43 Current test size: 100000000
2015/06/07 08:27:05 Time to fill the array: 1.322556083s
2015/06/07 08:27:05 Time to calculate: 194.10728ms
$ ./tcc-go-desvio-padrao -w 2 -n 15 -t 100000000
2015/06/07 08:27:10 Workers: 2
2015/06/07 08:27:10 Tests: [100000000]
2015/06/07 08:27:10 # of executions of each test: 15
2015/06/07 08:27:10 Time to allocate memory: 565.273µs
2015/06/07 08:27:10 ===========================================
2015/06/07 08:27:10 Current test size: 100000000
2015/06/07 08:27:22 Time to fill the array: 677.755324ms
2015/06/07 08:27:22 Time to calculate: 113.095753ms
$ ./tcc-go-desvio-padrao -w 4 -n 15 -t 100000000
2015/06/07 08:27:28 Workers: 4
2015/06/07 08:27:28 Tests: [100000000]
2015/06/07 08:27:28 # of executions of each test: 15
2015/06/07 08:27:28 Time to allocate memory: 576.568µs
2015/06/07 08:27:28 ===========================================
2015/06/07 08:27:28 Current test size: 100000000
2015/06/07 08:27:34 Time to fill the array: 353.646193ms
2015/06/07 08:27:34 Time to calculate: 79.86221ms
The timings appear about the same as the OpenMP version.