The dot product of two arrays
for(int i=0; i<n; i++) {
    sum += x[i]*y[i];
}
does not reuse data, so it should be a memory-bound operation. Therefore, I should be able to measure the memory bandwidth from the dot product.
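(Each loop iteration loads two floats, 8 bytes, and does only two flops, a multiply and an add, so the flop rate is just the byte rate divided by four; the arithmetic is far too cheap to keep up with how fast the data has to be streamed in, which is why I expect the loop to be limited by memory rather than compute.)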
Using the code at why-vectorizing-the-loop-does-not-have-performance-improvement I get a bandwidth of 9.3 GB/s for my system. However, when I attempt to calculate the bandwidth using the dot product I get over twice the rate for a single thread and over three times the rate using multiple threads (my system has four cores/eight hyper-threads). This makes no sense to me, since a memory-bound operation should not benefit from multiple threads. Here is the output from the code below:
Xeon E5-1620, GCC 4.9.0, Linux kernel 3.13
dot 1 thread: 1.0 GB, sum 191054.81, time 4.98 s, 21.56 GB/s, 5.39 GFLOPS
dot_avx 1 thread: 1.0 GB, sum 191043.33, time 5.16 s, 20.79 GB/s, 5.20 GFLOPS
dot_avx 2 threads: 1.0 GB, sum 191045.34, time 3.44 s, 31.24 GB/s, 7.81 GFLOPS
dot_avx 8 threads: 1.0 GB, sum 191043.34, time 3.26 s, 32.91 GB/s, 8.23 GFLOPS
Can somebody please explain to me why I get over twice the bandwidth for one thread and over three times the bandwidth using more than one thread?
Here is the code I used:
//g++ -O3 -fopenmp -mavx -ffast-math dot.cpp
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>
#include <omp.h>

// Sum the eight floats in an AVX register down to a single scalar.
extern "C" inline float horizontal_add(__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a,a);
    __m256 t2 = _mm256_hadd_ps(t1,t1);
    __m128 t3 = _mm256_extractf128_ps(t2,1);
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);
    return _mm_cvtss_f32(t4);
}

// AVX dot product: four partial sums, 32 floats per iteration, loop shared across threads with OpenMP.
extern "C" float dot_avx(float * __restrict x, float * __restrict y, const int n) {
    x = (float*)__builtin_assume_aligned(x, 32);
    y = (float*)__builtin_assume_aligned(y, 32);
    float sum = 0;
    #pragma omp parallel reduction(+:sum)
    {
        __m256 sum1 = _mm256_setzero_ps();
        __m256 sum2 = _mm256_setzero_ps();
        __m256 sum3 = _mm256_setzero_ps();
        __m256 sum4 = _mm256_setzero_ps();
        __m256 x8, y8;
        #pragma omp for
        for(int i=0; i<n; i+=32) {
            x8 = _mm256_loadu_ps(&x[i]);
            y8 = _mm256_loadu_ps(&y[i]);
            sum1 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum1);
            x8 = _mm256_loadu_ps(&x[i+8]);
            y8 = _mm256_loadu_ps(&y[i+8]);
            sum2 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum2);
            x8 = _mm256_loadu_ps(&x[i+16]);
            y8 = _mm256_loadu_ps(&y[i+16]);
            sum3 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum3);
            x8 = _mm256_loadu_ps(&x[i+24]);
            y8 = _mm256_loadu_ps(&y[i+24]);
            sum4 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum4);
        }
        sum += horizontal_add(_mm256_add_ps(_mm256_add_ps(sum1,sum2),_mm256_add_ps(sum3,sum4)));
    }
    return sum;
}

// Plain scalar dot product (with -O3 -mavx -ffast-math the compiler can auto-vectorize this).
extern "C" float dot(float * __restrict x, float * __restrict y, const int n) {
    x = (float*)__builtin_assume_aligned(x, 32);
    y = (float*)__builtin_assume_aligned(y, 32);
    float sum = 0;
    for(int i=0; i<n; i++) {
        sum += x[i]*y[i];
    }
    return sum;
}

int main(){
    uint64_t LEN = 1 << 27;
    float *x = (float*)_mm_malloc(sizeof(float)*LEN,64);
    float *y = (float*)_mm_malloc(sizeof(float)*LEN,64);
    for(uint64_t i=0; i<LEN; i++) { x[i] = 1.0*rand()/RAND_MAX - 0.5; y[i] = 1.0*rand()/RAND_MAX - 0.5; }

    uint64_t size = 2*sizeof(float)*LEN;  // bytes read per dot product (both arrays)
    volatile float sum = 0;               // volatile so the calls are not optimized away
    double dtime, rate, flops;
    int repeat = 100;

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) sum += dot(x,y,LEN);
    dtime = omp_get_wtime() - dtime;
    rate  = 1.0*repeat*size/dtime*1E-9;
    flops = 2.0*repeat*LEN/dtime*1E-9;
    printf("%f GB, sum %f, time %f s, %.2f GB/s, %.2f GFLOPS\n", 1.0*size/1024/1024/1024, sum, dtime, rate, flops);

    sum = 0;
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) sum += dot_avx(x,y,LEN);
    dtime = omp_get_wtime() - dtime;
    rate  = 1.0*repeat*size/dtime*1E-9;
    flops = 2.0*repeat*LEN/dtime*1E-9;
    printf("%f GB, sum %f, time %f s, %.2f GB/s, %.2f GFLOPS\n", 1.0*size/1024/1024/1024, sum, dtime, rate, flops);
}
I just downloaded, compiled, and ran STREAM as suggested by Jonathan Dursi, and here are the results:
One thread
Function Rate (MB/s) Avg time Min time Max time
Copy: 14292.1657 0.0023 0.0022 0.0023
Scale: 14286.0807 0.0023 0.0022 0.0023
Add: 14724.3906 0.0033 0.0033 0.0033
Triad: 15224.3339 0.0032 0.0032 0.0032
Eight threads
Function Rate (MB/s) Avg time Min time Max time
Copy: 24501.2282 0.0014 0.0013 0.0021
Scale: 23121.0556 0.0014 0.0014 0.0015
Add: 25263.7209 0.0024 0.0019 0.0056
Triad: 25817.7215 0.0020 0.0019 0.0027
The theoretical maximum memory bandwidth of a processor (Intel describes the calculation for its Core X-Series processors, but it applies generally) is obtained by multiplying the memory transfer rate (twice the memory clock, since double data rate transfers on both clock edges) by the width of each channel in bytes and by the number of memory channels the processor supports.
Memory bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor. Memory bandwidth is usually expressed in units of bytes/second, though this can vary for systems with natural data sizes that are not a multiple of the commonly used 8-bit bytes.
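For the Xeon E5-1620 here, assuming all four memory channels are populated with DDR3-1600, that works out to roughly 1600 MT/s x 8 bytes x 4 channels = 51.2 GB/s of theoretical peak, so even the multi-threaded numbers above are well below the theoretical maximum.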
There are a few things going on here, which come down to two points: you have to work fairly hard to get every last bit of performance out of the memory subsystem, and different benchmarks measure different things.
The first helps explain why you need multiple threads to saturate the available memory bandwidth. There is a lot of concurrency in the memory system, and taking advantage of it will often require some concurrency in your CPU code. One big reason that multiple threads of execution help is latency hiding - while one thread is stalled waiting for data to arrive, another thread may be able to take advantage of some other data that has just become available.
The hardware helps you a lot on a single thread in this case - because the memory access is so predictable, the hardware can prefetch the data ahead of when you need it, giving you some of the advantage of latency hiding even with one thread; but there are limits to what prefetch can do. The prefetcher won't take it upon itself to cross page boundaries, for instance. The canonical reference for much of this is What Every Programmer Should Know About Memory by Ulrich Drepper, which is now old enough that some gaps are starting to show (Intel's Hot Chips overview of your Sandy Bridge processor is here - note in particular the tighter integration of the memory management hardware with the CPU).
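As a rough illustration of what the hardware prefetcher is doing for you automatically, here is a minimal sketch of issuing software prefetch hints from the scalar dot product loop; dot_with_prefetch and the PF_DIST distance are made up for illustration and would need tuning per machine:

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

float dot_with_prefetch(const float *x, const float *y, int n) {
    const int PF_DIST = 512;  // elements (not bytes) to prefetch ahead; hypothetical value
    float sum = 0;
    for(int i=0; i<n; i++) {
        if (i + PF_DIST < n) {
            // Request the cache lines we will need soon.
            _mm_prefetch((const char*)&x[i + PF_DIST], _MM_HINT_T0);
            _mm_prefetch((const char*)&y[i + PF_DIST], _MM_HINT_T0);
        }
        sum += x[i]*y[i];
    }
    return sum;
}

Issuing one hint per element is redundant (16 floats share a cache line), and on a stride-1 stream like this the hardware prefetcher already does the job well, so explicit hints rarely help here; they matter more for access patterns the hardware cannot predict.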
As to the question about comparing with memset, mbw, or STREAM: comparing across benchmarks will always cause headaches, even benchmarks that claim to be measuring the same thing. In particular, "memory bandwidth" isn't a single number - performance varies quite a bit depending on the operations. Both mbw and STREAM do some version of a copy operation, with STREAM's operations spelled out here (taken straight from the web page; all operands are double-precision floating point values):
------------------------------------------------------------------
name        kernel                   bytes/iter    FLOPS/iter
------------------------------------------------------------------
COPY:       a(i) = b(i)                  16            0
SCALE:      a(i) = q*b(i)                16            1
SUM:        a(i) = b(i) + c(i)           24            1
TRIAD:      a(i) = b(i) + q*c(i)         24            2
------------------------------------------------------------------
so roughly 1/2 to 1/3 of the memory operations in these cases are writes (and everything's a write in the case of memset). While individual writes can be a little slower than reads, the bigger issue is that it's much harder to saturate the memory subsystem with writes because, of course, you can't do the equivalent of prefetching a write. Interleaving the reads and writes helps, but your dot-product example, which is essentially all reads, is going to be about the best possible case for pegging the needle on memory bandwidth.
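One detail behind that: an ordinary store to a cache line that isn't already resident triggers a read-for-ownership, so a plain copy loop ends up reading the destination as well as writing it. Non-temporal (streaming) stores avoid that extra read. A minimal sketch, assuming 32-byte-aligned buffers whose length is a multiple of 8 floats (copy_stream is just an illustrative name, not from any of the benchmarks above):

#include <immintrin.h>

// Copy using streaming stores: destination lines go straight to memory
// without first being read into the cache.
void copy_stream(float * __restrict dst, const float * __restrict src, int n) {
    for(int i=0; i<n; i+=8) {
        __m256 v = _mm256_load_ps(&src[i]);  // aligned 32-byte load
        _mm256_stream_ps(&dst[i], v);        // non-temporal store
    }
    _mm_sfence();  // order the streaming stores before any later accesses
}

Whether a given benchmark build ends up using streaming stores depends on the compiler and flags, which is one more reason numbers from different "bandwidth" benchmarks are hard to compare.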
In addition, the STREAM benchmark is (intentionally) written completely portably, with only some compiler pragmas to suggest vectorization, so beating the STREAM benchmark isn't necessarily a warning sign, especially when what you're doing is two streaming reads.
I made my own memory benchmark code https://github.com/zboson/bandwidth
Here are the current results for eight threads:
write: 0.5 GB, time 2.96e-01 s, 18.11 GB/s
copy: 1 GB, time 4.50e-01 s, 23.85 GB/s
scale: 1 GB, time 4.50e-01 s, 23.85 GB/s
add: 1.5 GB, time 6.59e-01 s, 24.45 GB/s
mul: 1.5 GB, time 6.56e-01 s, 24.57 GB/s
triad: 1.5 GB, time 6.61e-01 s, 24.37 GB/s
vsum: 0.5 GB, time 1.49e-01 s, 36.09 GB/s, sum -8.986818e+03
vmul: 0.5 GB, time 9.00e-05 s, 59635.10 GB/s, sum 0.000000e+00
vmul_sum: 1 GB, time 3.25e-01 s, 33.06 GB/s, sum 1.910421e+04
Here are the current results for one thread:
write: 0.5 GB, time 4.65e-01 s, 11.54 GB/s
copy: 1 GB, time 7.51e-01 s, 14.30 GB/s
scale: 1 GB, time 7.45e-01 s, 14.41 GB/s
add: 1.5 GB, time 1.02e+00 s, 15.80 GB/s
mul: 1.5 GB, time 1.07e+00 s, 15.08 GB/s
triad: 1.5 GB, time 1.02e+00 s, 15.76 GB/s
vsum: 0.5 GB, time 2.78e-01 s, 19.29 GB/s, sum -8.990941e+03
vmul: 0.5 GB, time 1.15e-05 s, 468719.08 GB/s, sum 0.000000e+00
vmul_sum: 1 GB, time 5.72e-01 s, 18.78 GB/s, sum 1.910549e+04
The kernels are defined as follows:

write:      memset
mul:        a(i) = b(i) * c(i)
vsum:       sum += a(i)
vmul:       sum *= a(i)
vmul_sum:   sum += a(i)*b(i)    // the dot product

My results are consistent with STREAM. I get the highest bandwidth for vsum. The vmul method does not work currently (once the value is zero it finishes early). I can get slightly better results (by about 10%) using intrinsics and unrolling the loop, which I will add later.
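For reference, vsum is just a straight reduction over one array; a minimal sketch of the obvious OpenMP form (not necessarily the exact code in the repository):

// vsum: sum += a(i); reads one array and writes back only a single scalar.
float vsum(const float * __restrict a, int n) {
    float sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for(int i=0; i<n; i++) {
        sum += a[i];
    }
    return sum;
}

Like the dot product, it is essentially all reads, which is why it comes closer to the read bandwidth than the kernels that also have to write an output array.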