What is the easiest and most effective way to measure the bandwidth that my application (multi-threaded, written with OpenMP) is using? I ran STREAM to get the maximum sustainable bandwidth, and I'd now like to know whether or not I am saturating the available bandwidth.
I found a couple of related questions (e.g. Main memory bandwidth measurement), but I could not find the answer to this one.
Sadly, I cannot use VTune, but I can use PAPI counters.
My primary goal is to find out if the poor scalability of my application is connected to the saturation of the memory bandwidth.
Thanks
There are a number of ways of getting (from the command line) the bandwidth over the whole application, but it sounds like there are a number of kernels you'd like to look at individually. In that case, wrapping parts of your code with PAPI calls is a perfectly sensible way to go.
You can use PAPI event counters on your system (papi_avail lists which presets are available) to count the total number of load/store instructions, and if you know the sizes of your loads/stores you can get the memory bandwidth. Alternatively, you can count misses in your caches and multiply by the line size to infer the actual amount of data transferred across the system (a short sketch of that approach follows the example below). There is documentation in various places on the PAPI wiki, e.g. here for the high-level interface, and here are some useful formulas for derived quantities.
Here's a simple coded-up example, doing a matrix-vector multiplication the sensible way and the cache-unfriendly transposed way. Note that calling PAPI_read_counters resets the counters, which is what we want here.
#include <stdio.h>
#include <stdlib.h>
typedef char * caddr_t;   /* workaround: papi.h on some systems uses caddr_t without defining it */
#include <papi.h>
#include <sys/time.h>
int init(float ***a, float **x, float **y, int size);
void report_results(char *tname, long_long *values, const int n, double wtime);
void sensible_matvec(float **a, float *x, float *y, int size);
void wrong_order_matvec(float **a, float *x, float *y, int size);
void tick(struct timeval *t);
double tock(struct timeval *t);
#define NUM_EVENTS 3
int main(int argc, char **argv) {
    const int matsize = 4096;

    float **a, *x, *y;
    init(&a, &x, &y, matsize);

    /* L1 data cache misses, load/store instructions, floating point instructions */
    int events[NUM_EVENTS] = {PAPI_L1_DCM, PAPI_LST_INS, PAPI_FP_INS};
    long_long values[NUM_EVENTS];
    double walltime;
    struct timeval t;

    if (PAPI_start_counters(events, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "Error starting PAPI counters; aborting\n");
        exit(1);
    }

    tick(&t);
    sensible_matvec(a, x, y, matsize);
    PAPI_read_counters(values, NUM_EVENTS);   /* reads and resets the counters */
    walltime = tock(&t);

    report_results("Sensible", values, NUM_EVENTS, walltime);

    tick(&t);
    wrong_order_matvec(a, x, y, matsize);
    PAPI_stop_counters(values, NUM_EVENTS);
    walltime = tock(&t);

    report_results("Wrong order", values, NUM_EVENTS, walltime);

    return 0;
}
void report_results(char *tname, long_long *values, const int n, double wtime) {
    long_long total_mem = values[1];
    long_long total_flops = values[2];
    long_long l1misses = values[0];

    printf("Test %s: time elapsed = %f, memory accesses = %lld, flop = %lld\n",
           tname, wtime, total_mem, total_flops);
    printf("\tMemory bandwidth (MB/sec) = %f\n", 1.0*total_mem*sizeof(float)/(wtime*1024*1024));
    printf("\tL1 cache miss rate = %f\n", 1.0*l1misses/total_mem);
    printf("\tMFLOPS = %lf\n\n", 1.0*total_flops/(wtime*1024*1024));
}
int alloc2d(float ***a, int n);
int free2d(float ***a, int n);
int alloc1d(float **x, int n);
int free1d(float **x, int n);
int init(float ***a, float **x, float **y, int size) {
    if (alloc2d(a,size))
        return -2;
    if (alloc1d(x,size)) {
        free2d(a,size);
        return -2;
    }
    if (alloc1d(y,size)) {
        free2d(a,size);
        free1d(x,size);
        return -3;
    }

    for (int i=0; i<size; i++) {
        (*x)[i] = (float)i;
        (*y)[i] = 0.;
    }

    for (int i=0; i<size; i++) {
        for (int j=0; j<size; j++) {
            (*a)[i][j] = i;
        }
    }

    return 0;
}
void sensible_matvec(float **a, float *x, float *y, int size) {
    for (int i=0; i<size; i++) {
        for (int j=0; j<size; j++) {
            y[i] += a[i][j]*x[j];
        }
    }
}

void wrong_order_matvec(float **a, float *x, float *y, int size) {
    for (int j=0; j<size; j++) {
        for (int i=0; i<size; i++) {
            y[i] += a[i][j]*x[j];
        }
    }
}
void tick(struct timeval *t) {
    gettimeofday(t, NULL);
}

double tock(struct timeval *t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}

void freeall(float ***a, float **x, float **y, int size) {
    free2d(a, size);
    free1d(x, size);
    free1d(y, size);
    return;
}
int alloc2d(float ***a, int n) {
    float *data = (float *)malloc(n*n*sizeof(float));
    if (data == NULL) return -1;

    *a = (float **)malloc(n*sizeof(float *));
    if (*a == NULL) {free(data); return -1;};

    for (int i=0; i<n; i++)
        (*a)[i] = &(data[i*n]);

    return 0;
}

int free2d(float ***a, int n) {
    free (&((*a)[0][0]));
    free(*a);
    return 0;
}

int alloc1d(float **a, int n) {
    *a = (float *)malloc(n*sizeof(float));
    if (*a == NULL) return -1;
    return 0;
}

int free1d(float **a, int n) {
    free(*a);
    return 0;
}
Running gives:
$ gcc -o papi-test papi-test.c -I${PAPI_INC_DIR} -L${PAPI_LIB_DIR} -lpapi -Wall -std=c99
$ ./papi-test
Test Sensible: time elapsed = 0.121877, memory accesses = 302020775, flop = 33580481
        Memory bandwidth (MB/sec) = 9453.119330
        L1 cache miss rate = 0.003921
        MFLOPS = 262.763624

Test Wrong order: time elapsed = 0.537639, memory accesses = 302026751, flop = 39629352
        Memory bandwidth (MB/sec) = 2142.963254
        L1 cache miss rate = 0.094045
        MFLOPS = 70.295301
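If you'd rather estimate the traffic that actually reaches main memory (the cache-miss approach mentioned above) instead of the total load/store volume, you can count last-level cache misses and multiply by the line size. A rough sketch along the same lines, assuming the PAPI_L3_TCM preset is available on your machine and a 64-byte cache line (both of which you should check):

#include <stdio.h>
#include <papi.h>

#define LINE_SIZE 64    /* assumed cache line size in bytes; check your CPU */

/* Rough estimate of main-memory bandwidth for one code section:
 * last-level cache misses times the line size, divided by wall time. */
void measure_mem_bw(void (*kernel)(void)) {
    int events[1] = {PAPI_L3_TCM};   /* level-3 total cache misses (preset) */
    long_long misses;

    if (PAPI_start_counters(events, 1) != PAPI_OK) {
        fprintf(stderr, "Could not start PAPI counters\n");
        return;
    }

    long_long t0 = PAPI_get_real_usec();
    kernel();                        /* the section you want to measure */
    long_long t1 = PAPI_get_real_usec();

    PAPI_stop_counters(&misses, 1);

    double seconds = (t1 - t0) / 1.0e6;
    double mbytes  = (double)misses * LINE_SIZE / (1024.0*1024.0);
    printf("approx. memory bandwidth = %f MB/s\n", mbytes / seconds);
}

Depending on how the preset maps to native events on your hardware, hardware-prefetched lines may or may not be counted, so treat the result as an estimate rather than an exact DRAM traffic figure.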
To measure the bandwidth of your application you need to know how much memory is being read and/or written; let's call that the numerator. You also need to know how much time it takes to read and/or write it; let's call that the denominator. The bandwidth is the numerator divided by the denominator.
If your application is complicated, it might not be so easy to calculate how much memory is being read and/or written. Additionally, if your application is doing many other operations, it might not be easy to isolate the time: you would have to subtract off the time of the other operations. Therefore, when measuring maximum throughput, simple algorithms are usually used.
If you want to pick a simple algorithm to compare with your application, you should first check whether your application only writes data, only reads data, or both reads and writes.
If you're only writing data you can use a write (memset) test:
#pragma omp parallel for
for(int i=0; i<n; i++) {
    x[i] = k;
}
If you're both reading and writing data, you can do a simple copy (memcpy) test:
#pragma omp parallel for
for(int i=0; i<n; i++) {
    y[i] = x[i];
}
In fact, if you look at the STREAM source code, that's basically what it does for its copy test.
If you're only reading data you can do a reduction like this (make sure to compile with -ffast-math if you want this to vectorize):
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<n; i++) {
    sum += x[i]*y[i];
}
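Whichever of these loops matches your access pattern, you get a bandwidth number by timing the loop and dividing the bytes moved by the elapsed time. Here is a minimal, self-contained sketch for the copy test; the array size, the byte accounting (4 bytes read + 4 bytes written per element), and the file name are my choices, not anything standard:

/* copy_bw.c -- compile with something like: gcc -O2 -fopenmp -std=c99 copy_bw.c -o copy_bw */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    /* pick n so the arrays are much larger than the last-level cache */
    const size_t n = 1 << 26;                  /* 64M floats = 256 MB per array */
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    if (!x || !y) return 1;

    /* parallel first touch so the pages are distributed across NUMA nodes
       the same way the timed loop will access them */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 0.0f; }

    /* for more stable numbers, repeat the copy a few times and take the best */
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = x[i];
    double t1 = omp_get_wtime();

    /* 4 bytes read + 4 bytes written per element; add another 4 bytes per
       element if you want to count the write-allocate read of y */
    double gbytes = 2.0 * n * sizeof(float) / 1e9;
    printf("copy bandwidth: %.2f GB/s\n", gbytes / (t1 - t0));

    free(x); free(y);
    return 0;
}

The result is directly comparable to the STREAM copy figure, which is the point of using the same kind of kernel.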
STREAM's tests are all read-and-write tests. I have written my own bandwidth tool which does writes only, reads and writes, and reads only.
Unfortunately, the tests which write data won't get close to the peak bandwidth. The reason is that before a cache line can be written, it first has to be read into the cache (write-allocate). This is why STREAM does not get anywhere close to the peak bandwidth on my system. To get the peak bandwidth when writing, you need to use non-temporal stores, which write the data directly without first reading the line into the cache.
For example, with SSE, and assuming that x and y are 16-byte-aligned float arrays (with n a multiple of four), you could do the read-and-write test like this:
#include <xmmintrin.h>   /* for _mm_load_ps / _mm_stream_ps */

#pragma omp parallel for
for(int i=0; i<n/4; i++) {
    __m128 v = _mm_load_ps(&x[4*i]);   /* aligned load of four floats */
    _mm_stream_ps(&y[4*i], v);         /* non-temporal store: bypasses the cache */
}
If you look at Agner Fog's asmlib you will see this is exactly what he does for memset and memcpy for large arrays. In fact, his asmlib and that example I just gave get 85% (45 GB/s out of 51 GB/s) of the bandwidth on my system, whereas the STREAM tests only get about 45% of the bandwidth.
These tests assume that your algorithm is memory bound and, for the comparison to be fair, that you read an array much larger than the last-level (slowest) cache. If your algorithm reuses data that's still in the cache, then the read test won't get close to the peak bandwidth either, because of the loop-carried dependency on the accumulator. To fix that you have to unroll 3-10 times, keeping independent partial sums, depending on the operation and hardware (see the sketch below). Also, if you're writing arrays which fit in the cache and which you will reuse, then you don't want non-temporal stores. That's why Agner Fog's asmlib only uses non-temporal stores for large arrays.
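To make the unrolling point concrete, here is a sketch of what a 4-way unroll with independent partial sums might look like for the read-only reduction; the factor of 4 and the name dot_unrolled are just for illustration, and the right factor depends on the latency and width of your hardware:

#include <stddef.h>

/* 4-way unrolled dot product: four independent accumulators hide the
 * latency of the floating-point add chain; n is assumed to be a multiple of 4 */
float dot_unrolled(const float *x, const float *y, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    #pragma omp parallel for reduction(+:s0,s1,s2,s3)
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i+0] * y[i+0];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    return s0 + s1 + s2 + s3;
}

With -ffast-math a good compiler may do this kind of transformation for you; the manual version shows what it has to produce either way.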