What does the STREAM memory bandwidth benchmark really measure?

I have a few questions about the STREAM benchmark (http://www.cs.virginia.edu/stream/ref.html#runrules).

  1. Below is a comment from stream.c. What is the rationale for the requirement that the arrays be at least 4 times the size of the cache?
 *       (a) Each array must be at least 4 times the size of the
 *           available cache memory. I don't worry about the difference
 *           between 10^6 and 2^20, so in practice the minimum array size
 *           is about 3.8 times the cache size.
  2. I originally assumed STREAM measures the peak memory bandwidth, but I later found that when I add extra arrays and array accesses, I can get larger bandwidth numbers. So it looks to me that STREAM is not guaranteed to saturate memory bandwidth. My question is: what does STREAM really measure, and how should the numbers it reports be used?

For example, I added two extra arrays and made sure to access them together with the original a/b/c arrays. I modified the bytes accounting accordingly. With these two extra arrays, my bandwidth number goes up by ~11.5%.

> diff stream.c modified_stream.c
181c181,183
<                       c[STREAM_ARRAY_SIZE+OFFSET];
---
>                       c[STREAM_ARRAY_SIZE+OFFSET],
>                       e[STREAM_ARRAY_SIZE+OFFSET],
>                       d[STREAM_ARRAY_SIZE+OFFSET];
192,193c194,195
<     3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
<     3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
---
>     5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
>     5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
270a273,274
>             d[j] = 3.0;
>             e[j] = 3.0;
335c339
<           c[j] = a[j]+b[j];
---
>           c[j] = a[j]+b[j]+d[j]+e[j];
345c349
<           a[j] = b[j]+scalar*c[j];
---
>           a[j] = b[j]+scalar*c[j] + d[j]+e[j];

CFLAGS = -O2 -fopenmp -D_OPENMP -DSTREAM_ARRAY_SIZE=50000000

My last level cache is around 35MB.
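
As a rough sanity check (a sketch with assumed values: STREAM_TYPE is an 8-byte double, and the last-level cache is taken as 35 MiB as mentioned above), this configuration easily satisfies the 4x-cache rule:

#include <stdio.h>

int main(void) {
    const double llc_bytes   = 35.0 * 1024 * 1024;             /* ~35 MiB last-level cache (assumed) */
    const double array_bytes = 50000000.0 * sizeof(double);    /* STREAM_ARRAY_SIZE * 8 bytes        */
    printf("each array : %.0f MiB\n", array_bytes / (1024.0 * 1024.0));       /* ~381 MiB */
    printf("4 x cache  : %.0f MiB\n", 4.0 * llc_bytes / (1024.0 * 1024.0));   /*  140 MiB */
    printf("rule met   : %s\n", array_bytes >= 4.0 * llc_bytes ? "yes" : "no");
    return 0;
}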

Any comments?

Thanks!

This is for a Skylake Linux server.

Asked May 11 '19 by yeeha

1 Answer

Memory accesses in modern computers are a lot more complex than one might expect, and it is very hard to tell when the "high-level" model falls apart because of some "low-level" detail that you did not know about before....

The STREAM benchmark code only measures execution time -- everything else is derived. The derived numbers are based on both decisions about what I think is "reasonable" and assumptions about how the majority of computers work. The run rules are the product of trial and error -- attempting to balance portability with generality.

The STREAM benchmark reports "bandwidth" values for each of the kernels. These are simple calculations based on the assumption that each array element on the right hand side of each loop has to be read from memory and each array element on the left hand side of each loop has to be written to memory. Then the "bandwidth" is simply the total amount of data moved divided by the execution time.
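
To make that derivation concrete, here is an illustrative sketch (not verbatim stream.c; the per-kernel byte counts match the accounting stream.c uses, but the timing value is a made-up placeholder):

#include <stdio.h>

#define N 50000000                     /* STREAM_ARRAY_SIZE */
typedef double STREAM_TYPE;

int main(void) {
    /* Assumed bytes moved per kernel iteration: 2 arrays for Copy/Scale,
       3 arrays for Add/Triad -- exactly the accounting the question modified. */
    double bytes[4] = {
        2.0 * sizeof(STREAM_TYPE) * N,   /* Copy : c[j] = a[j]             */
        2.0 * sizeof(STREAM_TYPE) * N,   /* Scale: b[j] = scalar*c[j]      */
        3.0 * sizeof(STREAM_TYPE) * N,   /* Add  : c[j] = a[j]+b[j]        */
        3.0 * sizeof(STREAM_TYPE) * N    /* Triad: a[j] = b[j]+scalar*c[j] */
    };
    double best_time_triad = 0.010;      /* placeholder: best measured time in seconds */
    /* Reported rate = assumed bytes moved / measured execution time -- nothing more. */
    printf("Triad: %.1f MB/s\n", 1.0e-6 * bytes[3] / best_time_triad);
    return 0;
}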

There are a surprising number of assumptions involved in this simple calculation.

  • The model assumes that the compiler generates code to perform all the loads, stores, and arithmetic instructions that are implied by the memory traffic counts. The approach used in STREAM to encourage this is fairly robust, but an advanced compiler might notice that all the array elements in each array contain the same value, so only one element from each array actually needs to be processed. (This is how the validation code works.)
  • Sometimes compilers move the timer calls out of their source-code locations. This is a (subtle) violation of the language standards, but is easy to catch because it usually produces nonsensical results.
  • The model assumes a negligible number of cache hits. (With cache hits, the computed value is still a "bandwidth", it is just not the "memory bandwidth".) The STREAM Copy and Scale kernels only load one array (and store one array), so if the stores bypass the cache, the total amount of traffic going through the cache in each iteration is the size of one array. Cache addressing and indexing are sometimes very complex, and cache replacement policies may be dynamic (either pseudo-random or based on run-time utilization metrics). As a compromise between size and accuracy, I picked 4x as the minimum array size relative to the cache size to ensure that most systems have a very low fraction of cache hits (i.e., low enough to have negligible influence on the reported performance).
  • The data traffic counts in STREAM do not "give credit" to additional transfers that the hardware does, but that were not explicitly requested. This primarily refers to "write allocate" traffic -- most systems read each store target address from memory before the store can update the corresponding cache line. Many systems have the ability to skip this "write allocate", either by allocating a line in the cache without reading it (POWER) or by executing stores that bypass the cache and go straight to memory (x86). More notes on this are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/
  • Multicore processors with more than 2 DRAM channels are typically unable to reach asymptotic bandwidth using only a single core. The OpenMP directives that were originally provided for large shared-memory systems now must be enabled on nearly every processor with more than 2 DRAM channels if you want to reach asymptotic bandwidth levels.
  • Single-core bandwidth is still important, but is typically limited by the number of cache misses that a single core can generate, and not by the peak DRAM bandwidth of the system. The issues are presented in http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
  • For the single-core case, the number of outstanding L1 Data Cache misses is much too small to get full bandwidth -- for your Xeon Scalable processor about 140 concurrent cache misses are required for each socket, but a single core can only support 10-12 L1 Data Cache misses. The L2 hardware prefetchers can generate additional memory concurrency (up to ~24 cache misses per core, if I recall correctly), but reaching average values near the upper end of this range requires simultaneous accesses to multiple 4KiB pages. Your additional array reads give the L2 hardware prefetchers more opportunity to generate (close to) the maximum number of concurrent memory accesses. An increase of 11%-12% is completely reasonable. (A rough Little's-Law estimate of this concurrency requirement is sketched after this list.)
  • Increasing the fraction of reads is also expected to increase the performance when using all the cores. In this case the benefit comes primarily from reducing the number of "read-write turnaround stalls" on the DDR4 DRAM interface. With no stores at all, sustained bandwidth should reach about 90% of peak on this processor (using 16 or more cores per socket).
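
To illustrate where a figure like "~140 concurrent cache misses per socket" comes from, here is a rough Little's-Law sketch; the bandwidth and latency numbers are assumptions for a Skylake-SP-class socket, not measurements:

#include <stdio.h>

int main(void) {
    /* Assumed values, for illustration only. */
    double peak_bw   = 128e9;   /* ~128 GB/s per socket (6 channels of DDR4-2666) */
    double latency   = 70e-9;   /* ~70 ns average memory latency                  */
    double line_size = 64.0;    /* bytes per cache line                           */

    /* Little's Law: concurrency = bandwidth * latency, expressed in cache lines. */
    double lines_in_flight = peak_bw * latency / line_size;
    printf("cache lines in flight needed: %.0f\n", lines_in_flight);   /* ~140 */
    return 0;
}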

Additional notes on avoiding "write allocate" traffic:

  1. In x86 architectures, cache-bypassing stores typically invalidate the corresponding address from the local caches and hold the data in a "write-combining buffer" until the processor decides to push the data to memory. Other processors are allowed to keep and use "stale" copies of the cache line during this period. When the write-combining buffer is flushed, the cache line is sent to the memory controller in a transaction that is very similar to an IO DMA write. The memory controller has the responsibility of issuing "global" invalidations on the address before updating memory. Care must be taken when these streaming stores are used to update memory that is shared across cores. The general model is to execute the streaming stores, execute a store fence, then execute an "ordinary" store to a "flag" variable. The store fence will ensure that no other processor can see the updated "flag" variable until the results of all of the streaming stores are globally visible. (With a sequence of "ordinary" stores, results always become visible in program order, so no store fence is required.) A minimal sketch of this streaming-store / fence / flag pattern appears after this list.
  2. In the PowerPC/POWER architecture, the DCBZ (or DCLZ) instruction can be used to avoid write allocate traffic. If the line is in cache, its contents are set to zero. If the line is not in cache, a line is allocated in the cache with its contents set to zero. One downside of this approach is that the cache line size is exposed here. DCBZ on a PowerPC with 32-Byte cache lines will clear 32 Bytes. The same instruction on a processor with 128-Byte cache lines will clear 128 Bytes. This was irritating to a vendor who used both. I don't remember enough of the details of the POWER memory ordering model to comment on how/when the coherence transactions become visible with this instruction.
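
Here is a minimal sketch of the x86 pattern described in note 1 (illustrative only: the buffer, its size, the flag name, and the function are made up, and the consumer side is omitted):

#include <emmintrin.h>    /* _mm_stream_pd, _mm_loadu_pd */
#include <xmmintrin.h>    /* _mm_sfence                  */

#define N 1024
static double dst[N] __attribute__((aligned(16)));   /* shared output buffer (assumed) */
static volatile int ready = 0;                        /* the "flag" variable            */

void publish(const double *src)
{
    for (int i = 0; i < N; i += 2) {
        /* Cache-bypassing (non-temporal) store: data sits in a
           write-combining buffer until it is pushed to memory.   */
        _mm_stream_pd(&dst[i], _mm_loadu_pd(&src[i]));
    }
    _mm_sfence();    /* make all streaming stores globally visible ...      */
    ready = 1;       /* ... before any other core can observe the flag set. */
}
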
Answered Oct 03 '22 by John D McCalpin