Say I have a toy loop like this
float x[N];
float y[N];
for (int i = 1; i < N-1; i++)
    y[i] = a*(x[i-1] - x[i] + x[i+1]);
And I assume my cache line is 64 bytes (i.e. big enough). Then I will have, per iteration, basically 2 accesses to RAM and 3 FLOP:
x[i-1], x[i], x[i+1]
y[i]
The operational intensity is therefore
OI = 3 FLOP / (2 * 4 Byte) = 0.375 FLOP/Byte
Now what happens if I do something like this
float x[N];
for (int i = 1; i < N-1; i++)
    x[i] = a*(x[i-1] - x[i] + x[i+1]);
Note that there is no y anymore. Does that mean I now have a single RAM access per iteration
x[i-1], x[i], x[i+1]
(also storing x[i]), or still 2 RAM accesses
x[i-1], x[i], x[i+1]
x[i]
The operational intensity OI would be different in either case. Can anyone say something about this, or clarify a few things? Thanks
Operational intensity, as introduced in the Roofline model -- operations per byte of DRAM traffic -- is a simple measure that can be used to determine which architectures are the best match for a given computational kernel or, conversely, how to optimize a kernel so that it performs better on a given architecture.
Roofline is a visually intuitive performance model created by Samuel Williams that is used to bound the performance of various numerical methods and operations running on multicore, manycore, or accelerator processor architectures.
Arithmetic intensity is a measure of the floating-point operations (FLOPs) performed by a given code (or code section) relative to the amount of memory traffic (in bytes) required to support those operations. It is most often defined as a FLOP-per-byte ratio (F/B).
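To make the definition concrete, here is a minimal C sketch (my own illustration, not part of the model's literature) of how the roofline bound is usually evaluated: attainable performance is the smaller of the machine's peak compute rate and arithmetic intensity times peak DRAM bandwidth. The peak numbers below are made-up placeholders, not measurements of a real machine.

#include <stdio.h>

/* Roofline bound: attainable FLOP/s is capped either by the machine's peak
 * compute rate or by (arithmetic intensity * peak DRAM bandwidth), whichever
 * is smaller.  Both peak values here are hypothetical placeholders. */
double roofline_gflops(double ai_flop_per_byte,
                       double peak_gflops,        /* compute ceiling  */
                       double peak_gbytes_per_s)  /* memory bandwidth */
{
    double mem_bound = ai_flop_per_byte * peak_gbytes_per_s;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    /* The question's first loop: 3 FLOP per 8 bytes = 0.375 FLOP/byte. */
    double ai = 3.0 / 8.0;
    printf("attainable: %.1f GFLOP/s\n", roofline_gflops(ai, 100.0, 25.0));
    return 0;
}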
Disclaimer: I had never heard of the roofline performance model until today. As far as I can tell, it attempts to calculate a theoretical bound on the "arithmetic intensity" of an algorithm, which is the number of FLOPs per byte of data accessed. Such a measure may be useful for comparing similar algorithms as N grows large, but is not very helpful for predicting real-world performance.
As a general rule of thumb, modern processors can execute instructions much more quickly than they can fetch/store data (this becomes drastically more pronounced as the data starts to grow larger than the size of the caches). So contrary to what one might expect, a loop with higher arithmetic intensity may run much faster than a loop with lower arithmetic intensity; what matters most as N scales is the total amount of data touched (this will hold true as long as memory remains significantly slower than the processor, as is true in common desktop and server systems today).
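To make that rule of thumb concrete, here is a rough sketch (the function names and FLOP counts are just illustrative): both loops stream the same number of bytes through memory, so for a large n on a typical desktop the second one, despite doing roughly eight times the arithmetic, often takes barely longer than the first, because both are limited by memory traffic rather than by the FLOPs.

#include <stddef.h>

/* One pass over the array, 1 FLOP per element: memory bound almost at once. */
void scale(float *x, size_t n, float a)
{
    for (size_t i = 0; i < n; i++)
        x[i] = a * x[i];
}

/* One pass over the array, ~8 FLOPs per element: higher arithmetic intensity,
 * but exactly the same bytes are moved to/from RAM, so for large n the run
 * time is dominated by the same memory traffic. */
void poly(float *x, size_t n, float a)
{
    for (size_t i = 0; i < n; i++) {
        float v = x[i];
        x[i] = (((a * v + 1.0f) * v + 2.0f) * v + 3.0f) * v + 4.0f;
    }
}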
In short, x86 CPUs are unfortunately too complex to be accurately described with such a simple model. An access to memory goes through several layers of caching (typically L1, L2, and L3) before hitting RAM. Maybe all your data fits in L1 -- the second time you run your loop(s) there could be no RAM accesses at all.
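For example, one way to see that effect is to time the same small loop twice in a row: if the array fits in cache, the second pass is usually noticeably faster because the data is already resident. A rough POSIX sketch (the array size is chosen to fit typical L1/L2 caches; whether the compiler keeps the loop as written is another assumption, so treat the output as illustrative only):

#include <stdio.h>
#include <time.h>

#define N 4096                /* 16 KiB of floats: small enough for L1/L2 */

static float x[N];

/* Time one in-place pass of the question's second loop, in nanoseconds. */
static double pass(float a)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 1; i < N - 1; i++)
        x[i] = a * (x[i-1] - x[i] + x[i+1]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    printf("cold pass: %.0f ns\n", pass(0.5f));  /* may have to touch RAM */
    printf("warm pass: %.0f ns\n", pass(0.5f));  /* data likely in cache  */
    return 0;
}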
And there's not just the data cache. Don't forget that code is in memory too and has to be loaded into the instruction cache. Each read/write also goes through a virtual address, which is translated by the hardware TLB (and in extreme cases this can trigger a page fault and, say, cause the OS to write a page to disk in the middle of your loop). All of this assumes your program is hogging the hardware all to itself (on non-realtime OSes this is simply not the case, as other processes and threads are competing for the same limited resources).
Finally, the execution itself is not (directly) done with memory reads and writes, but rather the data is loaded into registers first (then the result is stored).
How the compiler allocates registers, whether it attempts loop unrolling or auto-vectorization, its instruction scheduling (interleaving instructions to avoid stalling on data dependencies), and so on will all also affect the actual throughput of the algorithm.
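As a small illustration of that last point, whether the compiler can auto-vectorize the first loop from the question depends partly on whether it can prove that x and y do not alias. The restrict-qualified sketch below is one common way of giving it that guarantee; what is actually emitted still depends on the compiler, the target, and the optimization flags.

/* C99 restrict promises the compiler that x and y never overlap, which often
 * lets it auto-vectorize and unroll this loop at -O2/-O3; without that
 * guarantee it may emit scalar code or add runtime overlap checks instead. */
void stencil(float *restrict y, const float *restrict x, int n, float a)
{
    for (int i = 1; i < n - 1; i++)
        y[i] = a * (x[i-1] - x[i] + x[i+1]);
}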
So, finally, depending on the code produced, the CPU model, the amount of data processed, and the state of the various caches, the run time of the algorithm can vary by orders of magnitude. Thus, the operational intensity of a loop cannot be determined by inspecting the code (or even the assembly produced) alone, since there are many other (non-linear) factors in play.
To address your actual question, though: as far as I can tell from the definition outlined here, the second loop would count as a single additional 4-byte access per iteration on average, so its OI would be Θ(3N FLOP / 4N bytes), i.e. about 0.75 FLOP/byte versus the 0.375 FLOP/byte of the first loop. Intuitively, this makes sense because the cache already has the data loaded, and the write can update the cache directly instead of going out to main memory (the data does eventually have to be written back, but that requirement is unchanged from the first loop).