I know that accessing main memory has a high latency if data is not cached. This question is about throughput.
What is the required complexity of a function to never be bound by main memory on a regular desktop PC?
I read about modern RAM having a bandwidth of 25-30 GB/s (DDR3 RAM, dual channel mode). As far as I can tell, a single core of a modern Intel processor can store at most 32 bytes per instruction using modern SIMD instruction sets, and it can retire at most about 4*10^9 such instructions per second (one per cycle at ~4 GHz). So effectively, one core can output about 128 GB/s. With a processor running 8 threads, the maximum output would be roughly 1 TB/s as a worst-case estimate.
The processor can therefore put out roughly 36 times the data that can be written to RAM. Is it safe to assume that any function that runs non-load/store operations for more than 36 cycles per SIMD store or load (or more than 9 cycles per regular 8-byte store or load) will never be bound by main memory? Can this estimate be lowered significantly, or is it too low for some reason?
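To make my arithmetic explicit (assuming a nominal 4 GHz clock, one 32-byte SIMD store per cycle per core, and ~28 GB/s of DDR3 dual-channel bandwidth as a middle value):

32 B/cycle * 4*10^9 cycles/s = 128 GB/s per core
128 GB/s * 8 threads ~= 1024 GB/s total
1024 GB/s / 28 GB/s (RAM) ~= 36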
Given that I have:
X = (x_1, x_2, ..., x_n) // dataset, large enough to make good use of caches
a(x), b(x), c(x, y), d(x) := c(a(x), b(x)) // functions that operate on elements
A(X) := (a(x_1), a(x_2), ..., a(x_n)) // functions that operate on data sets; B, C, D are defined analogously
I am looking for guidelines when it is better (or not worse) to implement
D(X)
as
C(A(X), B(X))
given that the first implementation puts more pressure on caches and registers and the second implementation has more load/store operations.
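To make the two variants concrete, here is a minimal sketch of what I mean, assuming float elements and placeholder definitions for a, b, c (the real functions would of course be different):

#include <cmath>
#include <cstddef>
#include <vector>

float a(float x) { return std::sqrt(x); }    // placeholder for a(x)
float b(float x) { return x * x; }           // placeholder for b(x)
float c(float y, float z) { return y + z; }  // placeholder for c(y, z)

// Variant 1: D(X) -- one pass over X, each element is loaded and the result
// stored once, but a(x), b(x) and c(...) all compete for registers at once.
void D(const std::vector<float>& X, std::vector<float>& out) {
    out.resize(X.size());
    for (std::size_t i = 0; i < X.size(); ++i)
        out[i] = c(a(X[i]), b(X[i]));
}

// Variant 2: C(A(X), B(X)) -- three passes; each pass does less work per
// element, but the intermediates A and B are written out and read back
// from memory (or, at best, from cache).
void C_of_A_B(const std::vector<float>& X, std::vector<float>& out) {
    std::vector<float> A(X.size()), B(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) A[i] = a(X[i]);
    for (std::size_t i = 0; i < X.size(); ++i) B[i] = b(X[i]);
    out.resize(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) out[i] = c(A[i], B[i]);
}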
(Of course, you can tell me to benchmark stuff; I'm fine with that. But sometimes I just want to make an educated guess and only revisit it when it becomes a problem or bottleneck later on.)
I think it very much depends on whether the code is written in such a way that the CPU can prefetch the next data item into cache. If it prefetches the wrong data then you'll still be memory bound regardless of the amount of time you spend processing the current data.
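For example, the difference I mean is roughly this (a sketch; the names and the gather-style indexing are just illustrative): the same per-element work can be prefetcher-friendly or not depending purely on the access pattern.

#include <cstddef>
#include <vector>

float sum_sequential(const std::vector<float>& data) {
    float s = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i)
        s += data[i];              // linear stride: the hardware prefetcher keeps up
    return s;
}

float sum_gather(const std::vector<float>& data,
                 const std::vector<std::size_t>& indices) {
    float s = 0.0f;
    for (std::size_t i = 0; i < indices.size(); ++i)
        s += data[indices[i]];     // unpredictable stride: likely memory bound
    return s;
}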
And if you have multiple threads writing to the same address (or to different addresses that happen to share a cache line), then even if the data has been prefetched correctly, another thread's write invalidates the local copy and the line has to be dumped and re-read from main memory again.
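A sketch of that effect (the struct names are illustrative, and 64 bytes is an assumed cache-line size): if several threads update counters that live on one cache line, the line bounces between cores on every write; padding each counter to its own line avoids that.

#include <atomic>

struct SharedLine {
    std::atomic<long> counters[8];  // all 8 counters fit in one 64-byte line:
                                    // writes from different threads invalidate
                                    // each other's cached copy
};

struct PaddedLine {
    struct alignas(64) Slot {       // 64 assumed as the cache-line size
        std::atomic<long> value;
    };
    Slot counters[8];               // each counter gets its own cache line
};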
In summary, I think it's not possible to reason about these kinds of things at this level, and it will depend on the exact scenario you have.