Does the hardware prefetcher help with this memory access pattern?

Tags: performance, ram, hardware, prefetch

I have two arrays: A with N_A random integers and B with N_B random integers between 0 and (N_A - 1). I use the numbers in B as indices into A in the following loop:

for(i = 0; i < N_B; i++) {
    sum += A[B[i]];
}

Experimenting on an Intel i7-3770 with N_A = 256 million and N_B = 64 million, this loop takes only 0.62 seconds, which corresponds to a memory access latency of about 9 nanoseconds.

This latency seems too small for an access that goes to main memory, so I was wondering whether the hardware prefetcher is playing a role. Can someone offer an explanation?

asked Feb 16 '14 08:02

Anuj Kalia

2 Answers

The HW prefetcher can see through your first level of indirection (B[i]) since these elements are sequential. It is capable of issuing multiple prefetches ahead, so you can assume that the average access into B hits the caches (either L1 or L2). However, there is no way the prefetcher can predict random addresses (the data stored in B) and prefetch the correct elements from A. You still have to go to memory for almost every access to A (disregarding occasional lucky cache hits due to reuse of lines).

The reason you see such low latency is that the accesses into A are not serialized: the CPU can access multiple elements of A simultaneously, so their latencies overlap instead of accumulating. In fact, you are measuring memory bandwidth here (how long it takes to access 64M elements overall), not memory latency (how long it takes to access a single element).

A reasonable "snapshot" of the CPU's memory unit would show several outstanding requests: a few accesses to B[i], B[i+16], ... (assuming 4-byte integers; the intermediate accesses simply get merged, since each request fetches a 64-byte line), all of which would probably be prefetches for future values of i, intermixed with random accesses to elements of A according to the previously fetched elements of B.
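The overlap can be put into rough numbers with Little's law (loads in flight = latency × completion rate). The sketch below is only an illustration: the function names are made up for this example, "64 million" is taken as 64e6, and the true memory latency passed in is a guess by the caller, not something measured in the question.

```c
#include <assert.h>

/* Average time per access in nanoseconds: total run time divided by the
 * number of accesses performed. */
double ns_per_access(double seconds, double accesses)
{
    assert(accesses > 0.0);
    return seconds * 1e9 / accesses;
}

/* Little's law: average number of loads in flight equals the latency of
 * one load divided by the time between completions.
 * assumed_latency_ns is a hypothetical value supplied by the caller. */
double implied_loads_in_flight(double assumed_latency_ns,
                               double observed_ns_per_access)
{
    assert(observed_ns_per_access > 0.0);
    return assumed_latency_ns / observed_ns_per_access;
}
```

With the question's figures, ns_per_access(0.62, 64e6) comes out a little under 10 ns. If one assumes, purely for illustration, a real latency of 80 ns, that implies roughly eight loads overlapping at any moment.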

To measure latency, you need each access to depend on the result of the previous one, for example by making the content of each element in A the index of the next access.
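One way to set this up is a pointer chase over a random single-cycle permutation. The following is a minimal sketch, not the answerer's code: build_cycle and chase are hypothetical names, the cycle is built with Sattolo's algorithm, and rand() is used only for brevity (its range may be too small to shuffle very large arrays well).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill next[0..n-1] with a permutation that forms one single cycle
 * (Sattolo's algorithm), so a chase visits every element before it
 * returns to its starting point. */
void build_cycle(size_t *next, size_t n)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i, never j == i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` loads. The address of each load is the
 * value returned by the previous one, so the loads cannot overlap.
 * Returning the final index keeps the compiler from removing the loop. */
size_t chase(const size_t *next, size_t n, size_t start, size_t steps)
{
    assert(start < n);
    size_t idx = start;
    for (size_t s = 0; s < steps; s++)
        idx = next[idx];
    return idx;
}
```

Timing a call to chase over an array much larger than the last-level cache and dividing by steps should then approximate the latency of a single access rather than the aggregate throughput.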

answered Oct 02 '22 12:10

Leeor

The CPU charges ahead in the instruction stream and will juggle multiple outstanding loads at once. The stream looks like this:

load b[0]
load a[b[0]]
add
loop code

load b[1]
load a[b[1]]
add
loop code

load b[2]
load a[b[2]]
add
loop code

...

The iterations are only serialized by the loop code, which runs quickly. Loads from different iterations can run concurrently; the concurrency is limited only by how many outstanding loads the CPU can handle.

I suspect you wanted to benchmark random, unpredictable, serialized memory loads. This is actually pretty hard on a modern CPU. Try to introduce an unbreakable dependency chain:

int lastLoad = 0;
for(i = 0; i < N_B; i++) {
    /* A needs one extra element, because B[i] + 1 can reach N_A */
    int load = A[B[i] + (lastLoad & 1)];
    sum += load;
    lastLoad = load;
}

This requires the previous load to complete before the address of the next load can be computed.
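To compare the two variants side by side, they can be wrapped as functions. This is a sketch under a few assumptions: the function names and the n_a parameter are invented for this example, and A is assumed to hold n_a + 1 elements so that the chained version never reads out of bounds. Note that the chained loop shifts some indices by one, so its sum differs from the original loop's sum.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: each load from A depends only on B[i], so loads from
 * different iterations can be in flight at the same time. */
long sum_independent(const int *A, size_t n_a, const int *B, size_t n_b)
{
    assert(n_a > 0);
    long sum = 0;
    for (size_t i = 0; i < n_b; i++)
        sum += A[B[i]];
    return sum;
}

/* Chained loop: the low bit of the previous load feeds into the next
 * address, so each load must wait for the one before it.
 * A must have n_a + 1 elements, since B[i] + 1 can equal n_a. */
long sum_chained(const int *A, size_t n_a, const int *B, size_t n_b)
{
    assert(n_a > 0);
    long sum = 0;
    int last_load = 0;
    for (size_t i = 0; i < n_b; i++) {
        int load = A[B[i] + (last_load & 1)];
        sum += load;
        last_load = load;
    }
    return sum;
}
```

Timing both functions on the same large arrays should show the difference between overlapped and serialized accesses; the chained version is expected to be several times slower per element.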

answered Oct 02 '22 13:10

usr
