I'm having an issue with the following code, and I can't figure out where the problem is. It occurs only on a v2 Intel processor, not on v3. Consider the following C++ code:
struct Tuple {
  size_t _a;
  size_t _b;
  size_t _c;
  size_t _d;
  size_t _e;
  size_t _f;
  size_t _g;
  size_t _h;
};

void
deref_A(Tuple& aTuple, const size_t& aIdx) {
  aTuple._a = A[aIdx];
}

void
deref_AB(Tuple& aTuple, const size_t& aIdx) {
  aTuple._a = A[aIdx];
  aTuple._b = B[aIdx];
}

void
deref_ABC(Tuple& aTuple, const size_t& aIdx) {
  aTuple._a = A[aIdx];
  aTuple._b = B[aIdx];
  aTuple._c = C[aIdx];
}

....

void
deref_ABCDEFG(Tuple& aTuple, const size_t& aIdx) {
  aTuple._a = A[aIdx];
  aTuple._b = B[aIdx];
  aTuple._c = C[aIdx];
  aTuple._d = D[aIdx];
  aTuple._e = E[aIdx];
  aTuple._f = F[aIdx];
  aTuple._g = G[aIdx];
}
Note that A, B, C, ..., G are plain arrays declared globally and filled with integers.
The "deref_*" methods simply assign values from the arrays (accessed via the index aIdx) to the given struct parameter aTuple. The first method assigns to a single field of the struct, and each subsequent method assigns to one more field than the previous one, up to all fields. The "deref_*" methods are called with the index (aIdx) running from 0 to the size of the arrays (all arrays have the same size). The index is used to access the array elements, as shown in the code -- pretty simple.
Now, consider the graph (http://docdro.id/AUSil1f), which depicts the performance for array sizes from 20 million up to 24 million elements (size_t = 8 bytes); the x-axis denotes the array size.
For arrays with 21 million elements, performance degrades for the methods touching at least 5 different arrays (i.e., deref_ABCDE through deref_ABCDEFG), hence the peaks you see in the graph. Performance then improves again for arrays with 22 million elements and onwards. I'm wondering why this happens only at a size of 21 million. It occurs only when I test on a server with an Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, but not with Haswell, i.e., v3. Apparently this was a known issue to Intel that has since been resolved, but I don't know what it is, or how to improve the code for v2.
I would highly appreciate any hint from your side.
I suspect you might be seeing cache-bank conflicts. Sandybridge/Ivybridge (Xeon Exxxx v1/v2) have them, Haswell (v3) doesn't.
Update from the OP: it was DTLB misses. Cache-bank conflicts will usually only be an issue when your working set fits in cache. Being limited to one 8B read per clock instead of 2 shouldn't stop a CPU from keeping up with main memory, even single-threaded (8B * 3GHz = 24GB/s, which is about equal to main-memory sequential-read bandwidth).
I think there's a perf counter for that, which you can check with perf or other tools.
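For the DTLB angle specifically, the generic Linux perf TLB events are an easy first check (these are the generic event names and may map differently on different CPUs/kernels; ./bench is a placeholder for your test binary):

perf stat -e dTLB-loads,dTLB-load-misses ./bench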
Quoting Agner Fog's microarchitecture doc (section 9.13):
Each consecutive 128 bytes, or two cache lines, in the data cache is divided into 8 banks of 16 bytes each. It is not possible to do two memory reads in the same clock cycle if the two memory addresses have the same bank number, i.e. if bit 4 - 6 in the two addresses are the same.
; Example 9.5. Sandy bridge cache
mov eax, [rsi] ; Use bank 0, assuming rsi is divisible by 40H
mov ebx, [rsi+100H] ; Use bank 0. Cache bank conflict
mov ecx, [rsi+110H] ; Use bank 1. No cache bank conflict
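In C++ terms, the bank number is just bits 4-6 of the address, so a quick way to check a pair of pointers looks like this (a small sketch for illustration, not part of the original code):

#include <cstdint>

// Sandybridge/Ivybridge L1d bank index: bits 4-6 of the address
// (8 banks of 16 bytes within each 128-byte, two-cache-line chunk).
inline unsigned l1d_bank(const void* p) {
    return static_cast<unsigned>(
        (reinterpret_cast<std::uintptr_t>(p) >> 4) & 0x7);
}

For example, l1d_bank(&A[i]) == l1d_bank(&B[i]) flags a potential conflict for two loads at the same index.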
Changing the total size of your arrays changes the distance between two elements with the same index, if they're laid out more or less head to tail.
If you have each array aligned to a different 16B offset (modulo 128), this will help some for SnB/IvB. Access to the same index in each array will be in a different cache bank, and thus can happen in parallel. Achieving this can be as simple as allocating 128B-aligned arrays, with 16*n extra bytes at the start of each one. (Keeping track of the pointer to eventually free separately from the pointer to dereference will be an annoyance.)
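A minimal sketch of that allocation scheme (the SkewedArray/alloc_skewed names are made up for illustration; error handling omitted):

#include <cstddef>
#include <cstdlib>

// Allocate n size_t elements, 128B-aligned, then skew the usable
// pointer by 16*k bytes so each array starts in a different bank.
// Keep `raw` to pass to std::free() later; index through `data`.
struct SkewedArray {
    void*   raw;
    size_t* data;
};

SkewedArray alloc_skewed(size_t n, size_t k) {
    // Round up to a multiple of 128 and over-allocate by 128 bytes
    // so the skew stays in bounds (std::aligned_alloc requires the
    // size to be a multiple of the alignment).
    size_t bytes = (n * sizeof(size_t) + 128 + 127) / 128 * 128;
    void* raw = std::aligned_alloc(128, bytes);
    size_t* data = reinterpret_cast<size_t*>(
        static_cast<char*>(raw) + 16 * (k % 8));
    return SkewedArray{raw, data};
}

// Usage: give each array a different skew, e.g.
//   A = alloc_skewed(N, 0).data;  B = alloc_skewed(N, 1).data;  ...
// so A[i], B[i], ... at the same index land in different banks.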
If the tuple where you're writing the results has the same address as a read, modulo 4096, you also get a false dependence. (i.e. a read from one of the arrays might have to wait for a store to the tuple.) See Agner Fog's doc for the details on that. I didn't quote that part because I think cache-bank conflicts are the more likely explanation. Haswell still has the false-dependence issue, but the cache-bank conflict issue is completely gone.
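If you want to rule that out, a quick check is whether a store address and a load address collide in the low 12 bits (again just an illustrative sketch):

#include <cstdint>

// True if a store to `dst` and a load from `src` have the same
// address modulo 4096 -- the 4K-aliasing false-dependence case
// described above.
inline bool same_mod_4k(const void* dst, const void* src) {
    return ((reinterpret_cast<std::uintptr_t>(dst) ^
             reinterpret_cast<std::uintptr_t>(src)) & 0xFFF) == 0;
}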