Why is a cache read miss faster than a write miss?

I need to compute an array (writeArray) from another array (readArray), but the index mapping between the two arrays is not the same (the value at index x of writeArray must be computed from the value at index y of readArray), so it's not very cache friendly.

However, I can choose whether the loop traverses readArray sequentially or writeArray sequentially.

So here is simplified code:

int *readArray = new int[ARRAY_SIZE];       // Array to read
int *writeArray = new int[ARRAY_SIZE];      // Array to write
int *refArray = new int[ARRAY_SIZE];        // Index mapping between reads and writes (could also be an array of pointers instead of indexes)

// Not shown here: readArray is initialized with values, writeArray with zeroes,
// and refArray with random indexes in [0, ARRAY_SIZE - 1] that map readArray
// entries to writeArray entries
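
// A possible initialization (my own sketch, not the code actually used):
// fill readArray with arbitrary values, zero writeArray, and pick a random
// index in [0, ARRAY_SIZE - 1] for each entry of refArray. Requires <random>.
std::mt19937 rng(12345);
std::uniform_int_distribution<int> dist(0, ARRAY_SIZE - 1);
for (int n = 0; n < ARRAY_SIZE; ++n) {
    readArray[n] = n;
    writeArray[n] = 0;
    refArray[n] = dist(rng);
}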

// Version 1: Random read (browse writeArray/refArray sequentially)
for (int n = 0; n < ARRAY_SIZE; ++n) {
    writeArray[n] = readArray[refArray[n]];
}

// Version 2: Random write (browse readArray/refArray sequentially)
for (int n = 0; n < ARRAY_SIZE; ++n) {
    writeArray[refArray[n]] = readArray[n];
}

I was thinking that a cache read miss would be slower than a write miss (because the CPU has to wait for the read to complete if the next instruction depends on the read data, whereas for a write it doesn't need to wait before processing the next instruction), but profiling shows that version 1 is faster than version 2 (version 2 is about 50% slower than version 1).
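
For reference, the timings can be measured with something like this (a minimal std::chrono sketch, not my exact profiling setup; in practice each loop should be run several times after a warm-up):

#include <chrono>
#include <cstdio>

auto start = std::chrono::steady_clock::now();
for (int n = 0; n < ARRAY_SIZE; ++n) {
    writeArray[n] = readArray[refArray[n]];   // swap in version 2 or 3 here to compare
}
auto end = std::chrono::steady_clock::now();
std::printf("elapsed: %.3f ms\n",
            std::chrono::duration<double, std::milli>(end - start).count());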

I also tried this:

// Version 3: Same as version 2, but without polluting the cache (requires <immintrin.h> for _mm_stream_si32)
for (int n = 0; n < ARRAY_SIZE; ++n) {
    _mm_stream_si32(&writeArray[refArray[n]], readArray[n]);
}

Since I don't need to read the values of writeArray, there is no reason to pollute the cache with the written values, but this version is much slower than the other versions (about 6700% slower than version 1).

Why is a write miss slower than a read miss? Why is bypassing the cache for writes slower than using it, even though the written data is never read afterwards?

asked Mar 12 '15 by Johnmph

1 Answer

Let's start with the last version - what you did is use streaming stores for a non-sequential (not a stream) access pattern. You're randomly accessing integers, which means you're doing partial writes (int sized) into full cache lines. On normal writes this shouldn't matter, since the core pulls the line into the cache and simply modifies the necessary chunk (and writes it back later, when the storage is needed for something else), but since you asked it to avoid caching, that partial merge actually has to be done in memory, which is very expensive and blocking. Streaming stores are useful only when you're guaranteed to modify the full line (e.g. by going over the array sequentially).
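
For contrast, here is a sketch of a pattern where streaming stores do make sense - writing the destination sequentially so that each cache line is fully overwritten (illustrative only, reusing the arrays from the question):

#include <immintrin.h>   // _mm_stream_si32, _mm_sfence

// Sequential destination: every int of each 64-byte line of writeArray is
// written, so the write-combining buffers can flush whole lines to memory
// without a read-modify-write of the line.
for (int n = 0; n < ARRAY_SIZE; ++n) {
    _mm_stream_si32(&writeArray[n], readArray[refArray[n]]);
}
_mm_sfence();   // make the non-temporal stores visible before any later reads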

As for the 2nd version - your assumption is correct: if there were a data dependency through the loads, you would have to wait for them, but there is no real dependency chain here. Each iteration has only a pair of loads with a 2-level dependency, and there is no interdependency between iterations to cause any serialization (i.e. iterations n==2 and n==3 may start even before n==1 finishes its first load). Effectively, assuming your CPU can sustain N outstanding accesses (depending on the sizes and cache levels involved), you'll launch the first N references to refArray in parallel (assuming the index calculation is fast), followed by the first N references to readArray, then the next batch, and so on.
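
To make the dependency structure concrete, here is version 1 unrolled by hand for two iterations (purely illustrative):

// The two loads inside one iteration form a short chain, but nothing in
// iteration 3 depends on iteration 2, so the hardware can overlap them.
int idx2 = refArray[2];            // load #1 of iteration 2
int idx3 = refArray[3];            // load #1 of iteration 3 - independent, can issue right away
writeArray[2] = readArray[idx2];   // load #2 depends only on idx2
writeArray[3] = readArray[idx3];   // load #2 depends only on idx3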

Now, since there's no data dependency, it becomes a question of bandwidth. In that case, generally speaking, loads are much easier for the processor to handle thanks to their out-of-order nature - you can launch them in parallel and out of order, as soon as you know the address (which depends only on the fast index calculation). Stores, on the other hand, need to become observable in program order (to retain memory consistency), which almost serializes them (there are some possible CPU tricks there, depending on your exact micro-architecture, but they won't change the big picture).

Edit: Another constraint added in version 2 (which I believe is even more critical) is memory disambiguation. The processor has to compute the load and store addresses in order to know whether there's any collision (we know there isn't, but the processor doesn't...). If a load depends on a store, it has to be blocked, in case the new data has to be forwarded. Now, since loads are launched early in an out-of-order machine, it becomes vital to know the addresses of all the stores as early as possible to avoid collisions (or worse - speculation that fails and causes massive flushes).

answered Oct 26 '22 by Leeor