 

x86 Assembly Force Cache Store

I have an assignment where I am required to measure the Latency of accessing data in L1, L2, and L3 cache, as well as main memory. This is to be done in C.

I've spent several hours researching ways to measure cache latency and have turned up very little. I have downloaded some benchmarking tools which have given me cache access times, but I have not gotten anywhere when it comes to implementing this in my own code. I understand that what happens with the cache is not normally under my control in C.

My next thought was that if I could force-populate the cache with something from x86 assembly (my first thought), then do a clock(), access(), clock() sequence on the data I just loaded, the measured time would be a reasonably accurate access time, since I know the data should be found in the cache, because I just put it there with my inline asm or a similar method...
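
To make that concrete, something like the sketch below is what I have in mind, though I am not sure it is sound. clock() itself seems far too coarse to resolve a single access, so this version uses clock_gettime and averages over many repeated reads:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        static volatile int data[1024];
        volatile int sink;
        struct timespec t0, t1;
        const int reps = 1000000;

        data[0] = 42;  /* touch the location so it should now be cached */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < reps; i++)
            sink = data[0];  /* repeated access to amortize timer overhead */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.2f ns per access\n", ns / reps);
        (void)sink;
        return 0;
    }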

If anyone can offer insight into this assignment, that would be fantastic, whether it be telling me I am crazy for wanting to use asm to load something into the cache, or introducing me to something else that might help me.

Thanks so much!

asked Sep 02 '13 by mrkanaly


People also ask

Is x86 cache coherent?

x86 does handle coherence. But read up on memory consistency: it's not guaranteed that all writes (such as writing the data and releasing the lock, to name two) will be visible to all CPUs in the same order! That's what the memory fences are for.
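
As a minimal C11 illustration of that point (a sketch, not tied to any particular codebase): release/acquire ordering is what guarantees that the data write becomes visible before the flag write:

    #include <stdatomic.h>

    int payload;           /* plain data */
    atomic_int ready;      /* flag guarding it */

    void producer(void)
    {
        payload = 123;
        /* release store: payload is guaranteed visible before ready */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* acquire load: once ready reads 1, payload reads 123 */
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;
        return payload;
    }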

Where is hardware cache stored?

The cache memory is located very close to the CPU, either on the CPU chip itself or on the motherboard in the immediate vicinity of the CPU and connected by a dedicated data bus. So instructions and data can be read from it (and written to it) much more quickly than is the case with normal RAM.


What are non temporal stores?

“Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”). So there is no benefit to keeping the data in the processor's cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).
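
On x86 these are issued with the streaming-store intrinsics. A minimal sketch, assuming SSE2 and a 16-byte-aligned destination (copy_nontemporal is an illustrative name):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* Copy a buffer with non-temporal stores so the destination data
     * bypasses the caches instead of displacing whatever is in them.
     * Assumes a 16-byte-aligned destination and a size that is a
     * multiple of 16 bytes. */
    void copy_nontemporal(void *dst, const void *src, size_t bytes)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
        _mm_sfence();  /* order the streaming stores before later stores */
    }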


2 Answers

There is no reason to use assembly at all for this assignment; your solution does not require it, and C will work just as well. I assume you are running on top of an operating system, so the OS is going to get in the way of your measurement, both by shuffling around things you thought you knew the location of and by distorting what you think you are measuring.

Cache basics, as far as taking these measurements is concerned: let's say there are four layers of memory. L1 is the fastest, but also the most expensive and the smallest. Then L2: slower, less expensive, likely larger than L1. L3 is less expensive still, slower, and larger, and then main memory is the slowest, cheapest, and largest of all.

Let's just say we have four chunks of memory we are going to work with: A, B, C, and D. L1 can only hold one chunk at a time, L2 two at a time, L3 three of the four, and main memory all four.

If we do a read, it first goes through L1; if there is a miss, then L2; if a miss, then L3; and if a miss there, it will always be found in main memory. Understand, though, that the data is cached on the way back, so L3, L2, and L1 will all contain the data just read, evicting as necessary (this is not always true, but assume this simple cache model to understand how to complete your task). So if we read chunk A, then L1, L2, and L3 will all contain chunk A. Now, in this hypothetical model, if we read chunk B, then L1 will contain B, evicting A; L2 will contain A and B; and L3 will contain A and B. Read C, and L1 will contain C, evicting B; let's say L2 chooses to evict A and contains B and C, and L3 contains A, B, and C. Read D, and L1 will contain D; let's say L2 evicts B and contains C and D, and L3 evicts A and contains B, C, and D.

Assume that we don't know exactly how each cache chooses what to evict and what to keep, but assume that we do know, or can figure out from motherboard specs or other sources, how large each cache is. Suppose the above example happened in that order: L1 has D, L2 has C and D, L3 has B, C, and D, and main memory has all four, A, B, C, and D. If, in that state, we read all of chunk A and time it, we are in theory reading it from main memory. It is not purely the time to read that memory, though: if any of the memory being evicted has changed, it has to be written upstream, possibly all the way up. But ideally, if you are mostly doing reads, then you will be timing mostly the reads.

Let's say that we found ourselves in the situation where chunk D is in L1, C and D are in L2, and B, C, and D are in L3, and we read all of chunk B and time it. Wouldn't that be measuring the time to access L3? With those same starting conditions, reading C would give us the L2 timing. With those same starting conditions, reading D would be the L1 timing, right?

The trick is to get yourself into those conditions. The cache sizes are likely not such that L2 is exactly twice the size of L1 and so on, so to completely control what is in L1 you need to read enough data to fill L1. More to the point, if you read an L3-sized amount of data, then in theory L3 holds all of that data, L2 holds the last L2-sized portion of it, and L1 holds the last L1-sized portion of it.
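
A rough sketch of that idea in C might look like the following. The cache sizes here are made-up placeholders, so substitute the real numbers for your part, and expect the OS scheduler and real replacement policies to blur the results:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Placeholder sizes: look up the real ones for your CPU. */
    #define L1_SIZE (32 * 1024)
    #define L2_SIZE (256 * 1024)
    #define L3_SIZE (8 * 1024 * 1024)
    #define LINE    64

    /* Time one read per cache line over a region; returns ns per line. */
    static double time_read(volatile char *p, size_t len)
    {
        struct timespec t0, t1;
        volatile char sink = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < len; i += LINE)
            sink += p[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / (double)(len / LINE);
    }

    int main(void)
    {
        char *raw = malloc(2 * (size_t)L3_SIZE);
        memset(raw, 1, 2 * (size_t)L3_SIZE);   /* fault in every page up front */
        volatile char *buf = raw;

        /* Prime: walk an L3-sized region front to back. In the simple
         * model the tail L1_SIZE bytes end up in L1, the tail L2_SIZE
         * bytes in L2, and the whole region in L3. */
        for (size_t i = 0; i < L3_SIZE; i += LINE)
            (void)buf[i];

        /* Measure the fastest level first so each pass does not disturb
         * the levels not yet measured. The region past L3_SIZE was
         * evicted by the priming walk, so it should come from RAM. */
        printf("L1:   %6.2f ns/line\n", time_read(buf + L3_SIZE - L1_SIZE, L1_SIZE));
        printf("L2:   %6.2f ns/line\n", time_read(buf + L3_SIZE - L2_SIZE, L2_SIZE - L1_SIZE));
        printf("L3:   %6.2f ns/line\n", time_read(buf, L3_SIZE - L2_SIZE));
        printf("main: %6.2f ns/line\n", time_read(buf + L3_SIZE, L3_SIZE));

        free(raw);
        return 0;
    }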

Using the data cache for this is easier than using the instruction cache, but you can do it either way. For the instruction side you need at least an L3-sized amount of instructions in main memory, a large quantity of nops; executing a linear chunk of instructions is no different from reading a linear chunk of memory, as far as read cycles are concerned. The instruction cache is easier as far as enabling and using the cache goes; enabling the data cache may or may not be simple, depending on your operating system and how you are managing memory.
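
If you do want to try the instruction-side variant on x86-64 Linux, a sketch might look like the following. It assumes the OS allows a writable and executable anonymous mapping (stock Linux does; hardened configurations may not), and the 8 MB size is just a placeholder:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    int main(void)
    {
        size_t len = 8 * 1024 * 1024;  /* roughly L3-sized: a placeholder */
        unsigned char *code = mmap(NULL, len,
                                   PROT_READ | PROT_WRITE | PROT_EXEC,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (code == MAP_FAILED)
            return 1;
        memset(code, 0x90, len - 1);   /* single-byte NOP sled */
        code[len - 1] = 0xC3;          /* RET at the end */

        void (*run)(void) = (void (*)(void))code;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        run();                          /* first pass: fetched from memory */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double cold = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        run();                          /* second pass: the tail is now cached */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double warm = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

        printf("cold: %.0f ns, warm: %.0f ns\n", cold, warm);
        return 0;
    }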

answered Sep 30 '22 by old_timer


You should be able to avoid assembler by looking at the assembler output of the compiler to understand the actual operations.
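
For example, with gcc (measure.c standing in for whatever your benchmark file is called):

    gcc -O2 -S measure.c    # writes measure.s containing the generated assembly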

Even if you get a high-resolution clock, there is little you can do about pre-emption by the OS when running the benchmark. You will need to perform many runs to get meaningful results.
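
A common way to aggregate those runs (a sketch; measure_once stands in for whatever timed loop you are using) is to keep the minimum, on the theory that preemption and interrupts only ever add time:

    /* Repeat a measurement and keep the minimum observed time. */
    double best_of(int runs, double (*measure_once)(void))
    {
        double best = measure_once();
        for (int i = 1; i < runs; i++) {
            double t = measure_once();
            if (t < best)
                best = t;
        }
        return best;
    }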

Rather than trying to place instructions into the cache, it may be easier to allow the processor to load them as they are run. If you place varying amounts of filler into the procedures, you may be able to get the cache line alignment to what you want.

answered Sep 30 '22 by Pekka