 

x86 Assembly Force Cache Store

I have an assignment where I am required to measure the Latency of accessing data in L1, L2, and L3 cache, as well as main memory. This is to be done in C.

I've spent several hours researching ways to measure cache latency and have turned up very little. I have downloaded some benchmarking tools which have given me cache access times, but I have not gotten anywhere when it comes to implementing this in my own code. I understand that what happens with the cache is not normally under my control in C.

My next thought was that if I could force-populate the cache with something from x86 assembly (my first thought), then do a clock(), access(), clock() sequence on the data I just loaded, the measured time would be a reasonably accurate access time, since I know the data should be found in the cache, because I just put it there with my inline asm or a similar method...
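
To make that concrete, something like the sketch below is what I have in mind, though I am not sure it is sound. clock() itself seems far too coarse to resolve a single access, so this version uses clock_gettime and averages over many repeated reads:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        static volatile int data[1024];
        volatile int sink;
        struct timespec t0, t1;
        const int reps = 1000000;

        data[0] = 42;  /* touch the location so it should now be cached */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < reps; i++)
            sink = data[0];  /* repeated access to amortize timer overhead */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.2f ns per access\n", ns / reps);
        (void)sink;
        return 0;
    }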

If anyone can offer insight into this assignment, that would be fantastic, whether it be telling me I am crazy for wanting to use asm to load something into the cache, or introducing me to something else that might help me.

Thanks so much!

asked Sep 02 '13 by mrkanaly


People also ask

Is x86 cache coherent?

x86 does handle coherence. But read up on memory consistency: it's not guaranteed that all writes (such as writing the data and releasing the lock, to name two) will be visible to all CPUs in the same order! That's what the memory fences are for.
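
As a minimal C11 illustration of that point (a sketch, not tied to any particular codebase): release/acquire ordering is what guarantees that the data write becomes visible before the flag write:

    #include <stdatomic.h>

    int payload;           /* plain data */
    atomic_int ready;      /* flag guarding it */

    void producer(void)
    {
        payload = 123;
        /* release store: payload is guaranteed visible before ready */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* acquire load: once ready reads 1, payload reads 123 */
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;
        return payload;
    }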

Where is hardware cache stored?

The cache memory is located very close to the CPU, either on the CPU chip itself or on the motherboard in the immediate vicinity of the CPU and connected by a dedicated data bus. So instructions and data can be read from it (and written to it) much more quickly than is the case with normal RAM.


What are non temporal stores?

“Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”). So there is no benefit to keeping the data in the processor's cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).
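
On x86 these are issued with the streaming-store intrinsics. A minimal sketch, assuming SSE2 and a 16-byte-aligned destination (copy_nontemporal is an illustrative name):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* Copy a buffer with non-temporal stores so the destination data
     * bypasses the caches instead of displacing whatever is in them.
     * Assumes a 16-byte-aligned destination and a size that is a
     * multiple of 16 bytes. */
    void copy_nontemporal(void *dst, const void *src, size_t bytes)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
        _mm_sfence();  /* order the streaming stores before later stores */
    }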


2 Answers

There is no reason to use assembly at all for this assignment; your solution does not require it, and C will work just as well. I assume you are running on top of an operating system, so the OS is going to get in the way of your measurement, both by shuffling around things you thought you knew the location of and by distorting what you think you are measuring.

Cache basics, as far as taking these measurements is concerned: let's say there are four layers of memory. L1 is the fastest, but also the most expensive and the smallest. Then L2: slower, less expensive, likely larger than L1. L3 is less expensive still, slower, and larger, and then main memory is the slowest, cheapest, and largest of all.

Let's just say we have four chunks of memory we are going to work with: A, B, C, and D. L1 can only hold one chunk at a time, L2 two at a time, L3 three of the four, and main memory all four.

If we do a read, it first goes through L1; if there is a miss, then L2; if a miss, then L3; and if a miss there, it will always be found in main memory. Understand, though, that the data is cached on the way back, so L3, L2, and L1 will all contain the data just read, evicting as necessary (this is not always true, but assume this simple cache model to understand how to complete your task). So if we read chunk A, then L1, L2, and L3 will all contain chunk A. Now, in this hypothetical model, if we read chunk B, then L1 will contain B, evicting A; L2 will contain A and B; and L3 will contain A and B. Read C, and L1 will contain C, evicting B; let's say L2 chooses to evict A and contains B and C, and L3 contains A, B, and C. Read D, and L1 will contain D; let's say L2 evicts B and contains C and D, and L3 evicts A and contains B, C, and D.

Assume that we don't know exactly how each cache chooses what to evict and what to keep, but assume that we do know, or can figure out from motherboard specs or other sources, how large each cache is. Suppose the above example happened in that order: L1 has D, L2 has C and D, L3 has B, C, and D, and main memory has all four, A, B, C, and D. If, in that state, we read all of chunk A and time it, we are in theory reading it from main memory. It is not purely the time to read that memory, though: if any of the memory being evicted has changed, it has to be written upstream, possibly all the way up. But ideally, if you are mostly doing reads, then you will be timing mostly the reads.

Let's say that we found ourselves in the situation where chunk D is in L1, C and D are in L2, and B, C, and D are in L3, and we read all of chunk B and time it. Wouldn't that be measuring the time to access L3? With those same starting conditions, reading C would give us the L2 timing. With those same starting conditions, reading D would be the L1 timing, right?

The trick is to get yourself into those conditions. The cache sizes are likely not such that L2 is exactly twice the size of L1 and so on, so to completely control what is in L1 you need to read enough data to fill L1. More to the point, if you read an L3-sized amount of data, then in theory L3 holds all of that data, L2 holds the last L2-sized portion of it, and L1 holds the last L1-sized portion of it.
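
A rough sketch of that idea in C might look like the following. The cache sizes here are made-up placeholders, so substitute the real numbers for your part, and expect the OS scheduler and real replacement policies to blur the results:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Placeholder sizes: look up the real ones for your CPU. */
    #define L1_SIZE (32 * 1024)
    #define L2_SIZE (256 * 1024)
    #define L3_SIZE (8 * 1024 * 1024)
    #define LINE    64

    /* Time one read per cache line over a region; returns ns per line. */
    static double time_read(volatile char *p, size_t len)
    {
        struct timespec t0, t1;
        volatile char sink = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < len; i += LINE)
            sink += p[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / (double)(len / LINE);
    }

    int main(void)
    {
        char *raw = malloc(2 * (size_t)L3_SIZE);
        memset(raw, 1, 2 * (size_t)L3_SIZE);   /* fault in every page up front */
        volatile char *buf = raw;

        /* Prime: walk an L3-sized region front to back. In the simple
         * model the tail L1_SIZE bytes end up in L1, the tail L2_SIZE
         * bytes in L2, and the whole region in L3. */
        for (size_t i = 0; i < L3_SIZE; i += LINE)
            (void)buf[i];

        /* Measure the fastest level first so each pass does not disturb
         * the levels not yet measured. The region past L3_SIZE was
         * evicted by the priming walk, so it should come from RAM. */
        printf("L1:   %6.2f ns/line\n", time_read(buf + L3_SIZE - L1_SIZE, L1_SIZE));
        printf("L2:   %6.2f ns/line\n", time_read(buf + L3_SIZE - L2_SIZE, L2_SIZE - L1_SIZE));
        printf("L3:   %6.2f ns/line\n", time_read(buf, L3_SIZE - L2_SIZE));
        printf("main: %6.2f ns/line\n", time_read(buf + L3_SIZE, L3_SIZE));

        free(raw);
        return 0;
    }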

Using the data cache for this is easier than using the instruction cache, but you can do it either way. For the instruction side you need at least an L3-sized amount of instructions in main memory, a large quantity of nops; executing a linear chunk of instructions is no different from reading a linear chunk of memory, as far as read cycles are concerned. The instruction cache is easier as far as enabling and using the cache goes; enabling the data cache may or may not be simple, depending on your operating system and how you are managing memory.
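
If you do want to try the instruction-side variant on x86-64 Linux, a sketch might look like the following. It assumes the OS allows a writable and executable anonymous mapping (stock Linux does; hardened configurations may not), and the 8 MB size is just a placeholder:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    int main(void)
    {
        size_t len = 8 * 1024 * 1024;  /* roughly L3-sized: a placeholder */
        unsigned char *code = mmap(NULL, len,
                                   PROT_READ | PROT_WRITE | PROT_EXEC,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (code == MAP_FAILED)
            return 1;
        memset(code, 0x90, len - 1);   /* single-byte NOP sled */
        code[len - 1] = 0xC3;          /* RET at the end */

        void (*run)(void) = (void (*)(void))code;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        run();                          /* first pass: fetched from memory */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double cold = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        run();                          /* second pass: the tail is now cached */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double warm = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

        printf("cold: %.0f ns, warm: %.0f ns\n", cold, warm);
        return 0;
    }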

answered Sep 30 '22 by old_timer


You should be able to avoid assembler by looking at the assembler output of the compiler to understand the actual operations.
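
For example, with gcc (measure.c standing in for whatever your benchmark file is called):

    gcc -O2 -S measure.c    # writes measure.s containing the generated assembly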

Even if you get a high-resolution clock, there is little you can do about pre-emption by the OS when running the benchmark. You will need to perform many runs to get meaningful results.
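
A common way to aggregate those runs (a sketch; measure_once stands in for whatever timed loop you are using) is to keep the minimum, on the theory that preemption and interrupts only ever add time:

    /* Repeat a measurement and keep the minimum observed time. */
    double best_of(int runs, double (*measure_once)(void))
    {
        double best = measure_once();
        for (int i = 1; i < runs; i++) {
            double t = measure_once();
            if (t < best)
                best = t;
        }
        return best;
    }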

Rather than trying to place instructions into the cache, it may be easier to allow the processor to load them as they are run. If you place varying amounts of filler into the procedures, you may be able to get the cache line alignment to what you want.

answered Sep 30 '22 by Pekka