 

What is the Cost of an L1 Cache Miss?

Edit: For reference purposes (if anyone stumbles across this question), Igor Ostrovsky wrote a great post about cache misses. It discusses several different issues and shows example numbers. End Edit

I did some testing <long story goes here> and am wondering if a performance difference is due to memory cache misses. The following code demonstrates the issue and boils it down to the critical timing portion. It has a couple of loops that visit memory in ascending address order and then in "random" order.

I ran it on an XP machine (compiled with VS2005: cl /O2) and on a Linux box (gcc -Os). Both produced similar times. These times are in milliseconds. I believe all loops are running and are not optimized out (otherwise it would run "instantly").

 *** Testing 20000 nodes
 Total Ordered Time: 888.822899
 Total Random Time: 2155.846268

Do these numbers make sense? Is the difference primarily due to L1 cache misses, or is something else going on as well? There are 20,000^2 memory accesses, and if every one were a cache miss, the extra time works out to about 3.2 nanoseconds per miss. The XP (P4) machine I tested on runs at 3.2 GHz, and I suspect (but don't know) that it has a 32 KB L1 cache and a 512 KB L2. With 20,000 entries (80 KB), I assume there is not a significant number of L2 misses. So this would be (3.2*10^9 cycles/second) * (3.2*10^-9 seconds/miss) = 10.1 cycles/miss. That seems high to me. Maybe it's not, or maybe my math is bad. I tried measuring cache misses with VTune, but I got a BSOD. And now I can't get it to connect to the license server (grrrr).
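To make that arithmetic explicit, here is a minimal C sketch of the estimate. It simply plugs in the measured times above and assumes, as in the reasoning above, that the entire ordered/random gap is attributable to cache misses on a 3.2 GHz clock:

#include <stdio.h>

int main( void )
{
   // Measured times from the test output above
   double dOrderedMS = 888.822899;
   double dRandomMS  = 2155.846268;

   double dAccesses  = 20000.0 * 20000.0;   // 20,000 x 20,000 inner-loop writes
   double dClockHz   = 3.2e9;               // 3.2 GHz P4 (assumed, as above)

   // Attribute the entire ordered/random gap to cache misses (an assumption)
   double dExtraSec      = ( dRandomMS - dOrderedMS ) / 1000.0;
   double dSecPerMiss    = dExtraSec / dAccesses;         // ~3.17e-9 s
   double dCyclesPerMiss = dSecPerMiss * dClockHz;        // ~10.1 cycles

   printf( "%.2f ns/miss, %.1f cycles/miss\n", dSecPerMiss * 1e9, dCyclesPerMiss );
   return 0;
}

With the numbers above this prints roughly 3.17 ns/miss and about 10.1 cycles/miss, which is where the figure in the previous paragraph comes from.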

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>

#if defined( WIN32 )
#include <windows.h>
#else
// LONGLONG is a Windows type; define it for the non-Windows build
typedef long long LONGLONG;
#endif

typedef struct stItem
{
   long     lData;
   //char     acPad[20];
} LIST_NODE;

#if defined( WIN32 )
void StartTimer( LONGLONG *pt1 )
{
   QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}

void StopTimer( LONGLONG t1, double *pdMS )
{
   LONGLONG t2, llFreq;

   QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
   QueryPerformanceFrequency( (LARGE_INTEGER*)&llFreq );
   *pdMS = ((double)( t2 - t1 ) / (double)llFreq) * 1000.0;
}
#else
// doesn't need 64-bit integer in this case
void StartTimer( LONGLONG *pt1 )
{
   // Just use clock(), this test doesn't need higher resolution
   *pt1 = clock();
}

void StopTimer( LONGLONG t1, double *pdMS )
{
   LONGLONG t2 = clock();
   *pdMS = (double)( t2 - t1 ) / ( CLOCKS_PER_SEC / 1000 );
}
#endif

long longrand()
{
   #if defined( WIN32 )
   // Stupid cheesy way to make sure it is not just a 16-bit rand value
   return ( rand() << 16 ) | rand();
   #else
   return rand();
   #endif
}

// get random value in the given range
int randint( int m, int n )
{
   int ret = longrand() % ( n - m + 1 );
   return ret + m;
}

// I think I got this out of Programming Pearls (Bentley).
void ShuffleArray
   (
   long *plShuffle,  // (O) return array of "randomly" ordered integers
   long lNumItems    // (I) length of array
   )
{
   long i;
   long j;
   long t;

   for ( i = 0; i < lNumItems; i++ )
      plShuffle[i] = i;

   for ( i = 0; i < lNumItems; i++ )
      {
      j = randint( i, lNumItems - 1 );

      t = plShuffle[i];
      plShuffle[i] = plShuffle[j];
      plShuffle[j] = t;
      }
}

int main( int argc, char* argv[] )
{
   long          *plDataValues;
   LIST_NODE     *pstNodes;
   long          lNumItems = 20000;
   long          i, j;
   LONGLONG      t1;  // for timing
   double dms;

   if ( argc > 1 && atoi(argv[1]) > 0 )
      lNumItems = atoi( argv[1] );

   printf( "\n\n*** Testing %ld nodes\n", lNumItems );

   srand( (unsigned int)time( 0 ));

   // allocate the nodes as one single chunk of memory
   pstNodes = (LIST_NODE*)malloc( lNumItems * sizeof( LIST_NODE ));
   assert( pstNodes != NULL );

   // Create an array that gives the access order for the nodes
   plDataValues = (long*)malloc( lNumItems * sizeof( long ));
   assert( plDataValues != NULL );

   // Access the data in order
   for ( i = 0; i < lNumItems; i++ )
      plDataValues[i] = i;

   StartTimer( &t1 );

   // Loop through and access the memory a bunch of times
   for ( j = 0; j < lNumItems; j++ )
      {
      for ( i = 0; i < lNumItems; i++ )
         {
         pstNodes[plDataValues[i]].lData = i * j;
         }
      }

   StopTimer( t1, &dms );
   printf( "Total Ordered Time: %f\n", dms );

   // now access the array positions in a "random" order
   ShuffleArray( plDataValues, lNumItems );

   StartTimer( &t1 );

   for ( j = 0; j < lNumItems; j++ )
      {
      for ( i = 0; i < lNumItems; i++ )
         {
         pstNodes[plDataValues[i]].lData = i * j;
         }
      }

   StopTimer( t1, &dms );
   printf( "Total Random Time: %f\n", dms );

   return 0;
}
asked Jul 14 '09 by Mark Wilkins


People also ask

Why is L1 cache so expensive?

L1 is closest to the processor and is consulted on every memory access, so its accesses are very frequent. Thus, it needs to return data really fast (usually within one clock cycle). It also needs lots of read/write ports and high access bandwidth.

Is L1 cache more expensive than L2?

The main reason is that L1 cache is faster, and therefore more expensive. Compare the physical sizes of the L1, L2, and L3 caches of an AMD Zen core, for example: the density increases dramatically with the cache level.

Is L1 cache more expensive than RAM?

In order to be close to the processor, cache memory needs to be much smaller than main memory. Consequently, it has less storage space. It is also more expensive than main memory, as it is a more complex chip that yields higher performance.

How much L1 cache do I need?

The size of the L1 cache depends on the CPU. Some top-end consumer CPUs now feature a 1 MB L1 cache, like the Intel i9-9980XE, but these cost a huge amount of money and are still few and far between. Some server chips, like Intel's Xeon range, also feature 1-2 MB of L1 cache.


1 Answer

Here is an attempt to provide insight into the relative cost of cache misses by analogy with baking chocolate chip cookies ...

Your hands are your registers. It takes you 1 second to drop chocolate chips into the dough.

The kitchen counter is your L1 cache, twelve times slower than registers. It takes 12 x 1 = 12 seconds to step to the counter, pick up the bag of walnuts, and empty some into your hand.

The fridge is your L2 cache, four times slower than L1. It takes 4 x 12 = 48 seconds to walk to the fridge, open it, move last night's leftovers out of the way, take out a carton of eggs, open the carton, put 3 eggs on the counter, and put the carton back in the fridge.

The cupboard is your L3 cache, three times slower than L2. It takes 3 x 48 = 2 minutes and 24 seconds to take three steps to the cupboard, bend down, open the door, root around to find the baking supply tin, extract it from the cupboard, open it, dig to find the baking powder, put it on the counter and sweep up the mess you spilled on the floor.

And main memory? That's the corner store, 5 times slower than L3. It takes 5 x 2:24 = 12 minutes to find your wallet, put on your shoes and jacket, dash down the street, grab a litre of milk, dash home, take off your shoes and jacket, and get back to the kitchen.

Note that all these accesses are constant complexity -- O(1) -- but the differences between them can have a huge impact on performance. Optimizing purely for big-O complexity is like deciding whether to add chocolate chips to the batter 1 at a time or 10 at a time, but forgetting to put them on your grocery list.

Moral of the story: Organize your memory accesses so the CPU has to go for groceries as rarely as possible.

Numbers were taken from the CPU Cache Flushing Fallacy blog post, which indicates that for a particular 2012-era Intel processor, the following is true:

  • register access = 4 instructions per cycle
  • L1 latency = 3 cycles (12 x register)
  • L2 latency = 12 cycles (4 x L1, 48 x register)
  • L3 latency = 38 cycles (3 x L2, 12 x L1, 144 x register)
  • DRAM latency = 65 ns = 195 cycles on a 3 GHz CPU (5 x L3, 15 x L2, 60 x L1, 720 x register)
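To put those cycle counts into wall-clock terms, here is a minimal C sketch; it is an illustration only, assuming the 3 GHz clock and the latencies listed above:

#include <stdio.h>

int main( void )
{
   const double dClockGHz  = 3.0;                   // 3 GHz CPU, i.e. 3 cycles per nanosecond
   const double adCycles[] = { 3, 12, 38, 195 };    // L1, L2, L3, DRAM latencies from the list above
   const char  *apszName[] = { "L1", "L2", "L3", "DRAM" };

   // latency in nanoseconds = cycles / (cycles per nanosecond)
   for ( int i = 0; i < 4; i++ )
      printf( "%-4s : %5.0f cycles = %6.1f ns\n",
              apszName[i], adCycles[i], adCycles[i] / dClockGHz );

   return 0;
}

At 3 GHz that works out to roughly 1 ns for L1, 4 ns for L2, about 12.7 ns for L3, and the 65 ns DRAM figure quoted above.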

The Gallery of Processor Cache Effects also makes good reading on this topic.

Mmmm, cookies ...

answered Oct 06 '22 by yoyo