 

Why is the second iteration over a large number of bytes significantly slower, and how can it be fixed?

This code:

#include <memory>
#include <time.h>
#include <chrono>
#include <thread>
#include <stdio.h>
#include <stdlib.h>

void Test( ) {
#define current_milliseconds std::chrono::duration_cast<std::chrono::milliseconds>( std::chrono::system_clock::now( ).time_since_epoch( ) ).count( )
    int *c = ( int* )malloc( 1024 * 1024 * 1024 ); // 1 GiB, deliberately left uninitialized
    int result = 0;
    auto millis = -current_milliseconds;
    //clock_t timer = -clock( );
    for ( int i = 0 ; i < 1024 * 1024 * 256 /* = 1024 * 1024 * 1024 bytes / 4 bytes per int */; ++i )
        result += c[ i ];
    millis += current_milliseconds;
    // count( ) returns an implementation-defined integer type; cast so the format specifier matches everywhere
    printf( "Took: %lldms (JUST PASSING BY: %d)\n", ( long long )millis, result );
    free( c );
#undef current_milliseconds
}

int main( ) {
    std::this_thread::sleep_for( std::chrono::milliseconds( 1 ) );
    Test( );
    std::this_thread::sleep_for( std::chrono::milliseconds( 1 ) );
    Test( );
    return -1;
}
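
As an aside on the timing: std::chrono::system_clock can jump if the wall clock is adjusted, so std::chrono::steady_clock is the usual choice for measuring intervals. A minimal sketch of the same measurement with steady_clock (an illustration, not the code that produced the numbers below):

#include <chrono>
#include <cstdio>
#include <cstdlib>

void TestSteady( ) {
    int *c = ( int* )malloc( 1024 * 1024 * 1024 );  // 1 GiB, deliberately uninitialized
    long long result = 0;
    auto start = std::chrono::steady_clock::now( ); // monotonic: immune to wall-clock changes
    for ( int i = 0 ; i < 1024 * 1024 * 256 ; ++i )
        result += c[ i ];
    auto elapsed = std::chrono::steady_clock::now( ) - start;
    printf( "Took: %lldms\n", ( long long )std::chrono::duration_cast<std::chrono::milliseconds>( elapsed ).count( ) );
    free( c );
}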

I ran the test 7 times; here are the last 6 outputs:

Took: 502ms (JUST PASSING BY: 0)
Took: 607ms (JUST PASSING BY: 0)

Took: 480ms (JUST PASSING BY: 0)
Took: 588ms (JUST PASSING BY: 0)

Took: 492ms (JUST PASSING BY: 0)
Took: 562ms (JUST PASSING BY: 0)

Took: 506ms (JUST PASSING BY: 0)
Took: 558ms (JUST PASSING BY: 0)

Took: 470ms (JUST PASSING BY: 0)
Took: 555ms (JUST PASSING BY: 0)

Took: 510ms (JUST PASSING BY: 0)
Took: 562ms (JUST PASSING BY: 0)

If your output is different, try running the executable again (to rule out hard-disk cache misses), or try enlarging the number of iterations and allocated bytes (just a hunch).

Notice that the timed range covers only the loop, not the allocation. So, the question again: why is the second iteration slower, and is there a way to fix it?

Additional Information:

  1. The PC has a 2.8 GHz dual-core Pentium (Intel E6300) processor, 4 GB of RAM (2.2 GB free before executing the tests), and an enterprise Intel SSD.
  2. It seems that while executing the tests, the system wrote a few hundred MB to disk. Why would it, when there was enough free RAM? (I deallocated 1 GB and then allocated another 1 GB; it shouldn't need to push anything to the swap file.)
asked Mar 11 '15 by LyingOnTheSky

2 Answers

Output on my machine:

Took: 371ms (JUST PASSING BY: 0)
Took: 318ms (JUST PASSING BY: 0)

This is closer to what I expect most programmers will see when they try your program. You can make a small change to get a vastly different outcome:

int *c = (int*)malloc(1024 * 1024 * 1024);
memset(c, 0, 1024 * 1024 * 1024);          // <== added (memset needs <string.h>)
// etc..

Which produces on my machine:

Took: 104ms (JUST PASSING BY: 0)
Took: 102ms (JUST PASSING BY: 0)

A fat 3x speed-up, just from initializing the memory content. Hope it reproduces on your machine; it should. The first conclusion to draw is that you've been benchmarking something completely different from the cost of your code, a very typical benchmark hazard on modern machines.

This is the behavior of a demand-paged virtual memory operating system, like Windows, Linux, or OSX. Your malloc() call never actually allocated any memory, it merely reserved address space: just numbers to the processor, one for each 4096 bytes of memory. You don't pay the cost of using memory until later, when you actually address it. That is when the demand-paged feature comes into play.

Which happens in your result += c[ i ]; statement. At that point the pedal must meet the metal and the operating system is forced to actually make the memory available. Every 4096 bytes your program generates a page fault. The operating system steps in and maps 4096 bytes of RAM to the virtual memory page. Your program generates 1GB / 4096 = 262,144 of those page faults. You can conclude that your operating system requires roughly 400ms/262144 ~= 1.5 microseconds to handle a page fault. Mine is about twice as fast at it.

Note how the memset() call hid that cost: it generated all those page faults before you started to time the code execution, thus truly measuring the cost of the code and avoiding the otherwise inevitable initialization overhead.
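
If the goal is only to pre-fault the pages rather than pay for the full 1 GB memset(), touching one location per page is enough to trigger each fault. A minimal sketch, assuming 4096-byte pages (on POSIX the real page size could be queried with sysconf(_SC_PAGESIZE)):

#include <stddef.h>

// One write per 4096-byte page forces the OS to map a physical page,
// so the page faults happen here instead of inside the timed loop.
void prefault( char *p, size_t bytes ) {
    for ( size_t i = 0 ; i < bytes ; i += 4096 )
        p[ i ] = 0;
}

// Usage, mirroring the question's allocation:
//   int *c = ( int* )malloc( 1024 * 1024 * 1024 );
//   prefault( ( char* )c, 1024 * 1024 * 1024 );
//   ... timed loop ...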

How long your first run takes is going to depend on how quickly the operating system can make the RAM available. Actual measurements can vary greatly from one try to the next; it depends on how many other processes have mapped RAM pages. It might take quite a while if the OS needs to find space and unmap pages first, preserving their content in the paging file.

How long the second run takes is going to depend on how quickly the operating system can recycle the RAM pages if there are not enough of them available to map another gigabyte. Not much of a problem on mine; I have 8 GB of RAM and am using only 5.6 GB of it right now. Recycled pages need to be zero-initialized, a low-priority duty on a typical OS.

So, basic conclusion here:

  • You are measuring an initialization cost; it is not at all representative of how your program is going to be affected while it continues to use the memory. Your benchmark, as-is, doesn't tell you anything.
  • The difference you observed is an operating system implementation detail; it doesn't have anything to do with your program.
answered by Hans Passant


You're accessing pages for the first time on that first lap, and that's going quicker than the second, so whatever memory management your OS does on a freshly-allocated gigabyte is cheaper the first time.

If your OS maps newly-allocated storage to a page full of zeros, maps newly-referenced memory to its own page on access rather than on write, and doesn't map a gigabyte allocation with the big 2M pages, then on the second lap through your gigabyte you'll be TLB-missing pretty much every 4K. That can get ... not cheap.
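
On Linux, one way to probe the 2M-page theory is to hint explicitly that the allocation should use transparent huge pages. A sketch under that assumption; madvise(MADV_HUGEPAGE) is advisory, so the kernel may decline:

#include <stdlib.h>
#include <sys/mman.h>

// Allocate aligned to the 2 MiB huge-page size and hint that huge pages
// are preferred. With 2 MiB pages a 1 GiB buffer needs ~512 TLB entries
// instead of ~262,144, so comparing timings tests the TLB-miss theory.
int *alloc_huge( size_t bytes ) {
    void *p = NULL;
    if ( posix_memalign( &p, 2 * 1024 * 1024, bytes ) != 0 )
        return NULL;
    madvise( p, bytes, MADV_HUGEPAGE );  // advisory; may be ignored
    return ( int* )p;
}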

So I'm thinking that whatever your OS is doing to map a newly-referenced page might be faster than a full TLB miss. That would easily happen if, for instance, PTEs are cacheable and the OS clears each new bottom-level page table as it allocates it: your first-lap TLB misses could then reload from pre-primed cache lines, recently initialized on-chip. By the second lap, the serial misses would have long since flushed those entries, and the CPU has to fetch them over that long, long bus (which at this level of concern might better be viewed as a network connection).

I don't know the x86 performance counters or tools to check them, but if you've got access to decent tooling it should be easy to check theories by looking at the right counter deltas.
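
As a cheap first check that needs no dedicated tooling: on POSIX systems getrusage() exposes the process's minor and major page-fault counts (TLB-miss counters would still need something like perf). A sketch of wrapping a run with it; count_faults is a hypothetical helper, not part of either answer:

#include <stdio.h>
#include <sys/resource.h>

// Print the minor (soft) and major (hard) page-fault deltas across f().
// A big jump in minor faults on the first lap but not the second would
// support the demand-paging explanation.
void count_faults( void ( *f )( ) ) {
    struct rusage before, after;
    getrusage( RUSAGE_SELF, &before );
    f( );
    getrusage( RUSAGE_SELF, &after );
    printf( "minor faults: %ld, major faults: %ld\n",
            after.ru_minflt - before.ru_minflt,
            after.ru_majflt - before.ru_majflt );
}

// Usage: count_faults( Test );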

answered by jthill