Measuring How Slow Atomic Increments are As Compared to Regular Integer Increments

Question

A recent discussion made me wonder how expensive an atomic increment is in comparison to a regular integer increment.

I wrote some code to try and benchmark this:

#include <iostream>
#include <atomic>
#include <chrono>

static const int NUM_TEST_RUNS = 100000;
static const int ARRAY_SIZE = 500;

void runBenchmark(std::atomic<int>& atomic_count, int* count_array, int array_size, bool do_atomic_increment){    
    for(int i = 0; i < array_size; ++i){
        ++count_array[i];        
    }

    if(do_atomic_increment){
        ++atomic_count;
    }
}

int main(int argc, char* argv[]){

    int num_test_runs = NUM_TEST_RUNS;
    int array_size = ARRAY_SIZE;

    if(argc == 3){
        num_test_runs = atoi(argv[1]);
        array_size = atoi(argv[2]);        
    }

    if(num_test_runs == 0 || array_size == 0){
        std::cout << "Usage: atomic_operation_overhead <num_test_runs> <num_integers_in_array>" << std::endl;
        return 1;   
    }

    // Instantiate atomic counter
    std::atomic<int> atomic_count;

    // Allocate the integer buffer that will be updated every time
    int* count_array = new int[array_size];

    // Track the time elapsed in case of incrmeenting with mutex locking
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < num_test_runs; ++i){
        runBenchmark(atomic_count, count_array, array_size, true);        
    }
    auto end = std::chrono::steady_clock::now();

    // Calculate time elapsed for incrementing without mutex locking
    auto diff_with_lock = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << "Elapsed time with atomic increment for " 
              << num_test_runs << " test runs: "
              << diff_with_lock.count() << " ns" << std::endl;

    // Track the time elapsed in case of incrementing without a mutex locking
    start = std::chrono::steady_clock::now();
    for(unsigned int i = 0; i < num_test_runs; ++i){
        runBenchmark(atomic_count, count_array, array_size, false);
    }
    end = std::chrono::steady_clock::now();

    // Calculate time elapsed for incrementing without mutex locking
    auto diff_without_lock = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << "Elapsed time without atomic increment for " 
              << num_test_runs << " test runs: "
              << diff_without_lock.count() << " ns" << std::endl;

    auto difference_running_times = diff_with_lock - diff_without_lock;
    auto proportion = difference_running_times.count() / (double)diff_without_lock.count();          
    std::cout << "How much slower was locking: " << proportion * 100.0 << " %" << std::endl;           

    // We loop over all entries in the array and print their sum
    // We do this mainly to prevent the compiler from optimizing out
    // the loop where we increment all the values in the array
    int array_sum = 0;
    for(int i = 0; i < array_size; ++i){
        array_sum += count_array[i];
    }
    std::cout << "Array sum (just to prevent loop getting optimized out): " << array_sum << std::endl;

    delete [] count_array;

    return 0;
}

The trouble I am having is that this program produces widely divergent results in each run:

balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 99852 ns
Elapsed time without atomic increment for 1000 test runs: 96396 ns
How much slower was locking: 3.58521 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 182769 ns
Elapsed time without atomic increment for 1000 test runs: 138319 ns
How much slower was locking: 32.1359 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 98858 ns
Elapsed time without atomic increment for 1000 test runs: 96404 ns
How much slower was locking: 2.54554 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 107848 ns
Elapsed time without atomic increment for 1000 test runs: 105174 ns
How much slower was locking: 2.54245 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 113865 ns
Elapsed time without atomic increment for 1000 test runs: 100559 ns
How much slower was locking: 13.232 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 98956 ns
Elapsed time without atomic increment for 1000 test runs: 106639 ns
How much slower was locking: -7.20468 %

This causes me to believe that there might be a bug in the benchmarking code itself. Is there some error I am missing? Is my usage of std::chrono for benchmarking incorrect? Or is the time difference on account of the overhead to the signal handling in the OS pertaining to atomic operations?

What could I be doing wrong?

Test bed:

Intel® Core™ i7-4700MQ CPU @ 2.40GHz × 8  
8GB RAM 
GNU/Linux:Ubuntu LTS 14.04 (64 bit)
GCC version: 4.8.4    
Compilation: g++ -std=c++11 -O3 atomic_operation_overhead.cpp  -o atomic_operation_overhead

EDIT: Updated the test run output after compiling with -O3 optimization.

EDIT: After running the tests for an increased number of iterations and adding a loop sum to prevent optimization out of the looped increment as suggested by Adam, I get more convergent results:

balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7111974931 ns
Elapsed time without atomic increment for 99999999 test runs: 6938317779 ns
How much slower was locking: 2.50287 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7424952991 ns
Elapsed time without atomic increment for 99999999 test runs: 7262721866 ns
How much slower was locking: 2.23375 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7172114343 ns
Elapsed time without atomic increment for 99999999 test runs: 7030985219 ns
How much slower was locking: 2.00725 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7094552104 ns
Elapsed time without atomic increment for 99999999 test runs: 6971060941 ns
How much slower was locking: 1.77148 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7099907902 ns
Elapsed time without atomic increment for 99999999 test runs: 6970289856 ns
How much slower was locking: 1.85958 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7763604675 ns
Elapsed time without atomic increment for 99999999 test runs: 7229145316 ns
How much slower was locking: 7.39312 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7164534212 ns
Elapsed time without atomic increment for 99999999 test runs: 6994993609 ns
How much slower was locking: 2.42374 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7154697145 ns
Elapsed time without atomic increment for 99999999 test runs: 6997030700 ns
How much slower was locking: 2.25333 %
Array sum (just to prevent loop getting optimized out): 1215751192

Adam · Accepted Answer

Some thoughts:

Run more iterations, at least enough that it takes a few seconds. Your runs take milliseconds, so an IO interrupt could be enough to skew your results.
Print out the sum at the end. The compiler might be smart enough to optimize your loops far more than you think, so your code might be doing less work than you think. If the compiler sees the value is never read, it might delete your loops entirely.
Do the iteration all in one loop, as opposed to a loop that calls a function. While the compiler will probably inline your function call, it's best to not introduce another potential source of noise.
I'm sure you'll do this next, but add a threaded test. Might as well do it for both; you'll get a wrong sum in the non-atomic variable due to races, but at least you'll see the performance penalty you pay for consistency.

Measuring How Slow Atomic Increments are As Compared to Regular Integer Increments

Tags:

c++11

atomic

stdatomic

balajeerc

1 Answers

Adam

Recent Activity

Donate For Us

Measuring How Slow Atomic Increments are As Compared to Regular Integer Increments

Tags:

c++11

atomic

stdatomic

balajeerc

1 Answers

Adam

Related questions

Recent Activity

Donate For Us