 

Is there a penalty for using static variables in C++11

In C++11, this:

const std::vector<int>& f() {
    static const std::vector<int> x { 1, 2, 3 };
    return x;
}

is thread-safe. However, is there an extra penalty for calling this function after the first time, i.e. once it has already been initialized, due to this thread-safety guarantee? I am wondering whether the function will be slower than one using a global variable, because it might have to acquire a mutex (or otherwise check whether another thread is still initializing the variable) every time it is called.

asked Jan 31 '14 by user3175411

2 Answers

"The best intution to be ever had is 'I should measure this.'" So let's find out:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

namespace {
class timer {
    using hrc = std::chrono::high_resolution_clock;
    hrc::time_point start;

    static hrc::time_point now() {
      // Prevent memory operations from reordering across the
      // time measurement. This is likely overkill, needs more
      // research to determine the correct fencing.
      std::atomic_thread_fence(std::memory_order_seq_cst);
      auto t = hrc::now();
      std::atomic_thread_fence(std::memory_order_seq_cst);
      return t;
    }

public:
    timer() : start(now()) {}

    hrc::duration elapsed() const {
      return now() - start;
    }

    template <typename Duration>
    typename Duration::rep elapsed() const {
      return std::chrono::duration_cast<Duration>(elapsed()).count();
    }

    template <typename Rep, typename Period>
    Rep elapsed() const {
      return elapsed<std::chrono::duration<Rep,Period>>();
    }
};

const std::vector<int>& f() {
    static const auto x = std::vector<int>{ 1, 2, 3 };
    return x;
}

static const auto y = std::vector<int>{ 1, 2, 3 };
const std::vector<int>& g() {
    return y;
}

const unsigned long long n_iterations = 500000000;

template <typename F>
void test_one(const char* name, F f) {
  f(); // First call outside the timer.

  using value_type = typename std::decay<decltype(f()[0])>::type;
  std::cout << name << ": " << std::flush;

  auto t = timer{};
  auto sum = uint64_t{};
  for (auto i = n_iterations; i > 0; --i) {
    const auto& vec = f();
    sum += std::accumulate(begin(vec), end(vec), value_type{});
  }
  const auto elapsed = t.elapsed<std::chrono::milliseconds>();
  std::cout << elapsed << " ms (" << sum << ")\n";
}
} // anonymous namespace

int main() {
  test_one("local static", f);
  test_one("global static", g);
}

Running on Coliru, the local version does 5e8 iterations in 4618 ms and the global version in 4392 ms, so yes, the local version is slower, by approximately 0.452 nanoseconds per iteration. The difference is measurable, but too small to affect observed performance in most situations.


EDIT: Interesting counterpoint: switching from clang++ to g++ changes the ordering of the results. The g++-compiled binary runs in 4418 ms (global) vs. 4181 ms (local), so the local version is faster by roughly 474 picoseconds per iteration. It nonetheless reaffirms the conclusion that the difference between the two approaches is small.

EDIT 2: After examining the generated assembly, I decided to switch from function pointers to function objects to get better inlining; timing indirect calls through function pointers isn't really characteristic of the code in the OP. So I used this program:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

namespace {
class timer {
    using hrc = std::chrono::high_resolution_clock;
    hrc::time_point start;

    static hrc::time_point now() {
      // Prevent memory operations from reordering across the
      // time measurement. This is likely overkill.
      std::atomic_thread_fence(std::memory_order_seq_cst);
      auto t = hrc::now();
      std::atomic_thread_fence(std::memory_order_seq_cst);
      return t;
    }

public:
    timer() : start(now()) {}

    hrc::duration elapsed() const {
      return now() - start;
    }

    template <typename Duration>
    typename Duration::rep elapsed() const {
      return std::chrono::duration_cast<Duration>(elapsed()).count();
    }

    template <typename Rep, typename Period>
    Rep elapsed() const {
      return elapsed<std::chrono::duration<Rep,Period>>();
    }
};

class f {
public:
    const std::vector<int>& operator()() {
        static const auto x = std::vector<int>{ 1, 2, 3 };
        return x;
    }
};

class g {
    static const std::vector<int> x;
public:
    const std::vector<int>& operator()() {
        return x;
    }
};

const std::vector<int> g::x{ 1, 2, 3 };

const unsigned long long n_iterations = 500000000;

template <typename F>
void test_one(const char* name, F f) {
  f(); // First call outside the timer.

  using value_type = typename std::decay<decltype(f()[0])>::type;
  std::cout << name << ": " << std::flush;

  auto t = timer{};
  auto sum = uint64_t{};
  for (auto i = n_iterations; i > 0; --i) {
    const auto& vec = f();
    sum += std::accumulate(begin(vec), end(vec), value_type{});
  }
  const auto elapsed = t.elapsed<std::chrono::milliseconds>();
  std::cout << elapsed << " ms (" << sum << ")\n";
}
} // anonymous namespace

int main() {
  test_one("local static", f());
  test_one("global static", g());
}

Not surprisingly, runtimes were faster under both g++ (3803 ms local, 2323 ms global) and clang (4183 ms local, 3253 ms global). The results affirm our intuition that the global technique should be faster than the local one, with deltas of 2.96 nanoseconds (g++) and 1.86 nanoseconds (clang) per iteration.
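If the repeated per-call check ever does matter in a hot loop, one practical option (a sketch of my own, not part of the measured code above) is to look the reference up once and reuse it, so the initialization check is paid a single time and the remaining iterations behave like the global-variable version:

#include <numeric>
#include <vector>

const std::vector<int>& f();  // the function-local-static version from above

// Sketch: hoisting the call out of the loop pays the guard check once.
long long sum_many(unsigned long long iterations) {
    const std::vector<int>& vec = f();   // single initialization check here
    long long sum = 0;
    for (unsigned long long i = 0; i < iterations; ++i) {
        sum += std::accumulate(vec.begin(), vec.end(), 0);
    }
    return sum;
}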

answered by Casey

Yes, there will be a cost to check whether the object has been initialised. This would typically test an atomic Boolean variable, rather than lock a mutex.
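For intuition, here is a rough sketch of the kind of guard check a typical implementation generates for a function-local static. It is illustrative only, not the code any particular compiler emits: real implementations go through ABI routines (e.g. __cxa_guard_acquire on the Itanium ABI), and the names storage, initialized and init_mutex below are hypothetical. The shape, however, is a double-checked pattern whose fast path on every call is a single acquire load of an atomic flag:

#include <atomic>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical expansion of: static const std::vector<int> x { 1, 2, 3 };
const std::vector<int>& f_expanded() {
    alignas(std::vector<int>) static unsigned char storage[sizeof(std::vector<int>)];
    static std::atomic<bool> initialized{false};   // the "guard" flag
    static std::mutex init_mutex;                  // used only on the slow path

    // Fast path on every call after the first: one atomic load, no mutex.
    if (!initialized.load(std::memory_order_acquire)) {
        // Slow path, taken only while the object is not yet constructed.
        std::lock_guard<std::mutex> lock(init_mutex);
        if (!initialized.load(std::memory_order_relaxed)) {
            new (storage) std::vector<int>{ 1, 2, 3 };  // construct in place
            initialized.store(true, std::memory_order_release);
        }
    }
    // Simplified for illustration; a real implementation is more careful here.
    return *reinterpret_cast<const std::vector<int>*>(storage);
}

The global-variable version has no such check at all: the function simply returns the reference, which is where the small measured difference comes from.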

answered by Mike Seymour