Poor performance of vector in 64-bit target with VS2012

Question

Benchmarking this class:

struct Sieve {
  std::vector<bool> isPrime;

  Sieve (int n = 1) {
    isPrime.assign (n+1, true);
    isPrime[0] = isPrime[1] = false;

    for (int i = 2; i <= (int)sqrt((double)n); ++i)
      if (isPrime[i]) 
        for (int j = i*i; j <= n; j += i)
          isPrime[j] = false;
  }
};

I'm getting over 3 times worse performance (CPU time) with 64-bit binary vs. 32-bit version (release build) when calling a constructor for a large number, e.g.

Sieve s(100000000);

I tested sizeof(bool) and it is 1 for both versions. When I substitute vector<bool> with vector<char> performance becomes the same for 64-bit and 32-bit versions. Why is that?

Here are the run times for S(100000000) (release mode, 32-bit first, 64-bit second)):

vector<bool> 0.97s 3.12s vector<char> 0.99s 0.99s vector<int> 1.57s 1.59s

I also did a sanity test with VS2010 (prompted by Wouter Huysentruit's response), which produced 0.98s 0.88s. So there is something wrong with VS2012 implementation.

I submitted a bug report to Microsoft Connect

EDIT

Many answers below comment on deficiencies of using int for indexing. This may be true, but even the Great Wizard himself is using a standard for (int i = 0; i < v.size(); ++i) in his books, so such a pattern should not incur a significant performance penalty. Additionally, this issue was raised during Going Native 2013 conference and the presiding group of C++ gurus commented on their early recommendations of using size_t for indexing and as a return type of size() as a mistake. They said: "we are sorry, we were young..."

The title of this question could be rephrased to: Over 3 times performance drop on this code when upgrading from VS2010 to VS2012.

EDIT

I made a crude attempt at finding memory alignment of indexes i and j and discovered that this instrumented version:

struct Sieve {
  vector<bool> isPrime;

  Sieve (int n = 1) {
    isPrime.assign (n+1, true);
    isPrime[0] = isPrime[1] = false;

    for (int i = 2; i <= sqrt((double)n); ++i) {
      if (i == 17) cout << ((int)&i)%16 << endl;
      if (isPrime[i]) 
        for (int j = i*i; j <= n; j += i) {
          if (j == 4) cout << ((int)&j)%16 << endl;
          isPrime[j] = false;
        }
    }
  }
};

auto-magically runs fast now (only 10% slower than 32-bit version). This and VS2010 performance makes it hard to accept a theory of optimizer having inherent problems dealing with int indexes instead of size_t.

James McNellis · Accepted Answer

std::vector<bool> is not directly at fault here. The performance difference is ultimately caused by your use of the signed 32-bit int type in your loops and some rather poor register allocation by the compiler. Consider, for example, your innermost loop:

for (int j = i*i; j <= n; j += i)
    isPrime[j] = false;

Here, j is a 32-bit signed integer. When it is used in isPrime[j], however, it must be promoted (and sign-extended) to a 64-bit integer, in order to perform the subscript computation. The compiler can't just treat j as a 64-bit value, because that would change the behavior of the loop (e.g. if n is negative). The compiler also can't perform the index computation using the 32-bit quantity j, because that would change the behavior of that expression (e.g. if j is negative).

So, the compiler needs to generate code for the loop using a 32-bit j then it must generate code to convert that j to a 64-bit integer for the subscript computation. It has to do the same thing for the outer loop with i. Unfortunately, it looks like the compiler allocates registers rather poorly in this case^(*)--it starts spilling temporaries to the stack, causing the performance hit you see.

If you change your program to use size_t everywhere (which is 32-bit on x86 and 64-bit on x64), you will observe that the performance is on par with x86, because the generated code needs only work with values of one size:

Sieve (size_t n = 1)
{
    isPrime.assign (n+1, true);
    isPrime[0] = isPrime[1] = false;

    for (size_t i = 2; i <= static_cast<size_t>(sqrt((double)n)); ++i)
        if (isPrime[i]) 
            for (size_t j = i*i; j <= n; j += i)
                isPrime[j] = false;
}

You should do this anyway, because mixing signed and unsigned types, especially when those types are of different widths, is perilous and can lead to unexpected errors.

Note that using std::vector<char> also "solves" the problem, but for a different reason: the subscript computation required for accessing an element of a std::vector<char> is substantially simpler than that for accessing an element of std::vector<bool>. The optimizer is able to generate better code for the simpler computations.

(*) I don't work on code generation, and I'm hardly an expert in either assembly or low-level performance optimization, but from looking at the generated code, and given that it is reported here that Visual C++ 2010 generates better code, I'd guess that there are opportunities for improvement in the compiler. I'll make sure the Connect bug you opened gets forwarded on to the compiler team so they can take a look.

huysentruitw · Answer

I've tested this with vector<bool> in VS2010: 32-bit needs 1452ms while 64-bit needs 1264ms to complete on a i3.

The same test in VS2012 (on i7 this time) needs 700ms (32-bit) and 2730ms (64-bit), so there is something wrong with the compiler in VS2012. Maybe you can report this test case as a bug to Microsoft.

UPDATE

The problem is that the VS2012 compiler uses a temporary stack variable for a part of the code in the inner for-loop when using int as iterator. The assembly parts listed below are part of the code inside <vector>, in the += operator of the std::vector<bool>::iterator.

size_t as iterator

When using size_t as iterator, a part of the code looks like this:

or  rax, -1
sub rax, rdx
shr rax, 5
lea rax, QWORD PTR [rax*4+4]
sub r8, rax

Here, all instructions use CPU registers which are very fast.

int as iterator

When using int as iterator, that same part looks like this:

or  rcx, -1
sub rcx, r8
shr rcx, 5
shl rcx, 2
mov rax, -4
sub rax, rcx
mov rdx, QWORD PTR _Tmp$6[rsp]
add rdx, rax

Here you see the _Tmp$6 stack variable being used, which causes the slowdown.

Point compiler into the right direction

The funny part is that you can point the compiler into the right direction by using the vector<bool>::iterator directly.

struct Sieve {
  std::vector<bool> isPrime;

  Sieve (int n = 1) {
    isPrime.assign(n + 1, true);

    std::vector<bool>::iterator it1 = isPrime.begin();
    std::vector<bool>::iterator end = it1 + n;
    *it1++ = false;
    *it1++ = false;
    for (int i = 2; i <= (int)sqrt((double)n); ++it1, ++i)
        if (*it1)
            for (std::vector<bool>::iterator it2 = isPrime.begin() + i * i; it2 <= end; it2 += i)
                *it2 = false;
  }
};

Poor performance of vector<bool> in 64-bit target with VS2012

Tags:

c++

performance

boolean

vector

Paul Jurczak

2 Answers

James McNellis

huysentruitw

Recent Activity

Donate For Us