Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is VC++ unable to optimize an integer wrapper?

In C++, i'm trying to write a wrapper around a 64 bits integer. My expectation is that if written correctly and all methods are inlined such a wrapper should be as performant as the real type. Answer to this question on SO seems to agree with my expectation.

I wrote this code to test my expectation :

class B
{
private:
   uint64_t _v;

public:
   inline B() {};
   inline B(uint64_t v) : _v(v) {};

   inline B& operator=(B rhs) { _v = rhs._v; return *this; };
   inline B& operator+=(B rhs) { _v += rhs._v; return *this; };
   inline operator uint64_t() const { return _v; };
};

int main(int argc, char* argv[])
{
   typedef uint64_t;
   //typedef B T;
   const unsigned int x = 100000000;

   Utils::CTimer timer;
   timer.start();

   T sum = 0;
   for (unsigned int i = 0; i < 100; ++i)
   {
      for (uint64_t f = 0; f < x; ++f)
      {
         sum += f;
      }
   }

   float time = timer.GetSeconds();

   cout << sum << endl
        << time << " seconds" << endl;

   return 0;
}

When I run this with typedef B T; instead of typedef uint64_t T the reported times are consistently 10% slower when compiled with VC++. With g++ the performances are same if I use the wrapper or not.

Since g++ does it I guess there is no technical reason why VC++ can not optimise this correctly. Is there something I could do to make it optimize it?

I already tried to play with the optimisations flag with no success

like image 373
Mathieu Pagé Avatar asked Feb 04 '15 13:02

Mathieu Pagé


2 Answers

For the record, this is what g++ and clang++'s generated assembly at -O2 translates to (in both wrapper and non-wrapper cases), modulo the timing part:

sum = 499999995000000000;
cout << sum << endl;

In other words, it optimized the loop out entirely. Regardless of how hard you try to vectorize the loop, it's rather hard to beat not looping at all :)

like image 58
T.C. Avatar answered Oct 16 '22 05:10

T.C.


Using /O2 (maximize speed), both alternatives generate exactly the same assembly using Visual Studio 2012. This is your code, minus the timing and output:

00FB1000  push        ebp  
00FB1001  mov         ebp,esp  
00FB1003  and         esp,0FFFFFFF8h  
00FB1006  sub         esp,8  
00FB1009  mov         edx,64h  
00FB100E  mov         edi,edi  
00FB1010  xorps       xmm0,xmm0  
00FB1013  movlpd      qword ptr [esp],xmm0  
00FB1018  mov         ecx,dword ptr [esp+4]  
00FB101C  mov         eax,dword ptr [esp]  
00FB101F  nop  
00FB1020  add         eax,1  
00FB1023  adc         ecx,0  
00FB1026  jne         main+2Fh (0FB102Fh)  
00FB1028  cmp         eax,5F5E100h  
00FB102D  jb          main+20h (0FB1020h)  
00FB102F  dec         edx  
00FB1030  jne         main+10h (0FB1010h)  
00FB1032  xor         eax,eax

I'd presume that the measured times fluctuate or are not always correct.

like image 28
Daerst Avatar answered Oct 16 '22 06:10

Daerst