In C++, I'm trying to write a wrapper around a 64-bit integer. My expectation is that, if it is written correctly and all methods are inlined, such a wrapper should be as performant as the underlying type. The answer to this question on SO seems to agree with my expectation.
I wrote this code to test my expectation:
class B
{
private:
   uint64_t _v;
public:
   inline B() {};
   inline B(uint64_t v) : _v(v) {};
   inline B& operator=(B rhs) { _v = rhs._v; return *this; };
   inline B& operator+=(B rhs) { _v += rhs._v; return *this; };
   inline operator uint64_t() const { return _v; };
};
int main(int argc, char* argv[])
{
   typedef uint64_t T;
   //typedef B T;
   const unsigned int x = 100000000;
   Utils::CTimer timer;
   timer.start();
   T sum = 0;
   for (unsigned int i = 0; i < 100; ++i)
   {
      for (uint64_t f = 0; f < x; ++f)
      {
         sum += f;
      }
   }
   float time = timer.GetSeconds();
   cout << sum << endl
        << time << " seconds" << endl;
   return 0;
}
When I run this with typedef B T; instead of typedef uint64_t T; the reported times are consistently 10% slower when compiled with VC++. With g++, the performance is the same whether I use the wrapper or not.
Since g++ does it, I guess there is no technical reason why VC++ cannot optimise this correctly. Is there something I could do to make it optimise it?
I already tried playing with the optimisation flags, with no success.
For the record, this is what g++ and clang++'s generated assembly at -O2 translates to (in both wrapper and non-wrapper cases), modulo the timing part:
sum = 499999995000000000;
cout << sum << endl;
In other words, it optimized the loop out entirely. Regardless of how hard you try to vectorize the loop, it's rather hard to beat not looping at all :)
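To see where that constant comes from, here is a minimal sketch comparing the nested loops with the closed form reps * x * (x - 1) / 2. The helper names loop_sum and closed_form_sum are made up for illustration, and this is not a claim about how either compiler actually derives the value.

```cpp
#include <cassert>
#include <cstdint>

// Brute-force version: the same nested loops as in the question.
std::uint64_t loop_sum(std::uint64_t x, unsigned int reps)
{
    std::uint64_t sum = 0;
    for (unsigned int i = 0; i < reps; ++i)
        for (std::uint64_t f = 0; f < x; ++f)
            sum += f;
    return sum;
}

// Closed form: reps * (0 + 1 + ... + (x - 1)) = reps * x * (x - 1) / 2.
// Hypothetical helper; assumes the final result fits in 64 bits.
std::uint64_t closed_form_sum(std::uint64_t x, unsigned int reps)
{
    assert(x > 0);
    // Exactly one of x and x - 1 is even; halve that one so the division is exact.
    std::uint64_t one_pass = (x % 2 == 0) ? (x / 2) * (x - 1)
                                          : x * ((x - 1) / 2);
    return one_pass * reps;
}
```

With x = 100000000 and reps = 100, closed_form_sum gives 499999995000000000, the constant shown above.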
Using /O2 (maximize speed), both alternatives generate exactly the same assembly using Visual Studio 2012. This is your code, minus the timing and output:
00FB1000  push        ebp  
00FB1001  mov         ebp,esp  
00FB1003  and         esp,0FFFFFFF8h  
00FB1006  sub         esp,8  
00FB1009  mov         edx,64h  
00FB100E  mov         edi,edi  
00FB1010  xorps       xmm0,xmm0  
00FB1013  movlpd      qword ptr [esp],xmm0  
00FB1018  mov         ecx,dword ptr [esp+4]  
00FB101C  mov         eax,dword ptr [esp]  
00FB101F  nop  
00FB1020  add         eax,1  
00FB1023  adc         ecx,0  
00FB1026  jne         main+2Fh (0FB102Fh)  
00FB1028  cmp         eax,5F5E100h  
00FB102D  jb          main+20h (0FB1020h)  
00FB102F  dec         edx  
00FB1030  jne         main+10h (0FB1010h)  
00FB1032  xor         eax,eax
Since the generated code is identical, I'd presume that the measured times fluctuate or are not always accurate.
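One way to reduce the influence of such fluctuations is to repeat the measurement and keep the best time. The following is only a sketch under assumptions: it uses std::chrono::steady_clock instead of the question's Utils::CTimer, the helper names accumulate_to and best_time_seconds are hypothetical, and the wrapper is the one from the question with _v initialized in the default constructor.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Wrapper from the question (default constructor zero-initializes _v here).
class B
{
private:
   std::uint64_t _v;
public:
   B() : _v(0) {}
   B(std::uint64_t v) : _v(v) {}
   B& operator=(B rhs) { _v = rhs._v; return *this; }
   B& operator+=(B rhs) { _v += rhs._v; return *this; }
   operator std::uint64_t() const { return _v; }
};

// Inner loop of the benchmark, parameterized on the accumulator type.
template <typename T>
std::uint64_t accumulate_to(std::uint64_t x)
{
   T sum = 0;
   for (std::uint64_t f = 0; f < x; ++f)
      sum += f;
   return sum;
}

// Runs the loop `runs` times and returns the fastest run in seconds.
// Taking the minimum filters out runs disturbed by other system activity.
template <typename T>
double best_time_seconds(std::uint64_t x, int runs)
{
   using clock = std::chrono::steady_clock;
   double best = std::numeric_limits<double>::infinity();
   volatile std::uint64_t sink = 0;  // keeps the result observable
   for (int r = 0; r < runs; ++r)
   {
      const auto t0 = clock::now();
      sink = accumulate_to<T>(x);
      const auto t1 = clock::now();
      best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
   }
   (void)sink;
   return best;
}
```

Comparing best_time_seconds<std::uint64_t>(x, runs) with best_time_seconds<B>(x, runs) over several runs should show whether a 10% gap is stable. Note that the volatile sink only keeps the result alive; an optimizer may still replace the loop with a precomputed value, in which case both times will be close to zero.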