 

Why is a placement new much faster than a direct assignment?


I recently figured out that using placement new is faster than doing 16 assignments.
Consider the following piece of code (C++11):

class Matrix
{
public:
    double data[16];

    Matrix() : data{ 1, 0, 0, 0,
                     0, 1, 0, 0,
                     0, 0, 1, 0,
                     0, 0, 0, 1 }
    {
    };

    void Identity1()
    {
        new (this) Matrix();
    };

    void Identity2()
    {
        data[0]  = 1.0; data[1]  = 0.0; data[2]  = 0.0; data[3]  = 0.0;
        data[4]  = 0.0; data[5]  = 1.0; data[6]  = 0.0; data[7]  = 0.0;
        data[8]  = 0.0; data[9]  = 0.0; data[10] = 1.0; data[11] = 0.0;
        data[12] = 0.0; data[13] = 0.0; data[14] = 0.0; data[15] = 1.0;
    };
};

Usage:

Matrix m;
//modify m.data

m.Identity1(); //~25 times faster
m.Identity2();

On my machine Identity1() is about 25 times faster than the second function. Now I'm curious: why is there such a big difference?

I also tried a third one:

void Identity3()
{
    memset(data, 0, sizeof(double) * 16);
    data[0] = 1.0;
    data[5] = 1.0;
    data[10] = 1.0;
    data[15] = 1.0;
};

But this one is even slower than Identity2(), and I can't imagine why.


Profiling information

I have done several profiling runs to check whether this is a measurement issue: besides the usual 'for loop' test, I also profiled externally:

Profiling method 1 (the well-known for-loop test):

struct timespec ts1;
struct timespec ts2;

clock_gettime(CLOCK_MONOTONIC, &ts1);

for (volatile int i = 0; i < 10000000; i++)
    m.Identity(); //use 1 or 2 here

clock_gettime(CLOCK_MONOTONIC, &ts2);

int64_t start = (int64_t)ts1.tv_sec * 1000000000 + (int64_t)ts1.tv_nsec;
int64_t elapsed = ((int64_t)ts2.tv_sec * 1000000000 + (int64_t)ts2.tv_nsec) - start;

if (elapsed < 0)
    elapsed += (int64_t)0x100000 * 1000000000;

printf("elapsed nanos: %ld\n", elapsed);

Method 2:

$ valgrind --tool=callgrind ./testcase

$ # for better overview:
$ python2 gprof2dot.py -f callgrind.out.22028 -e 0.0 -n 0.0 | dot -Tpng -o tree.png

Assembly information

As user T.C. pointed out in the comments, this might be helpful:

http://goo.gl/LC0RdG


Compilation and machine info

Compiled with: g++ --std=c++11 -O3 -g -pg -Wall

-pg is not the issue; I got the same time difference with measurement method 1 without that flag.

Machine info (lscpu):

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3612QM CPU @ 2.10GHz
Stepping:              9
CPU MHz:               2889.878
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4192.97
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
asked Aug 26 '15 by bricklore


2 Answers

Whatever your 25x time difference was measuring, it's not actually the difference between the two Identity() implementations.

With your timing code, both versions compile to exactly the same asm: an empty loop. The code you posted never uses m, so it gets optimized away. All that happens is loads/stores of the loop counter. (This happens because you used volatile int, to tell gcc that the variable is stored in memory-mapped I/O space, so all reads/writes of it appearing in the source must actually appear in the asm. MSVC has a different meaning for the volatile keyword, which goes beyond what the standard says.)

Have a look at the asm on godbolt. Here is your code, and the asm it turns into:

for (volatile int i = 0; i < 10000000; i++)
    m.Identity1(); // same output for gcc 4.8.2 through gcc 5.2.0, with -O3

# some setup before this loop:  mov $0, 8(%rsp)  then test if it reads back as 0
.L16:
    movl    8(%rsp), %eax
    addl    $1, %eax
    movl    %eax, 8(%rsp)
    movl    8(%rsp), %eax
    cmpl    $9999999, %eax
    jle     .L16

for (volatile int i = 0; i < 10000000; i++)
    m.Identity2();

# some setup before this loop:  mov $0, 12(%rsp)  then test if it reads back as 0
.L15:
    movl    12(%rsp), %eax
    addl    $1, %eax
    movl    %eax, 12(%rsp)
    movl    12(%rsp), %eax
    cmpl    $9999999, %eax
    jle     .L15

As you can see, neither one calls either version of the Identity() function.

It's interesting to see in the asm for Identity1 that it uses integer movq for assignment of zeros, while Identity2 only uses scalar FP moves. This may have something to do with using 0.0 vs. 0, or it may be due to in-place new vs. simple assignment.

Either way, I see that gcc 5.2.0 doesn't vectorize the Identity functions unless you use -march=native. (In which case it uses AVX 32B loads/stores to copy from 4x 32B of data. Nothing clever like byte-shifting the registers to move the 1.0 to a different spot :/)

If gcc were smarter, it would do a 16B store of two zeros instead of two movsd. Maybe it's assuming the data is unaligned, and the downside of a cache-line or page split on an unaligned store is a lot worse than the upside of saving a store insn when it is aligned.
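For illustration only, a hand-rolled version of that idea with SSE2 intrinsics might look roughly like this. IdentitySSE is just an illustrative name; it uses unaligned stores because the compiler can't assume 16B alignment for a plain double[16] member:

#include <emmintrin.h>   // SSE2 intrinsics

// Sketch: write the 4x4 identity with eight 16B stores.
// With guaranteed 16B alignment, _mm_store_pd would do instead.
void IdentitySSE(double* data)
{
    const __m128d zero     = _mm_setzero_pd();
    const __m128d one_zero = _mm_set_pd(0.0, 1.0);  // memory order: {1.0, 0.0}
    const __m128d zero_one = _mm_set_pd(1.0, 0.0);  // memory order: {0.0, 1.0}

    _mm_storeu_pd(data + 0,  one_zero);  // data[0..1]   = 1, 0
    _mm_storeu_pd(data + 2,  zero);      // data[2..3]   = 0, 0
    _mm_storeu_pd(data + 4,  zero_one);  // data[4..5]   = 0, 1
    _mm_storeu_pd(data + 6,  zero);      // data[6..7]   = 0, 0
    _mm_storeu_pd(data + 8,  zero);      // data[8..9]   = 0, 0
    _mm_storeu_pd(data + 10, one_zero);  // data[10..11] = 1, 0
    _mm_storeu_pd(data + 12, zero);      // data[12..13] = 0, 0
    _mm_storeu_pd(data + 14, zero_one);  // data[14..15] = 0, 1
}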


So whatever you timed with that code, it wasn't your functions. Unless one of them did the Identity, and the other didn't. Either way, lose the volatile from your loop counter; it's totally silly. Just look at the extra loads/stores it causes in the empty loops.
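For reference, a timing loop that actually measures the functions might look like this. The escape() helper is my own sketch (the usual gcc/clang empty-inline-asm trick), used here to keep the stores into m from being optimized away:

#include <cstdint>
#include <cstdio>
#include <ctime>

// Sketch: tell the optimizer that memory reachable through p may be read or
// written by code it can't see, so the stores into *p can't be dropped.
static inline void escape(void* p)
{
    asm volatile("" : : "g"(p) : "memory");
}

int main()
{
    Matrix m;
    timespec ts1, ts2;

    clock_gettime(CLOCK_MONOTONIC, &ts1);
    for (int i = 0; i < 10000000; i++) {   // plain int counter, no volatile
        m.Identity1();                     // or Identity2()
        escape(&m);                        // force the stores to really happen
    }
    clock_gettime(CLOCK_MONOTONIC, &ts2);

    int64_t elapsed = (int64_t)(ts2.tv_sec - ts1.tv_sec) * 1000000000
                    + (ts2.tv_nsec - ts1.tv_nsec);
    printf("elapsed nanos: %ld\n", (long)elapsed);
}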

answered by Peter Cordes


I bet you get the same performance if you copy a constexpr array manually:

static constexpr double identity_data[16] = { 1, 0, 0, 0,
                                              0, 1, 0, 0,
                                              0, 0, 1, 0,
                                              0, 0, 0, 1 };

void Identity3()
{
    std::copy(std::begin(identity_data), std::end(identity_data), data);
}
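A rough equivalent with std::memcpy from the same constexpr array, which optimizing compilers typically turn into the same wide loads/stores (Identity4 is just an illustrative name):

#include <cstring>

void Identity4()
{
    // Copy the whole 128-byte constant table over the member array.
    std::memcpy(data, identity_data, sizeof(identity_data));
}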
answered by Thomas B.