I recently discovered that using placement new is faster than doing 16 assignments.
Consider the following piece of code (C++11):
#include <new> // for placement new

class Matrix
{
public:
    double data[16];

    Matrix() : data{ 1, 0, 0, 0,
                     0, 1, 0, 0,
                     0, 0, 1, 0,
                     0, 0, 0, 1 } { }

    void Identity1() { new (this) Matrix(); }

    void Identity2()
    {
        data[0]  = 1.0; data[1]  = 0.0; data[2]  = 0.0; data[3]  = 0.0;
        data[4]  = 0.0; data[5]  = 1.0; data[6]  = 0.0; data[7]  = 0.0;
        data[8]  = 0.0; data[9]  = 0.0; data[10] = 1.0; data[11] = 0.0;
        data[12] = 0.0; data[13] = 0.0; data[14] = 0.0; data[15] = 1.0;
    }
};
Usage:
Matrix m;
// modify m.data
m.Identity1(); // ~25 times faster
m.Identity2();
On my machine, Identity1() is about 25 times faster than the second function. Now I'm curious why there is such a big difference.
I also tried a third one:
void Identity3() // needs <cstring> for memset
{
    memset(data, 0, sizeof(double) * 16);
    data[0]  = 1.0;
    data[5]  = 1.0;
    data[10] = 1.0;
    data[15] = 1.0;
}
But this is even slower than Identity2(), and I can't imagine why.
I have run several profiling tests to check whether this is a measurement artifact: the usual 'for loop' test, plus external profiling tools:
Profiling method 1 (the well-known for-loop test):
// needs <ctime>, <cstdio> and <cstdint>
struct timespec ts1;
struct timespec ts2;

clock_gettime(CLOCK_MONOTONIC, &ts1);

for (volatile int i = 0; i < 10000000; i++)
    m.Identity(); // use 1 or 2 here

clock_gettime(CLOCK_MONOTONIC, &ts2);

int64_t start   = (int64_t)ts1.tv_sec * 1000000000 + (int64_t)ts1.tv_nsec;
int64_t elapsed = ((int64_t)ts2.tv_sec * 1000000000 + (int64_t)ts2.tv_nsec) - start;

if (elapsed < 0)
    elapsed += (int64_t)0x100000 * 1000000000;

printf("elapsed nanos: %ld\n", elapsed);
Method 2:
$ valgrind --tool=callgrind ./testcase
$ # for a better overview:
$ python2 gprof2dot.py -f callgrind.out.22028 -e 0.0 -n 0.0 | dot -Tpng -o tree.png
As the user T.C. stated in the comments, this might be helpful: http://goo.gl/LC0RdG
Compiled with:
g++ --std=c++11 -O3 -g -pg -Wall
-pg is not the issue: I got the same time difference with measurement method 1 even without that flag.
Machine info (lscpu):

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3612QM CPU @ 2.10GHz
Stepping:              9
CPU MHz:               2889.878
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4192.97
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
Whatever your 25x time difference was measuring, it's not actually the difference between the two Identity() implementations.
With your timing code, both versions compile to exactly the same asm: an empty loop. The code you posted never uses m, so it gets optimized away; all that happens is loads/stores of the loop counter. (This happens because you used volatile int, which tells gcc that the variable may live in memory-mapped I/O space, so every read and write of it in the source must actually appear in the asm. MSVC gives the volatile keyword a different meaning, one that goes beyond what the standard says.)
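As a quick illustration of the effect (my own sketch, not from the original post), compare what gcc is allowed to do with a plain counter vs. a volatile one; compile both with -O3 and look at the asm:

// With a plain int, the whole loop can be folded away to a constant.
int count_plain() {
    int i = 0;
    for (int k = 0; k < 100; ++k) ++i;   // optimizes to "return 100"
    return i;
}

// With volatile, every ++i must be a real load and a real store.
int count_volatile() {
    volatile int i = 0;
    for (int k = 0; k < 100; ++k) ++i;   // load/add/store on each iteration
    return i;
}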
Have a look at the asm on godbolt. Here is your code, and the asm it turns into:
for (volatile int i = 0; i < 10000000; i++)
    m.Identity1();
// same output for gcc 4.8.2 through gcc 5.2.0, with -O3

# some setup before this loop: mov $0, 8(%rsp), then test that it reads back as 0
.L16:
    movl    8(%rsp), %eax
    addl    $1, %eax
    movl    %eax, 8(%rsp)
    movl    8(%rsp), %eax
    cmpl    $9999999, %eax
    jle     .L16
for (volatile int i = 0; i < 10000000; i++)
    m.Identity2();

# some setup before this loop: mov $0, 12(%rsp), then test that it reads back as 0
.L15:
    movl    12(%rsp), %eax
    addl    $1, %eax
    movl    %eax, 12(%rsp)
    movl    12(%rsp), %eax
    cmpl    $9999999, %eax
    jle     .L15
As you can see, neither one calls either version of the Identity() function.
It's interesting to see in the asm that Identity1 uses integer movq stores for the zeros, while Identity2 uses only scalar FP moves. This may have something to do with writing 0.0 vs. 0, or it may be due to placement new vs. simple assignment.
Either way, gcc 5.2.0 doesn't vectorize the Identity functions unless you use -march=native. (In that case it uses AVX 32B loads/stores to copy from 4x 32B of data. Nothing clever like byte-shifting the registers to move the 1.0 to a different spot, though :/)
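If you want to reproduce that, a command along these lines should work (hypothetical file name; -S emits the asm instead of an object file):

$ g++ --std=c++11 -O3 -march=native -S matrix.cpp -o matrix.s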
If gcc were smarter, it would do a 16B store of two zeros at a time instead of pairs of movsd. Maybe it's assuming unaligned data, where the downside of a cache-line or page-line split on an unaligned store is a lot worse than the upside of saving a store insn when it is aligned.
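For reference, here is a minimal sketch of that 16B-at-a-time zeroing written out by hand with SSE2 intrinsics (my own hypothetical Identity4 as a free function, not anything gcc emits; _mm_storeu_pd is the unaligned variant, so it's safe either way):

#include <immintrin.h>

void Identity4(double* data)
{
    const __m128d zero = _mm_setzero_pd();
    for (int i = 0; i < 16; i += 2)
        _mm_storeu_pd(data + i, zero);   // one 16B store zeroes two doubles
    data[0] = data[5] = data[10] = data[15] = 1.0;   // then write the diagonal
}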
So whatever you timed with that code, it wasn't your functions (unless one loop actually did the Identity and the other didn't). Either way, lose the volatile from your loop counter; it's totally silly. Just look at all the extra loads/stores it causes in the otherwise-empty loops.
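If you want the call to survive optimization without a volatile counter, one option (my sketch, GNU-style inline asm, so gcc/clang only) is an empty asm statement that forces the compiler to assume m is read and written on every iteration:

for (int i = 0; i < 10000000; i++) {
    m.Identity1();
    // compiler barrier: m's address escapes and memory is clobbered,
    // so the identity stores can't be hoisted out or deleted
    asm volatile("" : : "r"(&m) : "memory");
}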
I bet you get the same performance if you memcpy a constexpr array manually:
#include <algorithm>   // std::copy
#include <iterator>    // std::begin / std::end

static constexpr double identity_data[16] = {
    1, 0, 0, 0,
    0, 1, 0, 0,
    0, 0, 1, 0,
    0, 0, 0, 1
};

void Identity3()
{
    std::copy(std::begin(identity_data), std::end(identity_data), data);
}
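Since double is trivially copyable, std::copy here should lower to a plain block copy (typically a memmove/memcpy of the 128-byte table), which the compiler can then do at whatever width the target supports.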