
C++ vs Java? Why does the ICC generate slower code than VC? [closed]

The following is a simple loop in C++. The timer uses QueryPerformanceCounter() and is quite accurate. I found that Java takes 60% of the time C++ takes, and this can't be right?! What am I doing wrong here? Even strict aliasing (which is not included in the code here) doesn't help at all...

    long long var = 0;
    std::array<int, 1024> arr;
    int* arrPtr = arr.data();
    CHighPrecisionTimer timer;

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;

    timer.Start();

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
            var += arrPtr[x];
        }
    }

    timer.Stop();

    printf("Unrestricted: %lld us, Value = %lld\n", (Int64)timer.GetElapsed().GetMicros(), var);

This C++ runs through in about 9.5 seconds. I am using the Intel Compiler 12.1 with host processor optimization (specifically for mine) and everything maxed. So this is Intel Compiler at its best! Auto-Parallelization funnily consumes 70% CPU instead of 25% but doesn't get the job done any faster ;)...
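For anyone reproducing this without the CHighPrecisionTimer class (which is not shown here), a minimal portable sketch of the same measurement could use std::chrono instead. The helper names sumRepeated and elapsedMicros are made up for illustration and are not part of the original code.

```cpp
#include <chrono>

// Hypothetical helper: sums `count` ints `reps` times, like the nested loop above.
long long sumRepeated(const int* data, int count, int reps) {
    long long var = 0;
    for (int i = 0; i < reps; i++) {
        for (int x = 0; x < count; x++) {
            var += data[x];
        }
    }
    return var;
}

// Hypothetical helper: runs `fn` once and returns the elapsed time in microseconds.
template <class Fn>
long long elapsedMicros(Fn fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling elapsedMicros([&] { var = sumRepeated(arrPtr, 1024, 1024 * 1024 * 10); }) and printing var afterwards mirrors the original test. Printing the result matters, since a compiler may otherwise drop a loop whose value is never used.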

Now I use the following Java code for comparison:

    long var = 0;
    int[] arr = new int[1024];

    for(int i = 0; i < 1024; i++) arr[i] = i;

    for(int i = 0; i < 1024 * 1024; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

    long nanos = System.nanoTime();

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

    nanos = (System.nanoTime() - nanos) / 1000;

    System.out.print("Value: " + var + ", Time: " + nanos);

The Java code is invoked with aggressive optimization and the server VM (no debug). It runs in about 7 seconds on my machine (only uses one thread).

Is this a failure of the Intel Compiler or am I just too dumb again?

[EDIT]: Ok, now here's the thing... This seems more like a bug in the Intel compiler ^^. [Please note that I am running on the Intel Quadcore Q6600, which is rather old. It might be that the Intel Compiler performs way better on recent CPUs, like the Core i7.]

  • Intel x86 (without vectorization): 3 seconds
  • MSVC x64: 5 seconds
  • Java x86/x64 (Oracle Java 7): 7 seconds
  • Intel x64 (with vectorization): 9.5 seconds
  • Intel x86 (with vectorization): 9.5 seconds
  • Intel x64 (without vectorization): 12 seconds
  • MSVC x86: 15 seconds (uhh)

[EDIT]: Another nice case ;). Consider the following trivial lambda expression

    #include <stdio.h>
    #include <tchar.h>
    #include <Windows.h>
    #include <vector>
    #include <boost/function.hpp>
    #include <boost/lambda/bind.hpp>
    #include <boost/typeof/typeof.hpp>

    template<class TValue>
    struct ArrayList
    {
    private:
        std::vector<TValue> m_Entries;
    public:

        template<class TCallback>
        void Foreach(TCallback inCallback)
        {
            for(int i = 0, size = m_Entries.size(); i < size; i++)
            {
                inCallback(i);
            }
        }

        void Add(TValue inValue)
        {
            m_Entries.push_back(inValue);
        }
    };

    int _tmain(int argc, _TCHAR* argv[])
    {
        auto t = [&]() {};

        ArrayList<int> arr;
        int res = 0;

        for(int i = 0; i < 100; i++)
        {
            arr.Add(i);
        }

        long long freq, t1, t2;

        QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
        QueryPerformanceCounter((LARGE_INTEGER*)&t1);

        for(int i = 0; i < 1000 * 1000 * 10; i++)
        {
            arr.Foreach([&](int v) {
                res += i;
            });
        }

        QueryPerformanceCounter((LARGE_INTEGER*)&t2);

        printf("Time: %lld\n", ((t2-t1) * 1000000) / freq);

        if(res == 4950)
            return -1;

        return 0;
    }

Intel compiler shines again:

  • MSVC x86/x64: 12 milliseconds
  • Intel x86/x64: 1 second

Uhm?! Well, I guess 90 times slower is not a bad thing...

I am not really sure anymore that this applies, but based on an answer to this thread: the Intel compiler is known to have terrible performance on processors which are not "known" to the compiler, like AMD processors, and maybe even outdated Intel processors like mine. (I knew that too, but it just didn't occur to me that they could drop support for their own processors.) So if someone with a recent Intel processor could try this out, it would be nice ;).

Here is the x64 output of the Intel Compiler:

        std::array<int, 1024> arr;
        int* arrPtr = arr.data();
        QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    000000013F05101D  lea         rcx,[freq]
    000000013F051022  call        qword ptr [__imp_QueryPerformanceFrequency (13F052000h)]

        for(int i = 0; i < 1024; i++) arrPtr[i] = i;
    000000013F051028  mov         eax,4
    000000013F05102D  movd        xmm0,eax
    000000013F051031  xor         eax,eax
    000000013F051033  pshufd      xmm1,xmm0,0
    000000013F051038  movdqa      xmm0,xmmword ptr [__xi_z+28h (13F0521A0h)]
    000000013F051040  movdqa      xmmword ptr arr[rax*4],xmm0
    000000013F051046  paddd       xmm0,xmm1
    000000013F05104A  movdqa      xmmword ptr [rsp+rax*4+60h],xmm0
    000000013F051050  paddd       xmm0,xmm1
    000000013F051054  movdqa      xmmword ptr [rsp+rax*4+70h],xmm0
    000000013F05105A  paddd       xmm0,xmm1
    000000013F05105E  movdqa      xmmword ptr [rsp+rax*4+80h],xmm0
    000000013F051067  add         rax,10h
    000000013F05106B  paddd       xmm0,xmm1
    000000013F05106F  cmp         rax,400h
    000000013F051075  jb          wmain+40h (13F051040h)

        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    000000013F051077  lea         rcx,[t1]
    000000013F05107C  call        qword ptr [__imp_QueryPerformanceCounter (13F052008h)]
                var += arrPtr[x];
    000000013F051082  movdqa      xmm1,xmmword ptr [__xi_z+38h (13F0521B0h)]

        for(int i = 0; i < 1024 * 1024 * 10; i++){
    000000013F05108A  xor         eax,eax
                var += arrPtr[x];
    000000013F05108C  movdqa      xmm0,xmmword ptr [__xi_z+48h (13F0521C0h)]
        long long var = 0, freq, t1, t2;
    000000013F051094  pxor        xmm6,xmm6
            for(int x = 0; x < 1024; x++){
    000000013F051098  xor         r8d,r8d
                var += arrPtr[x];
    000000013F05109B  lea         rdx,[arr]
    000000013F0510A0  xor         ecx,ecx
    000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]
            for(int x = 0; x < 1024; x++){
    000000013F0510A8  add         r8,8
                var += arrPtr[x];
    000000013F0510AC  punpckldq   xmm2,xmm2
            for(int x = 0; x < 1024; x++){
    000000013F0510B0  add         rcx,20h
                var += arrPtr[x];
    000000013F0510B4  movdqa      xmm3,xmm2
    000000013F0510B8  pand        xmm2,xmm0
    000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]
    000000013F0510C1  psrad       xmm3,1Fh
    000000013F0510C6  punpckldq   xmm4,xmm4
    000000013F0510CA  pand        xmm3,xmm1
    000000013F0510CE  por         xmm3,xmm2
    000000013F0510D2  movdqa      xmm5,xmm4
    000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]
    000000013F0510DB  psrad       xmm5,1Fh
    000000013F0510E0  punpckldq   xmm2,xmm2
    000000013F0510E4  pand        xmm5,xmm1
    000000013F0510E8  paddq       xmm6,xmm3
    000000013F0510EC  pand        xmm4,xmm0
    000000013F0510F0  movdqa      xmm3,xmm2
    000000013F0510F4  por         xmm5,xmm4
    000000013F0510F8  psrad       xmm3,1Fh
    000000013F0510FD  movq        xmm4,mmword ptr [rdx+18h]
    000000013F051102  pand        xmm3,xmm1
    000000013F051106  punpckldq   xmm4,xmm4
    000000013F05110A  pand        xmm2,xmm0
    000000013F05110E  por         xmm3,xmm2
    000000013F051112  movdqa      xmm2,xmm4
    000000013F051116  paddq       xmm6,xmm5
    000000013F05111A  psrad       xmm2,1Fh
    000000013F05111F  pand        xmm4,xmm0
    000000013F051123  pand        xmm2,xmm1
            for(int x = 0; x < 1024; x++){
    000000013F051127  add         rdx,20h
                var += arrPtr[x];
    000000013F05112B  paddq       xmm6,xmm3
    000000013F05112F  por         xmm2,xmm4
            for(int x = 0; x < 1024; x++){
    000000013F051133  cmp         r8,400h
                var += arrPtr[x];
    000000013F05113A  paddq       xmm6,xmm2
            for(int x = 0; x < 1024; x++){
    000000013F05113E  jb          wmain+0A2h (13F0510A2h)

        for(int i = 0; i < 1024 * 1024 * 10; i++){
    000000013F051144  inc         eax
    000000013F051146  cmp         eax,0A00000h
    000000013F05114B  jb          wmain+98h (13F051098h)
            }
        }

        QueryPerformanceCounter((LARGE_INTEGER*)&t2);
    000000013F051151  lea         rcx,[t2]
    000000013F051156  call        qword ptr [__imp_QueryPerformanceCounter (13F052008h)]

        printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
    000000013F05115C  mov         r9,qword ptr [t2]
        long long var = 0, freq, t1, t2;
    000000013F051161  movdqa      xmm0,xmm6

        printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
    000000013F051165  sub         r9,qword ptr [t1]
    000000013F05116A  lea         rcx,[string "Unrestricted: %lld ms, Value = %"... (13F0521D0h)]
    000000013F051171  imul        rax,r9,3E8h
    000000013F051178  cqo
    000000013F05117A  mov         r10,qword ptr [freq]
    000000013F05117F  idiv        rax,r10
        long long var = 0, freq, t1, t2;
    000000013F051182  psrldq      xmm0,8

        printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
    000000013F051187  mov         rdx,rax
        long long var = 0, freq, t1, t2;
    000000013F05118A  paddq       xmm6,xmm0
    000000013F05118E  movd        r8,xmm6

        printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
    000000013F051193  call        qword ptr [__imp_printf (13F052108h)]

And this one is the assembly of the MSVC x64 build:

    int _tmain(int argc, _TCHAR* argv[]) {
    000000013FF61000  push        rbx
    000000013FF61002  mov         eax,1050h
    000000013FF61007  call        __chkstk (13FF61950h)
    000000013FF6100C  sub         rsp,rax
    000000013FF6100F  mov         rax,qword ptr [__security_cookie (13FF63000h)]
    000000013FF61016  xor         rax,rsp
    000000013FF61019  mov         qword ptr [rsp+1040h],rax
        long long var = 0, freq, t1, t2;
        std::array<int, 1024> arr;
        int* arrPtr = arr.data();
        QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    000000013FF61021  lea         rcx,[rsp+28h]
    000000013FF61026  xor         ebx,ebx
    000000013FF61028  call        qword ptr [__imp_QueryPerformanceFrequency (13FF62000h)]

        for(int i = 0; i < 1024; i++) arrPtr[i] = i;
    000000013FF6102E  xor         r11d,r11d
    000000013FF61031  lea         rax,[rsp+40h]
    000000013FF61036  mov         dword ptr [rax],r11d
    000000013FF61039  inc         r11d
    000000013FF6103C  add         rax,4
    000000013FF61040  cmp         r11d,400h
    000000013FF61047  jl          wmain+36h (13FF61036h)

        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    000000013FF61049  lea         rcx,[rsp+20h]
    000000013FF6104E  call        qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]
    000000013FF61054  mov         r11d,0A00000h
    000000013FF6105A  nop         word ptr [rax+rax]

        for(int i = 0; i < 1024 * 1024 * 10; i++){
            for(int x = 0; x < 1024; x++){
    000000013FF61060  xor         edx,edx
    000000013FF61062  xor         r8d,r8d
    000000013FF61065  lea         rcx,[rsp+48h]
    000000013FF6106A  xor         r9d,r9d
    000000013FF6106D  mov         r10d,100h
    000000013FF61073  nop         word ptr [rax+rax]
                var += arrPtr[x];
    000000013FF61080  movsxd      rax,dword ptr [rcx-8]
    000000013FF61084  add         rcx,10h
    000000013FF61088  add         rbx,rax
    000000013FF6108B  movsxd      rax,dword ptr [rcx-14h]
    000000013FF6108F  add         r9,rax
    000000013FF61092  movsxd      rax,dword ptr [rcx-10h]
    000000013FF61096  add         r8,rax
    000000013FF61099  movsxd      rax,dword ptr [rcx-0Ch]
    000000013FF6109D  add         rdx,rax
    000000013FF610A0  dec         r10
    000000013FF610A3  jne         wmain+80h (13FF61080h)

        for(int i = 0; i < 1024 * 1024 * 10; i++){
            for(int x = 0; x < 1024; x++){
    000000013FF610A5  lea         rax,[rdx+r8]
    000000013FF610A9  add         rax,r9
    000000013FF610AC  add         rbx,rax
    000000013FF610AF  dec         r11
    000000013FF610B2  jne         wmain+60h (13FF61060h)
            }
        }

        QueryPerformanceCounter((LARGE_INTEGER*)&t2);
    000000013FF610B4  lea         rcx,[rsp+30h]
    000000013FF610B9  call        qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]

        printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
    000000013FF610BF  mov         rax,qword ptr [rsp+30h]
    000000013FF610C4  lea         rcx,[string "Unrestricted: %lld ms, Value = %"... (13FF621B0h)]
    000000013FF610CB  sub         rax,qword ptr [rsp+20h]
    000000013FF610D0  mov         r8,rbx
    000000013FF610D3  imul        rax,rax,3E8h
    000000013FF610DA  cqo
    000000013FF610DC  idiv        rax,qword ptr [rsp+28h]
    000000013FF610E1  mov         rdx,rax
    000000013FF610E4  call        qword ptr [__imp_printf (13FF62138h)]

        return 0;
    000000013FF610EA  xor         eax,eax

Intel Compiler configured without Vectorization, 64-Bit, highest optimizations (this is surprisingly slow, 12 seconds):

    000000013FC0102F  lea         rcx,[freq]

        double var = 0; long long freq, t1, t2;
    000000013FC01034  xorps       xmm6,xmm6
        std::array<double, 1024> arr;
        double* arrPtr = arr.data();
        QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    000000013FC01037  call        qword ptr [__imp_QueryPerformanceFrequency (13FC02000h)]

        for(int i = 0; i < 1024; i++) arrPtr[i] = i;
    000000013FC0103D  mov         eax,2
    000000013FC01042  mov         rdx,100000000h
    000000013FC0104C  movd        xmm0,eax
    000000013FC01050  xor         eax,eax
    000000013FC01052  pshufd      xmm1,xmm0,0
    000000013FC01057  movd        xmm0,rdx
    000000013FC0105C  nop         dword ptr [rax]
    000000013FC01060  cvtdq2pd    xmm2,xmm0
    000000013FC01064  paddd       xmm0,xmm1
    000000013FC01068  cvtdq2pd    xmm3,xmm0
    000000013FC0106C  paddd       xmm0,xmm1
    000000013FC01070  cvtdq2pd    xmm4,xmm0
    000000013FC01074  paddd       xmm0,xmm1
    000000013FC01078  cvtdq2pd    xmm5,xmm0
    000000013FC0107C  movaps      xmmword ptr arr[rax*8],xmm2
    000000013FC01081  paddd       xmm0,xmm1
    000000013FC01085  movaps      xmmword ptr [rsp+rax*8+60h],xmm3
    000000013FC0108A  movaps      xmmword ptr [rsp+rax*8+70h],xmm4
    000000013FC0108F  movaps      xmmword ptr [rsp+rax*8+80h],xmm5
    000000013FC01097  add         rax,8
    000000013FC0109B  cmp         rax,400h
    000000013FC010A1  jb          wmain+60h (13FC01060h)

        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    000000013FC010A3  lea         rcx,[t1]
    000000013FC010A8  call        qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]

        for(int i = 0; i < 1024 * 1024 * 10; i++){
    000000013FC010AE  xor         eax,eax
            for(int x = 0; x < 1024; x++){
    000000013FC010B0  xor         edx,edx
                var += arrPtr[x];
    000000013FC010B2  lea         ecx,[rdx+rdx]
            for(int x = 0; x < 1024; x++){
    000000013FC010B5  inc         edx
            for(int x = 0; x < 1024; x++){
    000000013FC010B7  cmp         edx,200h
                var += arrPtr[x];
    000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]
    000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]
            for(int x = 0; x < 1024; x++){
    000000013FC010C9  jb          wmain+0B2h (13FC010B2h)

        for(int i = 0; i < 1024 * 1024 * 10; i++){
    000000013FC010CB  inc         eax
    000000013FC010CD  cmp         eax,0A00000h
    000000013FC010D2  jb          wmain+0B0h (13FC010B0h)
            }
        }

        QueryPerformanceCounter((LARGE_INTEGER*)&t2);
    000000013FC010D4  lea         rcx,[t2]
    000000013FC010D9  call        qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]

Intel Compiler without vectorization, 32-Bit and highest optimization (this one clearly is the winner now, runs in about 3 seconds and the assembly looks much better):

    00B81088  lea         eax,[t1]
    00B8108C  push        eax
    00B8108D  call        dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]
    00B81093  xor         eax,eax
    00B81095  pxor        xmm0,xmm0
    00B81099  movaps      xmm1,xmm0
            for(int x = 0; x < 1024; x++){
    00B8109C  xor         edx,edx
                var += arrPtr[x];
    00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]
    00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]
    00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]
    00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]
            for(int x = 0; x < 1024; x++){
    00B810B6  add         edx,8
    00B810B9  cmp         edx,400h
    00B810BF  jb          wmain+9Eh (0B8109Eh)

        for(int i = 0; i < 1024 * 1024 * 10; i++){
    00B810C1  inc         eax
    00B810C2  cmp         eax,0A00000h
    00B810C7  jb          wmain+9Ch (0B8109Ch)

        double var = 0; long long freq, t1, t2;
    00B810C9  addpd       xmm0,xmm1
            }
        }

        QueryPerformanceCounter((LARGE_INTEGER*)&t2);
    00B810CD  lea         eax,[t2]
    00B810D1  push        eax
    00B810D2  movaps      xmmword ptr [esp+4],xmm0
    00B810D7  call        dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]
    00B810DD  movaps      xmm0,xmmword ptr [esp]
asked Jan 19 '12 01:01 by thesaint

1 Answer

tl;dr: What you're seeing here seems to be ICC's failed attempt at vectorizing the loop.

Let's start with MSVC x64:

Here's the critical loop:

    $LL3@main:
        movsxd  rax, DWORD PTR [rdx-4]
        movsxd  rcx, DWORD PTR [rdx-8]
        add     rdx, 16
        add     r10, rax
        movsxd  rax, DWORD PTR [rdx-16]
        add     rbx, rcx
        add     r9, rax
        movsxd  rax, DWORD PTR [rdx-12]
        add     r8, rax
        dec     r11
        jne     SHORT $LL3@main

What you see here is standard loop unrolling by the compiler. MSVC unrolls by 4 iterations and splits the var variable across four registers: r10, rbx, r9, and r8. At the end of the loop, these four registers are summed back together.

Here's where the 4 sums are recombined:

        lea     rax, QWORD PTR [r8+r9]
        add     rax, r10
        add     rbx, rax
        dec     rdi
        jne     SHORT $LL6@main
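In source terms, what MSVC does here is roughly equivalent to the following hand-unrolled loop. This is only a sketch to illustrate the transformation: the function name sumUnrolled4 is made up, and it assumes the element count is a multiple of 4 (which holds for the 1024-element array in the question).

```cpp
// Sum with four independent partial sums, recombined once at the end.
// Assumes `count` is a multiple of 4.
long long sumUnrolled4(const int* data, int count) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int x = 0; x < count; x += 4) {
        s0 += data[x];      // each int is sign-extended to 64 bits (movsxd)
        s1 += data[x + 1];
        s2 += data[x + 2];
        s3 += data[x + 3];
    }
    // The four partial sums are added back together after the loop.
    return (s0 + s1) + (s2 + s3);
}
```

Because the four partial sums do not depend on each other, the processor can overlap the additions instead of waiting on a single running total.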

Note that MSVC currently does not do automatic vectorization.


Now let's look at part of your ICC output:

    000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]
    000000013F0510A8  add         r8,8
    000000013F0510AC  punpckldq   xmm2,xmm2
    000000013F0510B0  add         rcx,20h
    000000013F0510B4  movdqa      xmm3,xmm2
    000000013F0510B8  pand        xmm2,xmm0
    000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]
    000000013F0510C1  psrad       xmm3,1Fh
    000000013F0510C6  punpckldq   xmm4,xmm4
    000000013F0510CA  pand        xmm3,xmm1
    000000013F0510CE  por         xmm3,xmm2
    000000013F0510D2  movdqa      xmm5,xmm4
    000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]
    000000013F0510DB  psrad       xmm5,1Fh
    000000013F0510E0  punpckldq   xmm2,xmm2
    000000013F0510E4  pand        xmm5,xmm1
    000000013F0510E8  paddq       xmm6,xmm3

    ...

What you're seeing here is an attempt by ICC to vectorize this loop. This is done in a similar manner to what MSVC did (splitting into multiple sums), but using SSE registers instead, with two sums per register.

But it turns out that the overhead of vectorization happens to outweigh the benefits of vectorizing.

If we walk these instructions down one-by-one, we can see how ICC tries to vectorize it:

    //  Load two ints using a 64-bit load.  {x, y, 0, 0}
    movq        xmm2,mmword ptr arr[rcx]

    //  Shuffle the data into this form.
    punpckldq   xmm2,xmm2           xmm2 = {x, x, y, y}
    movdqa      xmm3,xmm2           xmm3 = {x, x, y, y}

    //  Mask out index 1 and 3.
    pand        xmm2,xmm0           xmm2 = {x, 0, y, 0}

    //  Arithmetic right-shift to copy sign-bit across the word.
    psrad       xmm3,1Fh            xmm3 = {sign(x), sign(x), sign(y), sign(y)}

    //  Mask out index 0 and 2.
    pand        xmm3,xmm1           xmm3 = {0, sign(x), 0, sign(y)}

    //  Combine to get sign-extended values.
    por         xmm3,xmm2           xmm3 = {x, sign(x), y, sign(y)}
                                    xmm3 = {x, y}

    //  Add to accumulator...
    paddq       xmm6,xmm3

So it's doing some very messy unpacking just to vectorize. The mess comes from needing to sign-extend the 32-bit integers to 64-bit using only SSE instructions.

SSE4.1 actually provides the PMOVSXDQ instruction for this purpose. But either the target machine doesn't support SSE4.1, or ICC isn't smart enough to use it in this case.
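To make the cost of that sequence concrete, here is a sketch of the same sign-extension written with SSE2 intrinsics. It mirrors the instruction sequence walked through above rather than anything ICC exposes. The function names signExtendPairSse2 and sumSse2 are made up for illustration, the code assumes an x86 target with SSE2, and sumSse2 assumes an even element count.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Sign-extend two 32-bit ints at `p` into two 64-bit lanes using only SSE2.
static __m128i signExtendPairSse2(const int* p) {
    const __m128i lowMask  = _mm_set_epi32(0, -1, 0, -1);  // keeps dwords 0 and 2
    const __m128i highMask = _mm_set_epi32(-1, 0, -1, 0);  // keeps dwords 1 and 3
    __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p)); // {x, y, 0, 0}
    v = _mm_unpacklo_epi32(v, v);       // punpckldq: {x, x, y, y}
    __m128i s = _mm_srai_epi32(v, 31);  // psrad:     {sign(x), sign(x), sign(y), sign(y)}
    v = _mm_and_si128(v, lowMask);      // pand:      {x, 0, y, 0}
    s = _mm_and_si128(s, highMask);     // pand:      {0, sign(x), 0, sign(y)}
    return _mm_or_si128(v, s);          // por:       two sign-extended 64-bit values
}

// Sum `count` ints (count must be even) into a 64-bit total, two lanes at a time.
long long sumSse2(const int* data, int count) {
    __m128i acc = _mm_setzero_si128();
    for (int x = 0; x < count; x += 2) {
        acc = _mm_add_epi64(acc, signExtendPairSse2(data + x));  // paddq
    }
    std::int64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1];  // combine the two partial sums
}
```

That is five vector operations per pair of ints before the actual add. With SSE4.1 available, the whole body of signExtendPairSse2 after the load collapses to a single _mm_cvtepi32_epi64 call (the intrinsic for PMOVSXDQ, declared in smmintrin.h).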

But the point is:

The Intel compiler is trying to vectorize the loop. But the overhead added seems to outweigh the benefit of vectorizing it in the first place. Hence why it's slower.


EDIT : Update with OP's results on:

  • ICC x64 no vectorization
  • ICC x86 with vectorization

You changed the data type to double, so now it's floating-point. There are no more of those ugly sign-fill shifts that were plaguing the integer version.

But since you disabled vectorization for the x64 version, it obviously becomes slower.

ICC x86 with vectorization:

    00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]
    00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]
    00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]
    00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]
    00B810B6  add         edx,8
    00B810B9  cmp         edx,400h
    00B810BF  jb          wmain+9Eh (0B8109Eh)

Not much here - standard vectorization + 4x loop-unrolling.

ICC x64 with no vectorization:

    000000013FC010B2  lea         ecx,[rdx+rdx]
    000000013FC010B5  inc         edx
    000000013FC010B7  cmp         edx,200h
    000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]
    000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]
    000000013FC010C9  jb          wmain+0B2h (13FC010B2h)

No vectorization + only 2x loop-unrolling.

All things equal, disabling vectorization will hurt performance in this floating-point case.
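One way to read the two listings: in the non-vectorized loop both addsd instructions accumulate into the same register (xmm6), so every add waits on the previous one, while the vectorized loop keeps four partial sums (two per register in xmm0 and xmm1). A rough scalar sketch of that difference follows. The function names are made up, and sumFourChains assumes the count is a multiple of 4.

```cpp
// One running total: every addition depends on the result of the previous one.
double sumSingleChain(const double* data, int count) {
    double var = 0.0;
    for (int x = 0; x < count; x++) {
        var += data[x];
    }
    return var;
}

// Four independent partial sums, combined after the loop.
// Assumes `count` is a multiple of 4.
double sumFourChains(const double* data, int count) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int x = 0; x < count; x += 4) {
        s0 += data[x];
        s1 += data[x + 1];
        s2 += data[x + 2];
        s3 += data[x + 3];
    }
    return (s0 + s2) + (s1 + s3);
}
```

Regrouping floating-point additions like this can change rounding in general. Here both versions give the same result because the array only holds small integer values, which double represents exactly.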

answered Oct 06 '22 02:10 by Mysticial