
What is the minimum supported SSE flag that can be enabled on macOS?

Most of the hardware I use supports SSE2 these days. On Windows and Linux, I have some code to test SSE support. I read somewhere that macOS has supported SSE for a long time, but I don't know the minimum version that can be enabled. The final binary will be copied to other macOS machines, so I cannot use -march=native as I would with GCC.

If it is enabled by default on all builds, do I have to pass the -msse or -msse2 flags when building my code?

Here is my compiler version:

Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix

Here is the output of uname -a

uname -a
Darwin mme.local 14.1.0 Darwin Kernel Version 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64

Here is the output of sysctl machdep.cpu.features

machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MON DSCPL VMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 POPCNT
rkm asked Aug 28 '17 10:08


1 Answer

SSE2 is enabled by default for x86-64, because it's a required part of the x86-64 ISA.
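One way to confirm what the compiler treats as the baseline is to look at its predefined macros. The sketch below assumes a GCC- or Clang-compatible compiler, which predefines `__SSE2__` and `__SSSE3__` when those extensions are enabled for the translation unit; the function names are made up for illustration.

```c
#include <stdio.h>

/* Returns 1 if the compiler was allowed to emit SSE2 for this translation
   unit, 0 otherwise. GCC and Clang predefine __SSE2__ when SSE2 is part of
   the target baseline, which is always the case for x86-64. */
static int baseline_has_sse2(void)
{
#if defined(__SSE2__)
    return 1;
#else
    return 0;
#endif
}

/* Same idea for SSSE3, which is only predefined when the build enables it,
   for example with -mssse3 or -march=core2. */
static int baseline_has_ssse3(void)
{
#if defined(__SSSE3__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper that prints what this build assumes. */
static void print_baseline(void)
{
    printf("SSE2:  %d\n", baseline_has_sse2());
    printf("SSSE3: %d\n", baseline_has_ssse3());
}
```

Calling `print_baseline()` from a build with no -m flags and again from a build with -march=core2 should show whether the extra flag changed anything. Running `cc -dM -E -x c /dev/null | grep SSE` gives the same information without writing any code.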

Since Apple has never sold any AMD or Pentium4 CPUs, x86-64 on OS X also implies SSSE3 (first-gen Core2). The first x86 Macs were Core (not Core2), but they were 32-bit only. You unfortunately can't assume SSE4.1 or -mpopcnt.

I'd suggest -march=core2 -mtune=haswell. (-mtune doesn't affect compatibility, and Haswell tuning shouldn't be bad for actual Core2 or Nehalem hardware. See http://agner.org/optimize/ and links in the x86 tag wiki for microarchitecture details about what things in (compiler-generated) assembly language are fast or slow on different CPUs.)

(See How does mtune actually work? for an example of different tuning causing different instruction selection without changing the required ISA extensions.)

-march=core2 enables everything that core2 supports, not just SSSE3. Since you don't care about your code performing well on AMD CPUs (because it's OS X), you can tune for an Intel CPU. There's also -mtune=intel which is more generic, but Haswell should be reasonable.

You might be missing out on support for Hackintosh systems where someone installed OS X on an ancient CPU on non-Apple hardware, but IDK if OS X would work on an AMD Athlon64 / PhenomII, or Intel P4.

It would be nice to be able to enable some Nehalem stuff like -mpopcnt, but Core 2 first and 2nd gen (Conroe and Penryn) lacked that. Even SSE4.1 isn't available on first-gen Core 2.
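If popcnt matters for one hot function, it can still be used without raising the baseline by checking for it at run time. This is a minimal sketch, not something from the original answer: it assumes a GCC- or Clang-compatible compiler that provides `<cpuid.h>` and the `target` function attribute (older Apple LLVM releases may not accept the attribute), and every function name here is hypothetical.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* __get_cpuid, shipped with GCC and Clang */

/* CPUID leaf 1 reports POPCNT support in ECX bit 23. */
static int cpu_has_popcnt(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 23) & 1u);
}

/* Only this function is compiled with popcnt enabled; the rest of the
   file keeps the -march=core2 baseline. */
static __attribute__((target("popcnt"))) unsigned popcount_hw(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}
#else
/* Non-x86 builds: never take the hardware path. */
static int cpu_has_popcnt(void) { return 0; }
static unsigned popcount_hw(uint64_t x) { (void)x; return 0; }
#endif

/* Portable fallback: clear the lowest set bit until none are left. */
static unsigned popcount_sw(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Dispatcher: picks the hardware path only when CPUID says it is safe. */
static unsigned popcount64(uint64_t x)
{
    return cpu_has_popcnt() ? popcount_hw(x) : popcount_sw(x);
}
```

In real code the CPUID result would normally be cached (or a function pointer set once at startup) rather than queried on every call.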


It's also possible to build a fat binary with baseline and Haswell slices, x86_64 and x86_64h. Stephen Canon says (in comments below) that "the x86_64h slice will run automatically on Haswell and later µarches". (Slices for other uarches aren't currently an option, but most programs would get little benefit.)

Your x86_64 (non-Haswell) slice should probably build with -march=core2 -mtune=sandybridge.
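With a fat binary, the same source file is compiled once per slice, so the usual way to take advantage of the Haswell slice is to select code paths on predefined macros. The sketch below assumes the x86_64h slice is compiled with Haswell's extensions enabled, so that `__AVX2__` is predefined there, while the baseline slice only has `__SSE2__`; `sum_u32` and `SLICE_NAME` are illustrative names, not part of any toolchain.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#define SLICE_NAME "avx2"     /* expected for the Haswell slice */
#elif defined(__SSE2__)
#include <emmintrin.h>
#define SLICE_NAME "sse2"     /* expected for the baseline x86_64 slice */
#else
#define SLICE_NAME "scalar"   /* non-x86 or SSE disabled */
#endif

/* Sums n 32-bit values with wraparound, using the widest vectors the
   current slice is allowed to use. */
static uint32_t sum_u32(const uint32_t *p, size_t n)
{
    size_t i = 0;
    uint32_t total = 0;
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i *)(p + i)));
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    for (int k = 0; k < 8; k++)
        total += lanes[k];
#elif defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(p + i)));
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    for (int k = 0; k < 4; k++)
        total += lanes[k];
#endif
    /* Scalar tail, and the whole loop when no vector path was compiled in. */
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

Both slices export the same `sum_u32`; which body runs depends only on which slice the loader picked, so no run-time dispatch code is needed in the program itself.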

Haswell introduced AVX2, FMA, and BMI2, so -march=haswell is very nice for Broadwell / Skylake / Kaby Lake / Coffee Lake. (For tuning options as well as ISA extensions: gcc -march=haswell disables -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, while -mavx with tune=default or sandybridge enables them. Splitting sucks on Haswell, especially when it creates shuffle-port bottlenecks. And it's really dumb when your data is almost always aligned, or really always aligned but you just didn't tell the compiler about it.)
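For the "always aligned but the compiler doesn't know" case, one option is to state the alignment explicitly. This is a small sketch assuming a GCC- or Clang-compatible compiler that provides `__builtin_assume_aligned`; the function `scale_aligned` is hypothetical, and the promise is the caller's responsibility: passing a pointer that is not 32-byte aligned is undefined behavior.

```c
#include <stddef.h>

/* Multiplies n floats in place by k. The caller promises that data is
   32-byte aligned, which lets the auto-vectorizer use aligned 256-bit
   accesses instead of split or unaligned ones. */
static void scale_aligned(float *data, size_t n, float k)
{
    float *p = __builtin_assume_aligned(data, 32);
    for (size_t i = 0; i < n; i++)
        p[i] *= k;
}
```

A buffer declared with `__attribute__((aligned(32)))`, or obtained from `posix_memalign` with a 32-byte alignment, satisfies the promise.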

Broadwell introduced ADOX/ADCX which is pretty niche (run two extended-precision add dependency chains in parallel), and Skylake introduced clflushopt which isn't widely useful.

Skylake and most Broadwell CPUs do have working transactional memory, though, which might be important for some fine-grained multithreading cases. (Haswell was going to have it, but it was disabled in a microcode update after a rare bug was discovered in the implementation.)

AVX512 is the next big thing that's widely useful but Haswell doesn't have, so maybe Apple will add support for a Cannonlake or Ice Lake slice at some point.

I wouldn't recommend making a separate build for Broadwell or Skylake (with any dispatching mechanism), unless you know you can take advantage of a specific new feature and it makes a significant difference.

A separate build could be useful for Sandybridge, though: it would get AVX support without AVX2, which helps especially for 256-bit FP math but also saves movdqa instructions in 128-bit integer vector code. It would also get SSE4.x and popcnt, and there are no partial-flag problems in an extended-precision adc loop using dec/jnz.

Peter Cordes answered Oct 03 '22 19:10