Most of the hardware I use supports SSE2 these days. On Windows and Linux, I have some code to test for SSE support. I read somewhere that macOS has supported SSE for a long time, but I don't know which SSE version I can assume as a minimum. The final binary will be copied to other macOS machines, so I cannot use -march=native like with GCC.
If it is enabled by default on all builds, do I have to pass the -msse or -msse2 flags when building my code?
Here is my compiler version:
Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix
Here is the output of uname -a
uname -a
Darwin mme.local 14.1.0 Darwin Kernel Version 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64
Here is the output of sysctl machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MON DSCPL VMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 POPCNT
SSE2 is enabled by default for x86-64, because it's a required part of the x86-64 ISA.
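One way to confirm this for a given toolchain is to look at the macros the compiler predefines. The sketch below is only an illustration: the function names are made up for this example, while `__SSE2__`, `__SSSE3__`, `__SSE4_1__`, `__POPCNT__` and `__AVX2__` are the macros GCC and clang define when the corresponding extension is enabled at compile time.

```c
#include <string.h>

/* Hypothetical helper: returns 1 when the build matches the claim above,
   i.e. either the target is not x86-64, or __SSE2__ is predefined. */
int x86_64_baseline_has_sse2(void)
{
#if defined(__x86_64__) && !defined(__SSE2__)
    return 0;   /* only reachable if SSE2 was explicitly disabled */
#else
    return 1;
#endif
}

/* Hypothetical helper: writes a space-separated list of the extensions the
   compiler was allowed to use into buf (always NUL-terminated). */
void compile_time_isa(char *buf, size_t size)
{
    if (size == 0)
        return;
    buf[0] = '\0';
#ifdef __SSE2__
    strncat(buf, "SSE2 ", size - strlen(buf) - 1);
#endif
#ifdef __SSSE3__
    strncat(buf, "SSSE3 ", size - strlen(buf) - 1);
#endif
#ifdef __SSE4_1__
    strncat(buf, "SSE4.1 ", size - strlen(buf) - 1);
#endif
#ifdef __POPCNT__
    strncat(buf, "POPCNT ", size - strlen(buf) - 1);
#endif
#ifdef __AVX2__
    strncat(buf, "AVX2 ", size - strlen(buf) - 1);
#endif
}
```

Calling `compile_time_isa` from a small `main` and printing the buffer shows what a particular set of `-m` flags turned on. This reports what the compiler may emit, not what the CPU running the binary supports.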
Since Apple has never sold any AMD or Pentium4 CPUs, x86-64 on OS X also implies SSSE3 (first-gen Core2). The first x86 Macs were Core (not Core2), but they were 32-bit only. You unfortunately can't assume SSE4.1 or -mpopcnt.
I'd suggest -march=core2 -mtune=haswell. (-mtune doesn't affect compatibility, and Haswell tuning shouldn't be bad for actual Core2 or Nehalem hardware. See http://agner.org/optimize/ and the links in the x86 tag wiki for microarchitecture details about what is fast or slow in (compiler-generated) assembly language on different CPUs.)
(See How does mtune actually work? for an example of different tuning causing different instruction selection without changing the required ISA extensions.)
-march=core2 enables everything that core2 supports, not just SSSE3. Since you don't care about your code performing well on AMD CPUs (because it's OS X), you can tune for an Intel CPU. There's also -mtune=intel, which is more generic, but Haswell should be reasonable.
You might be missing out on support for Hackintosh systems where someone installed OS X on an ancient CPU on non-Apple hardware, but IDK if OS X would work on an AMD Athlon64 / PhenomII, or Intel P4.
It would be nice to be able to enable some Nehalem stuff like -mpopcnt, but first- and second-gen Core 2 (Conroe and Penryn) lacked it. Even SSE4.1 isn't available on first-gen Core 2.
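If popcnt matters for a hot loop, one option is to keep the core2 baseline and check for the instruction at run time. What follows is a minimal sketch, not something from the original answer: the function names are invented for illustration, it assumes a GCC or clang new enough to accept `__attribute__((target("popcnt")))`, and it uses `__get_cpuid` from `<cpuid.h>` (CPUID leaf 1, ECX bit 23 reports POPCNT).

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      /* __get_cpuid, shipped with GCC and clang */
#define ON_X86 1
#else
#define ON_X86 0
#endif

/* Returns 1 if the running CPU reports POPCNT, 0 otherwise
   (always 0 on non-x86 targets). */
int cpu_has_popcnt(void)
{
#if ON_X86
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 23) & 1u);
#else
    return 0;
#endif
}

/* Portable fallback that needs nothing beyond the baseline ISA. */
unsigned popcount64_portable(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

#if ON_X86
/* Only this function is compiled with POPCNT enabled, so the rest of the
   program still runs on CPUs without it. */
__attribute__((target("popcnt")))
static unsigned popcount64_hw(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}
#endif

/* Picks the hardware version only when the CPU actually supports it. */
unsigned count_bits(uint64_t x)
{
#if ON_X86
    if (cpu_has_popcnt())
        return popcount64_hw(x);
#endif
    return popcount64_portable(x);
}
```

In real code you would probably cache the result of `cpu_has_popcnt()` once at startup, or dispatch at a coarser granularity than a single popcount, since executing CPUID on every call is slow.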
It's also possible to build a fat binary with baseline and Haswell slices, x86_64 and x86_64h. Stephen Canon says (in comments below) that "the x86_64h slice will run automatically on Haswell and later µarches". (Slices for other uarches aren't currently an option, but most programs would get little benefit.)
Your x86_64 (non-Haswell) slice should probably build with -march=core2 -mtune=sandybridge.
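With a two-slice fat binary, the same source file is compiled once per slice, so the preprocessor can pick the code path. The sketch below assumes that the Haswell slice is built with AVX2 and FMA enabled (I believe that is what -arch x86_64h or -march=haswell gives you, but check the predefined macros on your toolchain); `slice_kind` and `saxpy` are names made up for this example.

```c
#include <stddef.h>

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#define HASWELL_SLICE 1
#else
#define HASWELL_SLICE 0
#endif

/* Reports which variant of this translation unit was compiled. */
const char *slice_kind(void)
{
    return HASWELL_SLICE ? "haswell" : "baseline";
}

/* y[i] += a * x[i] for i in [0, n). */
void saxpy(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;
#if HASWELL_SLICE
    /* 8 floats per iteration with one fused multiply-add. */
    const __m256 va = _mm256_set1_ps(a);
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
    }
#endif
    /* Scalar loop: the whole job in the baseline slice, the tail otherwise. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

In many cases you don't need the intrinsics at all: plain C compiled with the right -march for each slice lets the auto-vectorizer use whatever that slice allows. The explicit version is only worth it when the compiler's output isn't good enough.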
Haswell introduced AVX2, FMA, and BMI2, so -march=haswell is very nice for Broadwell / Skylake / Kaby Lake / Coffee Lake. (This goes for tuning options as well as ISA extensions: gcc -march=haswell disables -mavx256-split-unaligned-load and -store, while -mavx with the default tuning or -mtune=sandybridge enables them. Splitting sucks on Haswell, especially when it creates shuffle-port bottlenecks. And it's really dumb when your data is almost always aligned, or really always aligned but you just didn't tell the compiler about it.)
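When the data really is always aligned, you can tell the compiler so. A minimal sketch using `__builtin_assume_aligned`, which GCC and clang both provide; `scale_aligned32` is a hypothetical name, and whether this changes the generated loads and stores depends on the compiler version and tuning options.

```c
#include <stddef.h>

/* Multiplies n floats in place by k.
   The caller promises that p is 32-byte aligned; passing a pointer that
   isn't is undefined behaviour. */
void scale_aligned32(float *p, size_t n, float k)
{
    float *ap = __builtin_assume_aligned(p, 32);
    for (size_t i = 0; i < n; ++i)
        ap[i] *= k;
}
```

The buffer itself has to be allocated with that alignment, for example with `__attribute__((aligned(32)))` on an array or with `posix_memalign`.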
Broadwell introduced ADOX/ADCX, which is pretty niche (run two extended-precision add dependency chains in parallel), and Skylake introduced clflushopt, which isn't widely useful.
Skylake and most Broadwell CPUs do have working transactional memory, though, which might be important for some fine-grained multithreading cases. (Haswell was going to have it, but it was disabled in a microcode update after a rare bug was discovered in the implementation.)
AVX512 is the next big thing that's widely useful but Haswell doesn't have, so maybe Apple will add support for a Cannonlake or Ice Lake slice at some point.
I wouldn't recommend making a separate build for Broadwell or Skylake (with any dispatching mechanism), unless you know you can take advantage of a specific new feature and it makes a significant difference.
But a separate build could potentially be useful for Sandybridge: AVX support without AVX2, especially for 256-bit FP math, but also to save movdqa instructions in 128-bit integer vector code. Also SSE4.x and popcnt. And no partial-flag problems in an extended-precision adc loop using dec/jnz.