I am running jobs on a cluster composed of machines with different architectures:
gcc -march=native -Q --help=target | grep -- '-march=' | cut -f3
gives me one of these: broadwell, haswell, ivybridge, sandybridge or skylake.
The executable needs to be the same, so I cannot use -march=native, but at the same time the architectures have things in common (I think they all support AVX?).
I am aware that gcc (contrary to Intel icc) does not allow for multiple architectures in a single executable. What I would like to know is if there is a way to ask gcc for the highest set of instructions compatible with all the architectures listed above.
gcc version: 8.1.1
Intel has never removed instruction sets from later generations of the same CPU line, i.e. a binary that works on an old Intel CPU always works on a newer Intel CPU.
(The one exception to this is first-gen Xeon Phi: Knights Corner used an incompatible variant of AVX512 called KNI, but later Xeon Phi accelerator cards / computers use AVX512.)
If you must use the same binary on all CPUs, use gcc -march=sandybridge -mtune=haswell, and make sure your important arrays are aligned by 32 bytes.
Maybe worth benchmarking with gcc -march=sandybridge (i.e. with tune=sandybridge) as well, to see which works better for your code. -mprefer-avx128 or -mprefer-vector-width=256 might be interesting to try: some loops get messy when gcc auto-vectorizes with 256-bit vectors.
SnB/IvB have inefficient misaligned AVX loads/stores, so tune=sandybridge sets -mavx256-split-unaligned-load, which sucks a lot if your data is aligned at runtime but the compiler didn't know that. The extra instructions and shuffles aren't helpful on Haswell, so -mtune=haswell includes -mno-avx256-split-unaligned-load.
Unfortunately gcc doesn't have a "tune=avx2" option to tune for all CPUs which have AVX2, or an option to tune for the average CPU which supports the instruction sets you enabled (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568). Your only choices are to tune for a specific CPU, to tune for the generic baseline, or to use specific tuning options.
Runtime dispatch with ifunc
gcc can dispatch between several versions of a function at run time, but you have to activate it in the source for specific functions. See https://lwn.net/Articles/691932/ for more about function multi-versioning.
Separate binaries per architecture, selected by a $PATH setting
On each cluster node, create a /etc/node-type or whatever, which has sandybridge or haswell or whatever. Any per-node filesystem is fine, or re-detect it at run time with gcc or something cheaper. In your job script:
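#!/bin/sh
# Pick the binary directory matching this node's CPU type.
# $(cat file) is used because $(<file) is a bash/ksh extension, not POSIX sh.
bin_dir="./bin-$(cat /etc/node-type)"
exec "$bin_dir/my_prog" "$@"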
Create symlinks as necessary to make bin-skylake and bin-broadwell use the Haswell binaries.
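For the "something cheaper" than gcc, one possibility is to classify a node from the CPU flag list that Linux exposes in /proc/cpuinfo. The following is only a sketch under assumptions: the function name node_type_from_flags is made up for illustration, and the choice of AVX2 + FMA + BMI2 as the test for "Haswell or newer" is a simplification of what -march=haswell actually enables.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): map a CPU flag list,
# as found on the "flags" line of /proc/cpuinfo, to a binary-set name.
node_type_from_flags() {
    # Pad with spaces so each flag is matched as a whole word
    # (so "avx512f" is not mistaken for "avx").
    flags=" $1 "
    case "$flags" in
        *" avx2 "*)
            case "$flags" in
                *" fma "*)
                    case "$flags" in
                        *" bmi2 "*) echo haswell; return 0 ;;
                    esac
                    ;;
            esac
            ;;
    esac
    case "$flags" in
        *" avx "*) echo sandybridge ;;
        *) echo unsupported; return 1 ;;
    esac
}

# Possible use on a node (run once, as root, to create the per-node file):
#   node_type_from_flags "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)" > /etc/node-type
```

Broadwell and Skylake nodes report all three flags, so they map to haswell without needing their own entries.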
Haswell introduced AVX2 and FMA, and BMI1/2. If you're number-crunching, you really want FMA. BDW/SKL didn't introduce any significant ISA extensions that compilers can use to make your code run faster. Tuning for BDW/SKL is not different either.
If you have any Skylake-avx512 CPUs, that's different.
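The per-architecture directories that the job script expects could be produced along these lines. This is a sketch, not the answerer's build procedure: the source file name my_prog.c, the -O3 level and the build_variant function are assumptions for illustration. Only the -march values and the bin-<arch>/my_prog layout come from the text above.

```shell
#!/bin/sh
# Hypothetical build helper: compile my_prog.c (assumed file name) once per
# target architecture into bin-<arch>/my_prog.
build_variant() {
    arch=$1
    mkdir -p "bin-$arch" || return 1
    # -march=<arch> also implies -mtune=<arch>.
    # CC can be overridden, e.g. CC=echo prints the arguments instead of compiling.
    ${CC:-gcc} -O3 -march="$arch" -o "bin-$arch/my_prog" my_prog.c
}

# Possible use:
#   build_variant sandybridge && build_variant haswell
```

Only two builds are needed for the CPUs in the question, since Broadwell and Skylake can reuse the Haswell binary.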
Comments suggested that I work out the 'intersection' of the architectures myself. The following bash script seems to do the job.
#!/usr/bin/env bash
archs=("broadwell" "haswell" "ivybridge" "sandybridge" "skylake")
# Dump the target options gcc reports for each architecture.
for ar in "${archs[@]}"; do
    gcc -march="$ar" -Q --help=target | grep -- " -m" > "$ar.log"
done
# Join the per-architecture lists on the option name.
cp "${archs[0]}.log" all.log
for ar in "${archs[@]:1}"; do
    join all.log "$ar.log" > tmp.log
    mv tmp.log all.log
done
# Keep options enabled everywhere and never disabled, on one line.
grep "\[activé]" all.log | grep -v "\[désactivé]" | cut -d' ' -f1 | tr '\n' ' '
(Computer in French: "activé" => "enabled", "désactivé" => "disabled")
The output is
-m128bit-long-double -m64 -m80387 -maes -malign-stringops -mavx -mcx16 -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mhard-float -mieee-fp -mlong-double-80 -mmmx -mpclmul -mpopcnt -mpush-args -mred-zone -msahf -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2 -mssse3 -mstv -mtls-direct-seg-refs -mvzeroupper -mxsave -mxsaveopt
As I expected, all the architectures support both SSE and AVX.