GCC highest set of instructions compatible with multiple architectures

Question

I am running jobs on a cluster composed of machines with different architectures: gcc -march=native -Q --help=target | grep -- '-march=' | cut -f3 gives me one of these: broadwell, haswell, ivybridge, sandybridge or skylake.

The executable needs to be the same, so I cannot use -march=native but at the same time the architectures have things in common (I think they all support AVX?).

I am aware that gcc (unlike Intel's icc) does not allow for multiple architectures in a single executable. What I would like to know is whether there is a way to ask gcc for the highest set of instructions compatible with all the architectures listed above.

gcc version: 8.1.1

Peter Cordes · Accepted Answer

Intel has never removed instruction sets from later generations of the same CPU line, i.e. a binary that works on an old Intel CPU also works on a newer one.

(The one exception to this is first-gen Xeon Phi: Knights Corner used an incompatible variant of AVX512 called KNI, but later Xeon Phi accelerator cards / computers use AVX512.)

If you must use the same binary on all CPUs, use gcc -march=sandybridge -mtune=haswell, and make sure your important arrays are aligned by 32 bytes.

Maybe worth benchmarking with gcc -march=sandybridge (i.e. with tune=sandybridge) as well, to see which works better for your code. -mprefer-avx128 or -mprefer-vector-width=256 might be interesting to try: some loops get messy when gcc auto-vectorizes with 256-bit vectors.

SnB/IvB have inefficient misaligned AVX loads/stores, so tune=sandybridge sets -mavx256-split-unaligned-load, which sucks a lot if your data is aligned at runtime but the compiler didn't know that. The extra instructions and shuffles aren't helpful on Haswell, so -mtune=haswell includes -mno-avx256-split-unaligned-load.

Unfortunately gcc doesn't have a "tune=avx2" option to tune for all CPUs which have AVX2, or an option to tune for the average CPU which supports the instruction sets you enabled (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568). Your only choices are to tune for a specific CPU, tune for the generic baseline, or set specific tuning options yourself.
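To compare the tuning choices discussed above, something like the following sketch could build one binary per candidate flag set for benchmarking. The source file name `my_prog.c`, the output names, and the helper functions are placeholders, not part of the original answer; the gcc flags are the ones named in the text, plus `-mno-avx256-split-unaligned-store` as the store-side counterpart of the load option.

```sh
#!/bin/sh
# Sketch: build one benchmark binary per candidate tuning.
# CC, SRC and the my_prog-* output names are hypothetical placeholders.
CC="${CC:-gcc}"
SRC="${SRC:-my_prog.c}"

build_variant() {
    name=$1; shift
    # The ISA baseline stays at Sandy Bridge so every node can run the result;
    # only the tuning options passed in "$@" differ between variants.
    $CC -O3 -march=sandybridge "$@" -o "my_prog-$name" "$SRC"
}

build_all() {
    build_variant tune-snb
    build_variant tune-hsw     -mtune=haswell
    build_variant tune-hsw-128 -mtune=haswell -mprefer-vector-width=128
    build_variant snb-nosplit  -mno-avx256-split-unaligned-load \
                               -mno-avx256-split-unaligned-store
}

# Only build when the placeholder source actually exists.
if [ -f "$SRC" ]; then
    build_all
fi
```

Timing each `my_prog-*` binary on both a Sandy Bridge and a Haswell node should show whether the tuning differences matter for your code.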

GCC does have some support for runtime dispatch with `ifunc`

You have to opt in per function in the source, for example with GCC's `target_clones` function attribute. See https://lwn.net/Articles/691932/ for more about function multi-versioning.

Best option: build separate binaries for SnB / Haswell, and dispatch with a script or `$PATH` setting

On each cluster node, create a file such as /etc/node-type containing sandybridge or haswell. Any per-node filesystem is fine, or re-detect the type at run time with gcc or something cheaper. In your job script:

#!/bin/sh

# $(<file) is a bash/ksh extension; cat also works under a plain POSIX /bin/sh
bin_dir="./bin-$(cat /etc/node-type)"
exec "$bin_dir/my_prog"  "$@"

Create symlinks as necessary to make bin-skylake and bin-broadwell use the Haswell binaries.
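The symlink step might look like the sketch below. The `link_variants` helper is made up for illustration, and the `bin-ivybridge` link is an addition that assumes Ivy Bridge nodes should run the Sandy Bridge build.

```sh
#!/bin/sh
# Sketch: point every node type at one of the two real build directories.
# link_variants is a hypothetical helper; $1 is the directory holding bin-*.
link_variants() {
    root=$1
    # Relative targets keep the links valid if the tree is moved or copied.
    ln -sfn bin-haswell     "$root/bin-broadwell"
    ln -sfn bin-haswell     "$root/bin-skylake"
    # Assumption: Ivy Bridge nodes use the Sandy Bridge binaries.
    ln -sfn bin-sandybridge "$root/bin-ivybridge"
}

# Only create the links when both real build directories are present.
if [ -d bin-haswell ] && [ -d bin-sandybridge ]; then
    link_variants .
fi
```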

Haswell introduced AVX2 and FMA, and BMI1/2. If you're number-crunching, you really want FMA. BDW/SKL didn't introduce any significant ISA extensions that compilers can use to make your code run faster. Tuning for BDW/SKL is not different either.
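Because the Haswell build depends on exactly these extensions, a cheap way to pick the node type without invoking gcc is to look at the CPU feature flags. The sketch below assumes a Linux node where /proc/cpuinfo has a `flags` line; the function names are made up, and checking `avx2`, `fma` and `bmi2` is a representative subset of what `-march=haswell` enables, not a complete check.

```sh
#!/bin/sh
# Sketch: classify a node from its CPU feature flags.
# has_flag and node_type_from_flags are hypothetical helper names.

# Succeeds if the space-separated flag list $1 contains the whole word $2.
has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Prints haswell, sandybridge or unknown for a given flag list.
node_type_from_flags() {
    flags=$1
    if has_flag "$flags" avx2 && has_flag "$flags" fma && has_flag "$flags" bmi2; then
        echo haswell
    elif has_flag "$flags" avx; then
        echo sandybridge
    else
        echo unknown
        return 1
    fi
}

# On Linux, classify the current machine from the first "flags" line.
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
    node_type_from_flags "$cpu_flags" || true
fi
```

Taking the flag list as an argument keeps the classification logic testable on machines without /proc/cpuinfo. Since Broadwell and Skylake map straight to haswell here, a job script using this would not need the extra symlinks.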

If you have any Skylake-avx512 CPUs, that's different.

styko · Answer

Comments suggested that I work out the 'intersection' of the architectures myself. The following bash script seems to do the job.

#!/usr/bin/env bash

archs=("broadwell" "haswell" "ivybridge" "sandybridge" "skylake")

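for ar in "${archs[@]}"; do
    # join needs its inputs sorted on the join field (the option name)
    gcc -march="$ar" -Q --help=target | grep -- "  -m" | sort -k 1b,1 > "$ar.log"
done

# Fold the per-architecture lists together, keyed on the option name
cp "${archs[0]}.log" all.log
for ar in "${archs[@]:1}"; do
    join all.log "$ar.log" > tmp.log
    mv tmp.log all.log
done

# Keep options that are enabled on every architecture and disabled on none
grep "\[activé]" all.log | grep -v "\[désactivé]" | cut -d' ' -f1 | tr '\n' ' '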

(My system locale is French: "activé" => "enabled", "désactivé" => "disabled")

The output is

-m128bit-long-double -m64 -m80387 -maes -malign-stringops -mavx -mcx16 -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mhard-float -mieee-fp -mlong-double-80 -mmmx -mpclmul -mpopcnt -mpush-args -mred-zone -msahf -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2 -mssse3 -mstv -mtls-direct-seg-refs -mvzeroupper -mxsave -mxsaveopt

As I expected, all the architectures support both SSE and AVX (but not AVX2 or FMA, which Sandy Bridge and Ivy Bridge lack).

Tags: cpu-architecture, c, gcc
