
GCC highest set of instructions compatible with multiple architectures

I am running jobs on a cluster composed of machines with different architectures: gcc -march=native -Q --help=target | grep -- '-march=' | cut -f3 gives me one of these: broadwell, haswell, ivybridge, sandybridge or skylake.

The executable needs to be the same, so I cannot use -march=native but at the same time the architectures have things in common (I think they all support AVX?).

I am aware that gcc (unlike Intel's icc) does not allow for multiple architectures in a single executable. What I would like to know is whether there is a way to ask gcc for the highest set of instructions compatible with all the architectures listed above.

gcc version: 8.1.1

styko asked Jun 28 '18 08:06


2 Answers

Intel hasn't ever removed instruction sets in future versions of the same CPU. i.e. a binary that works on an old Intel CPU always works on a newer Intel CPU.

(The one exception to this is first-gen Xeon Phi: Knights Corner used an incompatible variant of AVX512 called KNI, but later Xeon Phi accelerator cards / computers use AVX512.)


If you must use the same binary on all CPUs, use gcc -march=sandybridge -mtune=haswell, and make sure your important arrays are aligned by 32 bytes.

Maybe worth benchmarking with gcc -march=sandybridge (i.e. with tune=sandybridge) as well, to see which works better for your code. -mprefer-avx128 or -mprefer-vector-width=256 might be interesting to try: some loops get messy when gcc auto-vectorizes with 256-bit vectors.
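A minimal sketch of how those variants could be built side by side for benchmarking. The names `my_prog.c`, `my_prog-NAME` and the `CC` variable are placeholders, and `-O3` is an assumption (the answer doesn't name an optimization level); only the `-march` / `-mtune` / `-mprefer-avx128` combinations come from the text above.

```shell
#!/bin/sh
# Placeholder compiler variable; override with CC=... if needed.
CC="${CC:-gcc}"

# build_variant NAME FLAGS...
# Compile the placeholder source my_prog.c into my_prog-NAME with the given flags.
# -O3 is an assumption here, not something the answer specifies.
build_variant() {
    name="$1"
    shift
    "$CC" -O3 "$@" -o "my_prog-$name" my_prog.c
}

# Build the three candidates discussed above so they can be timed against each other.
build_all_variants() {
    build_variant snb-tune-hsw -march=sandybridge -mtune=haswell
    build_variant snb-tune-snb -march=sandybridge
    build_variant snb-avx128 -march=sandybridge -mtune=haswell -mprefer-avx128
}
```

Calling `build_all_variants` in the source directory should leave three binaries that can be timed on both a Sandy Bridge node and a Haswell-or-newer node.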


SnB/IvB have inefficient misaligned AVX loads/stores, so tune=sandybridge sets -mavx256-split-unaligned-load, which sucks a lot if your data is aligned at runtime but the compiler didn't know that. The extra instructions and shuffles aren't helpful on Haswell, so -mtune=haswell includes -mno-avx256-split-unaligned-load.

Unfortunately gcc doesn't have a "tune=avx2" option to tune for all CPUs which have AVX2, or an option to tune for the average CPU which supports the instruction sets you enabled (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 ). Your only choices are to tune for a specific CPU, to tune for the generic baseline, or to use specific tuning options.


GCC does have some support for runtime dispatch with ifunc

You have to activate it in the source for specific functions. See https://lwn.net/Articles/691932/ for more about function multi-versioning.


Best option: build separate binaries for SnB / Haswell, and dispatch with a script or $PATH setting

On each cluster node, create a file such as /etc/node-type containing sandybridge or haswell, as appropriate. Any per-node filesystem is fine, or re-detect the CPU type at run time with gcc or something cheaper. In your job script:

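#!/bin/sh

# Pick the binary directory matching this node's CPU type (e.g. ./bin-haswell).
# $(< file) is a bash/ksh extension, so use cat under plain /bin/sh.
bin_dir="./bin-$(cat /etc/node-type)"
exec "$bin_dir/my_prog" "$@"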

Create symlinks as necessary to make bin-skylake and bin-broadwell use the Haswell binaries.
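If maintaining a per-node file is inconvenient, the "something cheaper" re-detection could look at the CPU flags the Linux kernel exposes. This is only a sketch, assuming Linux x86 nodes with /proc/cpuinfo; the function names are made up for illustration, and the avx2 + fma test rests on Haswell being the first of these CPUs to have both.

```shell
#!/bin/sh
# has_flag FLAG "FLAGS STRING": succeed if FLAG appears as a whole word.
has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Map a space-separated CPU flags string to the name of the build to run.
# Assumption: anything with both AVX2 and FMA can run the Haswell build,
# everything else falls back to the Sandy Bridge build.
node_type_from_flags() {
    if has_flag avx2 "$1" && has_flag fma "$1"; then
        echo haswell
    else
        echo sandybridge
    fi
}

# Print the flags of the first CPU listed in /proc/cpuinfo (Linux only).
cpu_flags() {
    sed -n 's/^flags[[:space:]]*:[[:space:]]*//p' /proc/cpuinfo | head -n 1
}
```

In the job script, `bin_dir="./bin-$(node_type_from_flags "$(cpu_flags)")"` would then select the directory directly; with this approach Broadwell and Skylake nodes resolve to the Haswell build without needing the symlinks.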

Haswell introduced AVX2 and FMA, and BMI1/2. If you're number-crunching, you really want FMA. BDW/SKL didn't introduce any significant ISA extensions that compilers can use to make your code run faster. Tuning for BDW/SKL is not significantly different from tuning for Haswell either.

If you have any Skylake-avx512 CPUs, that's different.

Peter Cordes answered Sep 28 '22 08:09


Comments suggested that I work out the 'intersection' between the architectures myself. The following bash script seems to do the job.

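#!/usr/bin/env bash

archs=("broadwell" "haswell" "ivybridge" "sandybridge" "skylake")

# Dump the -m target options gcc reports for each architecture.
for ar in "${archs[@]}"; do
    gcc -march="$ar" -Q --help=target | grep -- "  -m" > "$ar.log"
done

# Join the per-architecture listings on the option name (first field).
cp "${archs[0]}.log" all.log
for ar in "${archs[@]:1}"; do
    join all.log "$ar.log" > tmp.log
    mv tmp.log all.log
done

# Keep only the options that are enabled for every architecture.
grep "\[activé]" all.log | grep -v "\[désactivé]" | cut -d' ' -f1 | tr '\n' ' '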

(Computer in French: "activé" => "enabled", "désactivé" => "disabled")

The output is

-m128bit-long-double -m64 -m80387 -maes -malign-stringops -mavx -mcx16 -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mhard-float -mieee-fp -mlong-double-80 -mmmx -mpclmul -mpopcnt -mpush-args -mred-zone -msahf -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2 -mssse3 -mstv -mtls-direct-seg-refs -mvzeroupper -mxsave -mxsaveopt

As I expected, all the architectures support both SSE and AVX.
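The filtering step depends on the locale-specific labels. A possible locale-independent variant, sketched under the assumption that gcc is run with LC_ALL=C so the labels read [enabled] and [disabled]; the helper name `enabled_flags` is made up for illustration:

```shell
#!/bin/sh
# Read `gcc -Q --help=target` output on stdin and print the -m options
# that are marked [enabled], one per line.
# Assumes English labels, e.g. from running gcc under LC_ALL=C.
enabled_flags() {
    awk '$1 ~ /^-m/ && $2 == "[enabled]" { print $1 }'
}
```

For example, `LC_ALL=C gcc -march=sandybridge -Q --help=target | enabled_flags | tr '\n' ' '` prints the options enabled for Sandy Bridge alone, which can be compared against the list above.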

styko answered Sep 28 '22 08:09