I have seen questions about how to use the FMA instruction set, but before I start using it I'd first like to know whether I can (does my processor support it?). I found a post saying that, on Linux, I need to look at the output of:
more /proc/cpuinfo
to find out. I get this:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 30
model name : Intel(R) Xeon(R) CPU X3470 @ 2.93GHz
stepping : 5
cpu MHz : 2933.235
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5866.47
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
The flags line seems the most relevant, but I am not sure how to tell from that list whether the processor supports these instructions.
Does anybody know how to find that out? Thank you.
I assume you want to detect it in C/C++ at compile-time.
The FP_FAST_FMA macro is not a reliable way to detect the FMA instruction set. This macro is defined in <math.h>/<cmath> if std::fma is faster than x*y+z, which is possible if std::fma is implemented as an intrinsic based on an FMA instruction. Otherwise std::fma falls back to a non-intrinsic function, which is very slow. As of 2016, GCC's default glibc/libstdc++ defines this macro, but most other standard library implementations (including LLVM libc++, ICC's and MSVC's) do not. That doesn't mean they don't implement std::fma as an intrinsic where possible; they just don't define the macro.
Reliable FMA detection
To reliably detect FMA (or any instruction set) at compile time you need to use instruction set specific macros. These macros are defined by the compiler based on the selected target architecture and/or instruction sets.
There is an __FMA__ macro for FMA/FMA3 support and an __FMA4__ macro for AMD FMA4 support. GCC, clang and ICC all define them.
Unfortunately, MSVC doesn't define any instruction-set-specific macros other than __AVX__ and __AVX2__.
Cross-compiler FMA detection
On Intel processors, FMA was introduced together with AVX2 in Haswell.
On AMD processors, things are a little messier. FMA4 was introduced with AVX and XOP in AMD Bulldozer. FMA3 (the equivalent of Intel's FMA) was introduced in AMD Piledriver. You can distinguish Piledriver from its predecessor Bulldozer at compile time by the presence of the FMA (__FMA__ macro) and BMI (__BMI__ macro) instruction sets. Unfortunately, MSVC defines neither.
Nevertheless, as on Intel processors, every AMD processor that supports AVX2 also supports FMA/FMA3.
If you want cross-compiler detection of whether the target architecture supports FMA/FMA3, you must check the __AVX2__ macro, since all major compilers (including MSVC) define it when AVX2 is enabled:
#if !defined(__FMA__) && defined(__AVX2__)
#define __FMA__ 1
#endif
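As a rough illustration of how such a check might be used, here is a minimal sketch in C. The names HAVE_FMA3 and mul_add are hypothetical and chosen only for this example; the sketch uses its own macro name instead of defining __FMA__, since identifiers beginning with a double underscore are reserved for the implementation. It assumes a GCC-compatible compiler for the __builtin_fma path and falls back to a plain multiply-add everywhere else.

```c
/* HAVE_FMA3 and mul_add are hypothetical names used only for illustration. */

/* 1 if the build target is known to support FMA3, 0 otherwise.
   __AVX2__ is used as a proxy on compilers that do not define __FMA__. */
#if defined(__FMA__) || defined(__AVX2__)
#  define HAVE_FMA3 1
#else
#  define HAVE_FMA3 0
#endif

/* Computes x*y + z. */
double mul_add(double x, double y, double z)
{
#if defined(__FMA__) && defined(__GNUC__)
    /* GCC-style builtin; only used when the compiler itself reports FMA,
       so it should not fall back to a slow library call. */
    return __builtin_fma(x, y, z);
#else
    /* Plain expression: rounded twice unless the compiler contracts it. */
    return x * y + z;
#endif
}
```

Calling mul_add(2.0, 3.0, 1.0) gives 7.0 on either path. For inputs where the intermediate product is not exactly representable, the two paths may differ in the last bit, because a fused multiply-add rounds only once.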
Unfortunately, there is no reliable way to detect AMD FMA4 using only the __AVX__ and __AVX2__ macros.
Notes
FMA instructions are only actually available in your program if the compiler has enabled them. With GCC and clang you need to set a suitable target architecture (such as -march=haswell) or enable the FMA instruction set manually with the -mfma flag. ICC enables FMA automatically with the -xavx2 flag. MSVC enables FMA with the /arch:AVX2 /fp:fast /O2 options.
AMD announced that it will drop support of FMA4 in the future.
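The compile-time macros above describe what the compiler was told to target, not the machine the binary eventually runs on. If what you need is an answer for the current CPU, a run-time query is one option. The following is a minimal sketch that assumes GCC or clang on x86, where the __builtin_cpu_init and __builtin_cpu_supports builtins are available. The helper names cpu_has_fma3 and cpu_has_fma4 are hypothetical, and on any other compiler or architecture the sketch simply reports "not detected".

```c
/* cpu_has_fma3 and cpu_has_fma4 are hypothetical helper names. */

/* Returns 1 if the running CPU reports FMA3 support, 0 otherwise. */
int cpu_has_fma3(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* populate the compiler's CPU feature data */
    return __builtin_cpu_supports("fma") ? 1 : 0;
#else
    return 0;  /* other compiler or non-x86 target: not detected */
#endif
}

/* Returns 1 if the running CPU reports AMD FMA4 support, 0 otherwise. */
int cpu_has_fma4(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma4") ? 1 : 0;
#else
    return 0;
#endif
}
```

A program could call cpu_has_fma3() once at startup and choose between an FMA code path and a generic one. The FMA path itself still has to be compiled with FMA enabled, for example in a separate translation unit built with -mfma.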
Yes. If your processor supports it, fma will appear in the flags line. On an Intel Haswell machine I get
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm
and on an AMD Piledriver, I get
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold bmi1
(note that it includes an fma4 flag as well as the standard fma flag).
So an easy way to check on Linux is to look at the return code of:
grep fma < /proc/cpuinfo
OS X doesn't have /proc/cpuinfo, but you can run this instead:
sysctl -n hw.optional.fma
which will print 0 (no fma) or 1 (has fma).
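One caveat with the Linux grep check: the pattern fma also matches fma4 as a substring, so a machine that reports only fma4 would still produce a match (grep's -w option restricts matches to whole words). The same whole-token idea can be sketched in C for a program that has already read the flags line into a string. The helper name has_cpu_flag is hypothetical, and the sketch assumes the flags are separated by spaces or tabs, as in the listings above.

```c
#include <stddef.h>
#include <string.h>

/* has_cpu_flag is a hypothetical helper name.
   Returns 1 if `flag` occurs in `flags` as a whole whitespace-delimited
   token, 0 otherwise. */
int has_cpu_flag(const char *flags, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = flags;

    if (len == 0)
        return 0;

    while ((p = strstr(p, flag)) != NULL) {
        /* The match must start at the beginning or right after whitespace. */
        int starts_token = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        /* The match must be followed by whitespace or the end of the string. */
        char next = p[len];
        int ends_token = (next == '\0' || next == ' ' ||
                          next == '\t' || next == '\n');

        if (starts_token && ends_token)
            return 1;
        p += 1;  /* keep searching after this partial match */
    }
    return 0;
}
```

With this helper, a line containing only fma4 does not count as having fma, while a line like the Piledriver one above reports both.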
If you're using C/C++, you can also use the FP_FAST_FMA macro.