Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Half-precision floating-point arithmetic on Intel chips

Is it possible to perform half-precision floating-point arithmetic on Intel chips?

I know how to load/store/convert half-precision floating-point numbers [1] but I do not know how to add/multiply them without converting to single-precision floating-point numbers.

[1] https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

like image 473
Kadir Avatar asked Apr 24 '18 07:04

Kadir


2 Answers

related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture - has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info.

Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE754 binary16 format as F16C conversion instructions, not brain-float. And AVX512-FP16 has support for most math operations, unlike BF16 which just has conversion to/from single and dot product accumulating pairs into single-precision.

This also applies to Alder Lake, on systems with the E cores disabled and AVX-512 specifically enabled in the BIOS (which apparently isn't officially supported as of now; only some mobo vendors have options for this.)

(The rest of the answer isn't updated for Sapphire Rapids / Alder Lake having FP16 / BF16.)


With the on-chip GPU

Is it possible to perform half-precision floating-point arithmetic on Intel chips?

Yes, apparently the on-chip GPU in Skylake and later has hardware support for FP16 and FP64, as well as FP32. With new enough drivers you can use it via OpenCL.

On earlier chips you get about the same throughput for FP16 vs. FP32 (probably just converting on the fly for nearly free), but on SKL / KBL chips you get about double the throughput of FP32 for GPGPU Mandelbrot (note the log-scale on the Mpix/s axis of the chart in that link).

The gain in FP64 (double) performance was huge, too, on Skylake iGPU.


With AVX / AVX-512 instructions

But on the IA cores (Intel-Architecture) no; even with AVX512 there's no hardware support for anything but converting them to single-precision. This saves memory bandwidth and can certainly give you a speedup if your code bottlenecks on memory. But it doesn't gain in peak FLOPS for code that's not bottlenecked on memory.

You could of course implement software floating point, possibly even in SIMD registers, so technically the answer is still "yes" to the question you asked, but it won't be faster than using the F16C VCVTPH2PS / VCVTPS2PH instructions + packed-single vmulps / vfmadd132ps HW support.

Use HW-supported SIMD conversion to/from float / __m256 in x86 code to trade extra ALU conversion work for reduced memory bandwidth and cache footprint. But if cache-blocking (e.g. for well-tuned dense matmul) or very high computational intensity means you're not memory bottlenecked, then just use float and save on ALU operations.


Upcoming: bfloat16 (Brain Float) and AVX512 BF16

A new 16-bit FP format with the same exponent range as IEEE binary32 has been developed for neural network use-cases. Compared to IEEE binary16 like x86 F16C conversion instructions use, it has much less significand precision, but apparently neural network code cares more about dynamic range from a large exponent range. This allows bfloat hardware not to even bother supporting subnormals.

Some upcoming Intel x86 CPU cores are will have HW support this format. The main use-case is still dedicated neural network accelerators (Nervana) and GPGPU type devices, but HW-supported conversion at least is very useful.

https://en.wikichip.org/wiki/brain_floating-point_format has more details, specifically that Cooper Lake Xeon and Core X CPUs are expected to support AVX512 BF16.

I haven't seen it mentioned for Ice Lake (Sunny Cove microarch). That could go either way, I wouldn't care to guess.

Intel® Architecture Instruction Set Extensions and Future Features Programming Reference revision -036 in April 2019 added details about BF16, including that it's slated for "Future, Cooper Lake". Once it's released, the documentation for the instructions will move to the main vol.2 ISA ref manual (and the pdf->HTML scrape at https://www.felixcloutier.com/x86/index.html).

https://github.com/HJLebbink/asm-dude/wiki has instructions from vol.2 and the future-extensions manual, so you can already find it there.

There are only 3 instructions: conversion to/from float, and a BF16 multiply + pairwise-accumulate into float. (First horizontal step of a dot-product.) So AVX512 BF16 does finally provide true computation for 16-bit floating point, but only in this very limited form that converts the result to float.

They also ignore MXCSR, always using the default rounding mode and DAZ/FTZ, and not setting any exception flags.

  • VCVTNEPS2BF16 [xxy]mm1{k1}{z}, [xyz]mm2/m512/m32bcst
    ConVerT (No Exceptions) Packed Single 2(to) BF16
    __m256bh _mm512_cvtneps_pbh (__m512);

The other two don't support memory fault-suppression (when using masking with a memory source operand). Presumably because the masking is per destination element, and there are a different number of source elements. Conversion to BF16 apparently can suppress memory faults, because the same mask can apply to the 32-bit source elements as the 16-bit destination elements.

  • VCVTNE2PS2BF16 [xyz]mm1{k1}{z}, [xyz]mm2, [xyz]mm3/m512/m32bcst
    ConVerT (No Exceptions) 2 registers of Packed Single 2(to) BF16.
    _m512bh _mm512_cvtne2ps_pbh (__m512, __m512);

  • VDPBF16PS [xyz]mm1{k1}{z}, [xyz]mm2, [xyz]mm3/m512/m32bcst
    Dot Product of BF16 Pairs Accumulated into Packed Single Precision
    __m512 _mm512_dpbf16_ps(__m512, __m512bh, __m512bh); (Notice that even the unmasked version has a 3rd input for the destination accumulator, like an FMA).

      # the key part of the Operation section:
      t ← src2.dword[ i ]  (or  src.dword[0] for a broadcast memory source)
      srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+1]) * make_fp32(t.bfloat[1])
      srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+0]) * make_fp32(t.bfloat[0])
    

So we still don't get native 16-bit FP math that you can use for arbitrary things while keeping your data in 16-bit format for 32 elements per vector. Only FMA into 32-bit accumulators.


BTW, there are other real-number formats that aren't based on the IEEE-754 structure of fixed-width fields for sign/exponent/significand. One that's gaining popularity is Posit. https://en.wikipedia.org/wiki/Unum_(number_format), Beating Floating Point at its Own Game: Posit Arithmetic, and https://posithub.org/about

Instead of spending the whole significand coding space on NaNs, they use it for tapered / gradual overflow, supporting larger range. (And removing NaN simplifies the HW). IEEE floats only support gradual underflow (with subnormals), with hard overflow to +-Inf. (Which is usually an error/problem in real numerical simulations, not much different from NaN.)

The Posit encoding is sort of a variable width exponent, leaving more precision near 1.0. The goal is to allow using 32-bit or 16-bit precision in more cases (instead of 64 or 32) while still getting useful results for scientific computing / HPC, such as climate modeling. Double the work per SIMD vector, and half the memory bandwidth.

There have been some paper designs for Posit FPU hardware, but it's still early days yet and I think only FPGA implementations have really been built. Some Intel CPUs will come with onboard FPGAs (or maybe that's already a thing).

As of mid-2019 I haven't read about any Posit execution units as part of a commercial CPU design, and google didn't find anything.

like image 135
Peter Cordes Avatar answered Nov 13 '22 09:11

Peter Cordes


If you are using all cores I would think that in many cases you are still memory bandwidth bound and half precision floating points would be a win.

like image 41
Avatar Avatar answered Nov 13 '22 10:11

Avatar