I couldn't find much use of SIMD instructions (like SSE/AVX) in the kernel (except one place where they are used to speed up the parity computation for RAID6).
Q1) Is there a specific reason for this, or is it just a lack of use cases?
Q2) What needs to be done today if I want to use SIMD instructions in, say, a device driver?
Q3) How hard would it be to incorporate a framework like ISPC into the kernel (just for experimentation)?
Capable of processing multiple data with a single instruction, SIMD operations are widely used for 3D graphics and audio/video processing in multimedia applications. A number of recently developed processors have instructions for SIMD operations (hereinafter referred to as SIMD instructions).
One way to make use of vector hardware is SIMD intrinsics, which are available in all modern C and C++ compilers. SIMD stands for “single instruction, multiple data”. SIMD instructions are available on many platforms; there's a high chance your smartphone has them too, through the ARM NEON architecture extension.
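To make the intrinsics idea concrete, here is a minimal userspace sketch that adds two float arrays four elements at a time. The function name add_floats is made up for illustration, and the SSE path is only compiled when the compiler defines __SSE__; otherwise the plain scalar loop does all the work.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, ... */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for n floats. */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

#if defined(__SSE__)
    /* Four floats per iteration in one 128-bit vector. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif

    /* Scalar tail (or the whole array on non-SSE targets). */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

This is ordinary userspace code. Inside the kernel the same instructions need the extra care described in the answer below.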
Saving/restoring FPU (including SIMD vector registers) state is more expensive than just integer GP register state. It's simply not worth the cost in most cases.
In Linux kernel code, all you have to do is call kernel_fpu_begin() / kernel_fpu_end() around your code. This is what the RAID drivers do. See http://yarchive.net/comp/linux/kernel_fp.html.
x86 doesn't have any future-proof way to save/restore one or a couple vector registers. (Other than manual save/restore of an xmm register using legacy SSE instructions, potentially causing SSE/AVX transition stalls on Intel CPUs if user-space had the upper halves of any ymm/zmm registers dirty).
The reason legacy SSE works is that some Windows drivers were already doing this when Intel wanted to introduce AVX, so they invented that transition-penalty stuff instead of having legacy SSE instructions zero the upper 128b of ymm registers. (See this for more detail on that design decision.) So basically we can blame Windows binary-only drivers for the SSE/AVX transition-penalty mess.
IDK about non-x86 architectures, and whether the existing SIMD instruction sets have a future-proof way to save/restore a register that will continue to work for longer vectors. ARM32 might, if extensions continue the pattern of using multiple 32-bit FP registers as a single wider register. (e.g. q2 is composed of s8 through s11.) So saving/restoring a couple q registers should be future-proof, if a 256b NEON extension simply lets you use 2 q registers as one 256b register. Or if the new wider vectors are separate, and don't extend the existing registers.