I am working on a C library which compiles/links to a .a file that users can statically link into their code. The library's performance is very important, so I am writing the performance-critical routines in x86-64 assembly.
For some routines, I can get significantly better performance if I use BMI2 instructions than if I stick to the "standard" x86-64 instruction set. Trouble is, BMI2 was introduced fairly recently and some of my users use processors that do not support those instructions.
So, I've written the optimized routines twice, once using BMI2 instructions and once without them. In my current setup, I would distribute two versions of the .a file: a "fast" one that requires support for BMI2 instructions, and a "slow" one that does not.
I am asking if there's a way to simplify this by distributing a single .a file that will dynamically choose the correct implementation based on whether the CPU on which the final application runs supports BMI2 instructions.
Unlike similar questions on StackOverflow, there are two peculiarities here:
... if statement could be significant.
The fastest solution I've come up with so far is to do the following:
... cpuid instruction.
... true or false depending on the result.
I'm not satisfied with this approach because it has two drawbacks:
... cpuid and set a global variable at the beginning of the program, given that I'm distributing a .a file and don't have control over the main function in the final binary.
I'm happy to use C++ here if it offers a better solution, as long as the final library can still be linked with and called from a C program.
Are there any solutions that are more efficient than the one I've detailed above?
x264 uses an init function (which users of the library are required to call before calling anything else, or something like that) to set up a struct of function pointers based on CPUID results. That includes taking into account that pshufb is slow on some early CPUs that support it.
If your functions depend on pdep / pext, you probably want to detect AMD vs. Intel, because AMD's pdep / pext is very slow on Ryzen CPUs before Zen 3 (where it is microcoded) and probably not worth using there, even though it is available. (See https://agner.org/optimize/ for instruction tables.)
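As a rough sketch of that detection step, the following C code reads CPUID through the cpuid.h helpers that ship with GCC and Clang. The function names (cpu_has_bmi2, cpu_is_amd, cpu_family, pdep_pext_worth_using) are made up for illustration, and the rule that AMD families below 0x19 have slow pdep / pext is an assumption that should be checked against the instruction tables before relying on it:

```c
#include <cpuid.h>    /* GCC/Clang CPUID helpers, x86 only */
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: CPUID leaf 7, subleaf 0 reports BMI2 in EBX bit 8. */
static bool cpu_has_bmi2(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;                 /* leaf 7 is not available */
    return ((ebx >> 8) & 1u) != 0;
}

/* Hypothetical helper: compare the CPUID vendor string with "AuthenticAMD". */
static bool cpu_is_amd(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return false;
    memcpy(vendor + 0, &ebx, 4);      /* vendor string order is EBX, EDX, ECX */
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return strcmp(vendor, "AuthenticAMD") == 0;
}

/* Hypothetical helper: display family from CPUID leaf 1 (0 on failure). */
static unsigned int cpu_family(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    unsigned int base = (eax >> 8) & 0xFu;
    unsigned int ext  = (eax >> 20) & 0xFFu;
    return base == 0xFu ? base + ext : base;
}

/* Assumed policy: use pdep/pext only where it is expected to be fast.
   Treating AMD families below 0x19 as slow is an assumption, not a
   documented rule. */
bool pdep_pext_worth_using(void)
{
    if (!cpu_has_bmi2())
        return false;
    if (cpu_is_amd() && cpu_family() < 0x19u)
        return false;
    return true;
}
```

An init function could call pdep_pext_worth_using() once and pick the implementation from the result, rather than looking at the BMI2 bit alone.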
Function pointers are fairly low overhead, about the same as calling a function in a shared library or DLL: the compiler-generated asm that calls your functions uses call [rel funcptr] instead of call func.
The question "CPU dependent code: how to avoid function pointers?" shows a very simple example of it in C, and asks for ways to avoid it. With dynamic linking, you can do CPU detection at dynamic link time, so the dynamic-linking indirection becomes your CPU-dispatch indirection as well (like glibc does for selecting an optimized memcpy implementation).
But with static linking for a .a, just make function pointers that are statically initialized to the baseline versions, and your CPU init function (which hopefully runs before any of the function pointers are dereferenced) rewrites them to point at the best version for the current CPU.
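A minimal sketch of that pattern in C, assuming GCC or Clang on x86-64. The names mylib_extract_bits and mylib_init are hypothetical, and the two C functions only stand in for the hand-written assembly routines:

```c
#include <immintrin.h>
#include <stdint.h>

/* Baseline: portable bit extract, standing in for the non-BMI2 routine. */
static uint64_t extract_bits_baseline(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & -mask;   /* isolate the lowest set mask bit */
        if (src & lowest)
            out |= bit;
        mask &= mask - 1;                 /* clear that mask bit */
    }
    return out;
}

/* BMI2 version: the target attribute lets this one function use pext
   without building the whole file with -mbmi2. */
__attribute__((target("bmi2")))
static uint64_t extract_bits_bmi2(uint64_t src, uint64_t mask)
{
    return _pext_u64(src, mask);
}

/* Statically initialized to the baseline, so a call made before
   mylib_init() still works, just more slowly. */
uint64_t (*mylib_extract_bits)(uint64_t, uint64_t) = extract_bits_baseline;

/* Hypothetical init function: rewrites the pointer once. */
void mylib_init(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        mylib_extract_bits = extract_bits_bmi2;
}
```

Callers then write mylib_extract_bits(src, mask), which compiles to an indirect call through the pointer. If requiring users to call mylib_init() is not acceptable, one option with GCC or Clang is to mark it __attribute__((constructor)) so it runs before main, assuming it lives in an object file that the linker actually pulls in from the .a.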
If you are using gcc, you can get the compiler to implement all the boilerplate code automatically. See the gcc manual page on function multiversioning.
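One form of this that is usable from C is the target_clones attribute. The sketch below assumes GCC 6 or later on a target with ifunc support (for example Linux with glibc); the function low_bits is a made-up example, and this only helps for routines written in C, not for hand-written assembly:

```c
#include <stdint.h>

/* The compiler emits one clone built with BMI2 enabled and one default
   clone, plus a resolver that picks between them at load time. */
__attribute__((target_clones("bmi2", "default")))
uint64_t low_bits(uint64_t x, unsigned n)
{
    /* Keep the low n bits of x, for n in 0..63. */
    return x & ((UINT64_C(1) << n) - 1);
}
```

Callers use low_bits() like any other function; the dispatch happens through the ifunc mechanism rather than through a function pointer you manage yourself.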