This is a question addressed to users experienced with the SSE/AVX instruction families, and in particular those familiar with performance analysis of such code. I have seen many different implementations and approaches, ranging from older SSE2 ones to newer ones, and the web is flooded with such links, but personally I am not deeply experienced in analyzing SSE assembly. Some people point to uops, caches and other details that require low-level knowledge. So I am asking for hints and your personal experience: if you have some time to do a comparison of "what is fastest" and why, which approaches did you look at? The implementation does not have to be very precise: 10-16 bits of single-precision FP accuracy is good enough. More is better, but only if it does not affect speed.
PS. To try to avoid meta discussion, I will describe the task precisely. I am looking for the fastest implementation of

    __m128 sincos(float x)

that returns approximations of sin(x) and cos(x), ignoring special FP values (nan, inf, and so on). If the chosen approach requires argument normalisation, a performant implementation of that step (fmod()) is also part of the question; but the question is not about handling special FP cases. This may be a duplicate, but I have failed to find a similar question here, so please point me to it if one already exists.
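To make the intended contract concrete, here is a minimal scalar reference sketch (not a fast version). The lane layout, sin in lane 0 and cos in lane 1, is my own assumption; any convention that returns both values at once would do:

    #include <immintrin.h>
    #include <cmath>

    // Scalar baseline: a fast SIMD candidate should match this to ~10-16 bits.
    static inline __m128 sincos_ref(float x)
    {
        // sin(x) in lane 0, cos(x) in lane 1 (assumed layout), upper lanes unused
        return _mm_setr_ps(std::sin(x), std::cos(x), 0.0f, 0.0f);
    }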
I have discovered a great modern revision of Julien Pommier's implementations, ported to AVX/AVX2 under the zlib license thanks to Giovanni Garberoglio:
http://software-lisc.fbk.eu/avx_mathfun/
It works really fast: 80-90M iterations per second on a single core of an i7 3770K, producing 8 sines and 8 cosines per iteration, compared to ~15M iterations per second if I call 8 sinf() and 8 cosf() per iteration (functions from the MSVC 2017 x64 library, compiled with AVX settings).
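For reference, using the AVX port looks roughly like this (function name and types as I read them in avx_mathfun.h; double-check against the header you download):

    #include <immintrin.h>
    #include "avx_mathfun.h"   // defines v8sf (= __m256) and sincos256_ps()

    // Compute 8 sines and 8 cosines at once from 8 packed float arguments.
    void sincos8(const float *in, float *sines, float *cosines)
    {
        v8sf x = _mm256_loadu_ps(in);
        v8sf s, c;
        sincos256_ps(x, &s, &c);       // 8 sin + 8 cos per call
        _mm256_storeu_ps(sines, s);
        _mm256_storeu_ps(cosines, c);
    }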
UPD:
Also, there are the excellent FastTrigo code samples, whose FT::sincos()
function is about 20% faster than Julien Pommier's implementation, and FT::sincos()
provides exactly 10 bits of guaranteed accuracy.
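For context on what a ~10-bit approximation typically looks like, below is not FastTrigo's actual code but the widely known parabola-based sine approximation with one refinement step; it assumes the argument has already been range-reduced to [-pi, pi]:

    #include <immintrin.h>

    // Approximate sin(x) for x in [-pi, pi] to roughly 10-11 bits.
    static inline __m128 sin_parabola_ps(__m128 x)
    {
        const __m128 B = _mm_set1_ps( 1.27323954f);   //  4/pi
        const __m128 C = _mm_set1_ps(-0.40528473f);   // -4/pi^2
        const __m128 P = _mm_set1_ps( 0.225f);        // refinement weight
        const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));

        __m128 absx = _mm_and_ps(x, absmask);                        // |x|
        __m128 y = _mm_add_ps(_mm_mul_ps(B, x),
                              _mm_mul_ps(C, _mm_mul_ps(x, absx)));   // B*x + C*x*|x|
        __m128 absy = _mm_and_ps(y, absmask);                        // |y|
        // y = P*(y*|y| - y) + y squeezes the max error down to ~0.001
        return _mm_add_ps(_mm_mul_ps(P, _mm_sub_ps(_mm_mul_ps(y, absy), y)), y);
    }
    // cos(x) can be obtained as sin(x + pi/2), wrapped back into [-pi, pi].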