
How to compute sincos fast on an x64 CPU?

This question is addressed to users experienced with the SSE/AVX instruction families, particularly those familiar with analyzing their performance. I have seen many different implementations and approaches, ranging from older SSE2 code to newer ones, and the web is flooded with such links. Personally, I am not deeply experienced in analyzing SSE assembly. Some people point to uops and caches, which requires low-level knowledge. So I am asking for hints and your personal experience: if you have time, please share a comparison of what is fastest and why, and which approaches you looked at. The implementation need not be very precise: 10-16 bits of single-precision accuracy is good enough. More is better, as long as it does not affect speed.

PS. To avoid a flood of meta discussion, here is the task described precisely:

  • A scalar argument x (in radians) is passed in an xmm register (according to the x64 fastcall convention).
  • Write a function with the signature __m128 sincos(float x); that returns approximations of sin(x) and cos(x).
  • The return value should fit in one xmm register and be computed as fast as possible while satisfying the 10-bit precision requirement.
  • The argument can be any real number (but not NaN, inf, and so on). If the approach requires argument normalisation, a performant implementation of it (fmod()) is also part of the question. The question is not about handling special FP cases, though.
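
To make the task above concrete, here is a minimal sketch of one possible shape for such a function. It is not taken from any of the libraries discussed here. The name sincos_approx is hypothetical (chosen to avoid clashing with the C library's sincos), the coefficients are plain Taylor terms rather than tuned minimax ones, and the range reduction is done in double for simplicity. It assumes the default round-to-nearest mode and an argument small enough for the quadrant index to fit in an int.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE header */

/* Hypothetical helper: returns sin(x) in lane 0 and cos(x) in lane 1. */
static inline __m128 sincos_approx(float x)
{
    /* Assumption: x is finite and the quadrant index fits in an int. */
    assert(x > -3.0e9f && x < 3.0e9f);

    /* Range reduction in double: k = round(x * 2/pi), r = x - k * pi/2,
       so r lies in roughly [-pi/4, pi/4]. */
    const double xd = (double)x;
    const int k = _mm_cvtsd_si32(_mm_set_sd(xd * 0.63661977236758134));
    const float r = (float)(xd - (double)k * 1.5707963267948966);

    /* Horner evaluation in r^2 with Taylor coefficients:
       lane 0 -> sin(r)/r, lane 1 -> cos(r); lanes 2 and 3 stay zero. */
    const __m128 r2 = _mm_set1_ps(r * r);
    __m128 p = _mm_set_ps(0.0f, 0.0f, -1.0f / 720.0f, -1.0f / 5040.0f);
    p = _mm_add_ps(_mm_mul_ps(p, r2),
                   _mm_set_ps(0.0f, 0.0f, 1.0f / 24.0f, 1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2),
                   _mm_set_ps(0.0f, 0.0f, -0.5f, -1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2),
                   _mm_set_ps(0.0f, 0.0f, 1.0f, 1.0f));
    /* Multiply lane 0 by r to get (sin r, cos r, 0, 0). */
    __m128 v = _mm_mul_ps(p, _mm_set_ps(0.0f, 0.0f, 1.0f, r));

    /* Quadrant fix-up: odd quadrants swap sin and cos ... */
    const unsigned q = (unsigned)k & 3u;
    if (q & 1u)
        v = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 0, 1));
    /* ... then flip signs: sin is negated for q in {2,3}, cos for q in {1,2}. */
    const unsigned sin_sign = (q & 2u) << 30;
    const unsigned cos_sign = ((q + 1u) & 2u) << 30;
    const __m128i sign = _mm_set_epi32(0, 0, (int)cos_sign, (int)sin_sign);
    return _mm_xor_ps(v, _mm_castsi128_ps(sign));
}
```

With these Taylor terms the truncation error on the reduced interval is a few times 1e-6 at most, which is well beyond the 10-bit target. Dropping the highest-order coefficient row should still keep the error below about 4e-4, so a shorter polynomial may be an acceptable trade for speed. The branch on the quadrant and the double-precision reduction are the obvious candidates to replace when tuning.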

This may be a duplicate, but I have failed to find a similar question here, so please point me to it if one already exists.

xakepp35 asked Feb 25 '18 08:02


1 Answer

I have discovered a great modern revision of Julien Pommier's implementations, ported to AVX/AVX2 and released under the zlib license, thanks to Giovanni Garberoglio:

http://software-lisc.fbk.eu/avx_mathfun/

It is really fast: 80-90M iterations per second on a single core of an i7 3770k, producing 8 sines and 8 cosines per iteration, compared to ~15M iterations per second when calling 8 sinf() and 8 cosf() per iteration (functions from the MSVC 2017 x64 library, with AVX compiler settings).


UPD: There are also the excellent FastTrigo code samples, where the FT::sincos() function is 20% faster than Julien Pommier's implementation, and FT::sincos() provides exactly 10 bits of guaranteed accuracy.

xakepp35 answered Nov 16 '22 06:11