_ftol2_sse, are there faster options?

Question

I have code which calls a lot of

int myNumber = (int)(floatNumber);

which takes up, in total, around 10% of my CPU time (according to the profiler). While I could leave it at that, I wondered whether there are faster options, so I searched around and stumbled upon

http://devmaster.net/forums/topic/7804-fast-int-float-conversion-routines/

http://stereopsis.com/FPU.html

I tried implementing the Real2Int() function given there, but it gives me wrong results and runs slower. Now I wonder: are there faster implementations to truncate double / float values to integers, or is the SSE2 version as fast as it gets? The pages I found are fairly old, so they may simply be outdated, and a newer runtime library may already be faster at this.

The current implementation does:

013B1030  call        _ftol2_sse (13B19A0h)

013B19A0  cmp         dword ptr [___sse2_available (13B3378h)],0  
013B19A7  je          _ftol2 (13B19D6h)  
013B19A9  push        ebp  
013B19AA  mov         ebp,esp  
013B19AC  sub         esp,8  
013B19AF  and         esp,0FFFFFFF8h  
013B19B2  fstp        qword ptr [esp]  
013B19B5  cvttsd2si   eax,mmword ptr [esp]  
013B19BA  leave  
013B19BB  ret

Related questions I found:

Fast float to int conversion and floating point precision on ARM (iPhone 3GS/4)

What is the fastest way to convert float to int on x86

Since these are either old or ARM-based, I wonder if there are current ways to do this. One of them notes that the best conversion is one that doesn't happen, but I need the conversion, so avoiding it is not an option.

David Heffernan · Accepted Answer

It's going to be hard to beat that if you are targeting generic x86 hardware. The compiler doesn't know for sure that the target machine has an SSE unit. If it did, it could do what the x64 compiler does and inline a cvttss2si opcode. Since it doesn't, it calls a runtime helper that checks whether SSE2 is available, and that's what the implementation of _ftol2_sse does. What's more, the value is passed in an x87 register and has to be stored to the stack before the SSE instruction can convert it, if an SSE unit is available.

You could tell the x86 compiler to target machines that have SSE units. Then the compiler would indeed emit a simple cvttss2si opcode inline. That's going to be as fast as you can get. But if you run the code on an older machine then it will fail. Perhaps you could supply two versions, one for machines with SSE, and one for those without.
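As a rough illustration of what the inline version amounts to, the same conversion can be requested explicitly through the SSE intrinsics in <xmmintrin.h>. This is only a sketch, not what the compiler is guaranteed to emit. The helper name truncate_to_int and the macro HAVE_SSE_INTRINSICS are made up for this example, and it assumes the CPU it runs on really supports SSE.

```cpp
// Pull in the SSE intrinsics only where they are expected to exist.
// HAVE_SSE_INTRINSICS is a hypothetical macro used just for this sketch.
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_set_ss, _mm_cvttss_si32
#define HAVE_SSE_INTRINSICS 1
#else
#define HAVE_SSE_INTRINSICS 0
#endif

// Hypothetical helper: truncates toward zero, like (int)f.
// Assumption: when the intrinsic path is taken, the machine has SSE;
// on a CPU without SSE this path would fault.
inline int truncate_to_int(float f) {
#if HAVE_SSE_INTRINSICS
    // Load f into the low lane of an SSE register and convert with
    // truncation (this maps to the cvttss2si instruction).
    return _mm_cvttss_si32(_mm_set_ss(f));
#else
    // Fallback: the plain cast, which the compiler lowers however it sees fit.
    return static_cast<int>(f);
#endif
}
```

With this, truncate_to_int(3.7f) gives 3 and truncate_to_int(-3.7f) gives -3, the same as the cast.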

That's not going to gain you all that much. It's just going to avoid the overhead of _ftol2_sse (the call, the availability check and the store to the stack) that happens before you actually reach the conversion opcode that does the work.

To change the compiler settings from the IDE, use Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set. On the command line it is /arch:SSE or /arch:SSE2.
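To check that the setting actually took effect, the compiler's predefined macros can be inspected. A small sketch, assuming the MSVC macro _M_IX86_FP (0 without SSE, 1 for /arch:SSE, 2 for /arch:SSE2 and above) and, for GCC or Clang, __SSE__ and __SSE2__. The function name targeted_sse_level is hypothetical.

```cpp
// Hypothetical helper: reports which SSE level the compiler was told to target.
// 0 = x87 only, 1 = SSE, 2 = SSE2 or higher.
constexpr int targeted_sse_level() {
#if defined(_M_X64) || defined(__x86_64__) || defined(__SSE2__)
    return 2;               // x64 always has SSE2; __SSE2__ is the GCC/Clang macro
#elif defined(_M_IX86_FP)
    return _M_IX86_FP;      // MSVC x86: reflects the /arch setting
#elif defined(__SSE__)
    return 1;               // GCC/Clang with SSE but not SSE2
#else
    return 0;               // no SSE targeting detected
#endif
}
```

A value of 0 in an x86 build would suggest the conversion still goes through the runtime helper.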

Tags:

c++

floating-point

SinisterMJ
