 

Will fixed-point arithmetic be worth my trouble?

I'm working on a fluid dynamics Navier-Stokes solver that should run in real time. Hence, performance is important.

Right now, I'm looking at a number of tight loops that each account for a significant fraction of the execution time: there is no single bottleneck. Most of these loops do some floating-point arithmetic, but there's a lot of branching in between.

The floating-point operations are mostly limited to additions, subtractions, multiplications, divisions and comparisons. All this is done using 32-bit floats. My target platform is x86 with at least SSE1 instructions. (I've verified in the assembler output that the compiler indeed generates SSE instructions.)

Most of the floating-point values that I'm working with have a reasonably small upper bound, and precision for near-zero values isn't very important. So the thought occurred to me: maybe switching to fixed-point arithmetic could speed things up? I know the only way to be really sure is to measure it, but that might take days, so I'd like to know the odds of success beforehand.
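To make the idea concrete, here's roughly the kind of thing I have in mind, a minimal Q16.16 sketch (the format choice and the fx_* helper names are just for illustration):

```cpp
#include <cstdint>

// Q16.16 fixed point: 16 integer bits, 16 fractional bits.
typedef int32_t fx;

inline fx fx_from_float(float f) { return (fx)(f * 65536.0f); }
inline float fx_to_float(fx a)   { return a / 65536.0f; }

// Addition, subtraction and comparison are plain integer operations.
inline fx fx_add(fx a, fx b) { return a + b; }

// Multiplication needs a 64-bit intermediate to avoid overflow, then a
// shift to drop the extra 16 fractional bits.
inline fx fx_mul(fx a, fx b) { return (fx)(((int64_t)a * b) >> 16); }

// Division pre-shifts the numerator for the same reason.
inline fx fx_div(fx a, fx b) { return (fx)(((int64_t)a << 16) / b); }
```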

Fixed-point was all the rage back in the days of Doom, but I'm not sure where it stands anno 2010. Considering how much silicon is nowadays pumped into floating-point performance, is there a chance that fixed-point arithmetic will still give me a significant speed boost? Does anyone have any real-world experience that may apply to my situation?

asked Apr 19 '10 by Thomas



2 Answers

Stick with floating point. Fixed point is really only useful if you can work within 8 bits or 16 bits and use SIMD (image processing and audio are typical use cases for this).
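For a feel of what that narrow fixed-point SIMD looks like (illustrative only; unsigned Q0.16 values, eight per register, and a made-up function name):

```cpp
#include <emmintrin.h>  // SSE2

// For unsigned Q0.16 values, the product (a * b) >> 16 is again Q0.16,
// which is exactly what the "high half of the 16x16 multiply"
// instruction computes -- eight multiplies per instruction.
__m128i q016_mul8(__m128i a, __m128i b) {
    return _mm_mulhi_epu16(a, b);
}
```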

Modern CPUs typically have 2 FPUs and you can issue up to 2 FP instructions per clock cycle. You then also have the possibility of optimisation using 4-way FP SIMD (SSE).
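For concreteness, the 4-way FP SIMD mentioned above looks like this (the function itself and its alignment/multiple-of-4 assumptions are illustrative, not code from the question):

```cpp
#include <xmmintrin.h>  // SSE1

// Add two float arrays four elements at a time. Assumes n is a multiple
// of 4 and the pointers are 16-byte aligned.
void add4(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));
    }
}
```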

If you're still struggling to get good performance then try using a better compiler, such as Intel's ICC. Also, 64-bit Intel executables tend to be somewhat faster than their 32-bit counterparts due to the increased number of registers in the 64-bit model, so build for 64-bit if you can.

And of course you should profile your code too, so that you know for certain where the hotspots are. You don't say what OS you're using, but VTune on Windows, Zoom on Linux, or Shark on Mac OS X will all help you to quickly and easily find your performance bottlenecks.

answered Oct 04 '22 by Paul R


As other people have said, if you're already using floating-point SIMD, I doubt you'll get much improvement with fixed point.

You said that the compiler is emitting SSE instructions, but it doesn't sound like you've tried writing your own vectorized SSE code. I don't know how good compilers usually are at that, but it's something to investigate.

Two other areas to look at are:

  1. Memory access - if all your computations are done in SSE, then cache misses might be taking up more time than the actual math.

    1. You can prefetch data with e.g. _mm_prefetch or __builtin_prefetch (depending on your compiler/platform).
    2. Check your expensive functions for aliasing between inputs and outputs; this can lead to extra memory reads/writes.
    3. Consider storing your data differently - if your fluid solver solves for x coordinates independently of y's, it might be more cache-friendly to store them in different arrays. If they're solved for together, consider interleaving (e.g. x y x y...). There's a sketch combining these three ideas after this list.
  2. Unrolling - you should be able to get a performance benefit from unrolling your inner loops. The goal is not (as many people think) to reduce the number of loop termination checks. The main benefit is to allow independent instructions to be interleaved, to hide the instruction latency; there's a minimal unrolling sketch after this list. There's also a presentation entitled VMX Optimization: Taking it up a Level which might help a bit; it's focused on Altivec instructions on the Xbox 360, but some of the unrolling advice might help on SSE as well.
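For point 1, here's a minimal sketch of those memory-access ideas combined; the kernel, the array names, and the prefetch distance of 64 floats are illustrative assumptions, not tuned values:

```cpp
#include <xmmintrin.h>  // SSE1: _mm_prefetch

// Hypothetical kernel: separate (SoA) arrays for x and y, __restrict
// (a compiler-specific extension) to promise the compiler there is no
// aliasing between inputs and outputs, and a software prefetch a fixed
// distance ahead of the current index.
void update(const float* __restrict x, const float* __restrict y,
            float* __restrict out, int n, float dt) {
    for (int i = 0; i < n; ++i) {
        _mm_prefetch((const char*)(x + i + 64), _MM_HINT_T0);
        _mm_prefetch((const char*)(y + i + 64), _MM_HINT_T0);
        out[i] = x[i] + dt * y[i];
    }
}
```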
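And for point 2, a sketch of latency-oriented unrolling: four independent accumulators keep the additions from forming one long dependency chain. The dot product itself is just an illustration (and it assumes n is a multiple of 4), not code from your solver:

```cpp
// Four independent accumulators let the FP adds overlap in the pipeline
// instead of each one waiting for the previous result.
float dot(const float* a, const float* b, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```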

As other people have mentioned, profile, profile, profile. And then let us know what's still slow :)

PS - on one of your other posts here, I convinced you to use SOR instead of Gauss-Seidel in your matrix solver. Now that I think about it, is there a reason that you're not using a tri-diagonal solver?

answered Oct 04 '22 by celion