 

How to trade off precision for speed when evaluating the sign of a dot product of two vectors in C++? (Not hardware specific)

Say I have two vectors of floats, A and B. I need to find the sign of the dot product of A and B, i.e. sign(A·B): whether it is positive, negative, or 0. The vectors are small, fewer than 100 elements. However, I need to do this extremely fast!

You can assume all elements of A are floats in the range [0, 1] and those of B are in [-500, +500]. I was looking for an exact solution, but an approximate solution will do too, as long as it does not give many wrong answers in practice (I know 'many' is subjective, but I can't put an exact number on it without talking about hardware or implementation).

I explored compiler optimization flags; -O4 worked the fastest. I also explored implementation changes to make the loop parallelizable via the auto-vectorization support of the underlying processor. For example, for the AVX instruction set, keep 8 independent accumulators for the dot product so that the full register capacity is utilized. But I think we can still be much faster! The basic idea is that we only need to determine the sign of the dot product, so there is a lot of scope to trade precision for speed. So I am trying to come up with some mathematical or algorithmic solution to make this tradeoff happen. One idea I have is to use the FFT (fast Fourier transform) to reduce the number of multiplications. Another idea I tried to explore was bitwise tricks, but I realized bitwise manipulation of floats is not practical. (IEEE semantics are not guaranteed when you use fast flags like -Ofast or -O3.)

You might be wondering why this is such an important thing to optimize for such a minor task, but I think it can be a very useful question:

  • Creative solutions to this problem can be generalized to other situations that require a precision-for-speed tradeoff.
  • The sign of a dot product is a widely applicable sub-problem that shows up in a dozen scenarios (think complex-number manipulation, hyperplane tests in several ML algorithms, etc.)
Xi chan asked Jun 13 '19 09:06

1 Answer

On modern architectures, the floating-point calculation of a dot product is already very fast: roughly 1 cycle is spent on an addition and 1-2 cycles on a multiplication.

  1. Performance can only matter when a lot of dot products are calculated.
  2. Typically this implies that a lot of data has to be read.
  3. This implies the runtime will be dominated by memory bandwidth.
  4. This means big performance gains can only be made by using smaller floating-point types, i.e. 32-bit or 16-bit floating-point numbers.
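As a sketch of this suggestion: since only the sign is needed, the sum can be accumulated in 32-bit floats (halving memory traffic versus doubles) and classified against a small tolerance. The function name `dot_sign` and the tolerance `eps` are assumptions of mine, not part of the answer; `eps` would need to be tuned to the rounding error expected for up to 100 terms in the stated value ranges.

```cpp
#include <cstddef>

// Sketch: compute the dot product in 32-bit floats and return only its
// sign: +1, -1, or 0. Values within +/- eps of zero are treated as 0,
// since float rounding error makes an exact zero test unreliable.
int dot_sign(const float* a, const float* b, std::size_t n,
             float eps = 1e-4f) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    if (sum > eps)  return 1;
    if (sum < -eps) return -1;
    return 0;
}
```

With the stated ranges (A in [0, 1], B in [-500, +500]) and n < 100, the exact sum lies in roughly [-50000, +50000], so a fixed absolute tolerance is a plausible, if crude, choice.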
Peter G. answered Sep 17 '22 04:09