 

Why are math libraries often compared by FLOPS?

Math libraries are very often compared based on FLOPS. What information is being conveyed to me when I'm shown a plot of FLOPS vs size with sets of points for several different math libraries?

FLOPS as a measure of performance would make more sense to me if the comparison were between two implementations of the same algorithm, or between the same software on two different pieces of hardware. I don't understand why it's an appropriate or popular way to compare things like matrix-matrix multiplication.

Is the implication just that the underlying algorithms are nearly the same and the code that feeds the floating point units the fastest by minimizing overhead wins?

Examples abound.

http://eigen.tuxfamily.org/index.php?title=Benchmark

https://code.google.com/p/blaze-lib/wiki/Benchmarks

https://software.intel.com/en-us/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function

On the other hand, these LAPACK and Armadillo benchmarks use absolute time for a given operation, which makes more sense to me.

http://www.netlib.org/lapack/lug/node71.html

http://arma.sourceforge.net/speed.html

Relevant:

What is FLOP/s and is it a good measure of performance?

asked May 22 '15 by Praxeolitic

2 Answers

Typically people compare math libraries in order to select the one that minimizes the runtime of their program. For such benchmarks, two things to consider are the performance of the libraries on a given input and whether that input is representative of your use case.

If we assume every library performs the same number of floating point operations for a given task (e.g. vector scaling), then one would expect the library with the highest FLOPS to finish first.

The assumption that each library performs the same number of floating point operations is reasonable in some cases. But it is entirely possible that two libraries require a different number of floating point operations for the same task (such as matrix-matrix multiplication). If that happens, a library can report a lower FLOPS figure yet still finish in less time, simply because it does less arithmetic overall. In those cases, total runtime is the more meaningful thing to look at. If the authors publish comparisons in FLOPS, that implies either that they believe every library performs the same number of operations in total, or that they are simply dividing the theoretical operation count for the task by the total runtime (which is also common). You'd want to check the benchmark methodology.
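
As a concrete illustration of that second convention, here is a minimal sketch (not taken from any of the benchmarks linked above) that times a naive matrix-matrix multiply and reports GFLOPS using the standard nominal count of 2*n^3 operations; a real benchmark would call the library under test instead of the hand-written loop:

    // Minimal sketch: time a naive matrix-matrix multiply and report GFLOPS
    // using the conventional 2*n^3 operation count. The naive triple loop is
    // only a stand-in; a real benchmark would call the library under test.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 512;
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t j = 0; j < n; ++j)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
        auto t1 = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(t1 - t0).count();
        double ops = 2.0 * n * n * n;  // nominal flop count for an n x n GEMM
        std::printf("n = %zu: %.3f s, %.2f GFLOPS\n", n, seconds, ops / seconds / 1e9);
    }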

The purpose of comparing performance (e.g. FLOPS) vs size is to help people understand performance on input that is representative of their use case. If you know you'll be working with lots of small vectors, say fewer than 10 elements, then you don't care how fast the library is on vectors 1 GB in size, and you don't want those inputs to influence the comparison.

Generally, counting FLOPS has been popular (maybe in part because it is easy to explain to mathematicians). I suppose one motivation is that saying "you can scale a size=10 vector at 10000 FLOPS but a size=100 vector at 100 FLOPS" is easier to digest than saying "you can scale a size=10 vector in 0.001 seconds but a size=100 vector in 1 second." If you report total runtime, you'll likely want to normalize by the input size for comparisons; see the small sketch below.
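
For what it's worth, the conversion behind those figures is just operation count divided by runtime (scaling a length-n vector costs n multiplies); a tiny sketch using the made-up numbers from the sentence above:

    // The conversion from the sentence above: vector scaling of length n costs
    // n multiplies, so FLOPS = n / runtime. The numbers are the made-up ones
    // from the text, not measurements.
    #include <cstdio>

    double flops(double n_ops, double seconds) { return n_ops / seconds; }

    int main() {
        std::printf("size 10 in 0.001 s -> %.0f FLOPS\n", flops(10, 0.001));  // 10000
        std::printf("size 100 in 1 s    -> %.0f FLOPS\n", flops(100, 1.0));   // 100
    }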

answered Sep 20 '22 by hazydev

In high performance computing, one aim is often to utilise as much of the hardware capability as possible in the minimum time. This minimises the time spent (by humans, or by other time-sensitive systems) waiting for results. In large computing facilities, the operating costs (power consumed, manpower for maintenance, etc) are often - approximately - constant with time, so time for calculations translates directly to the bottom line (money paid to do the calculations).

FLOPS gives a measure of how well the algorithm is utilising the CPU. Measured FLOPS for an algorithm, divided by the peak FLOPS that CPU is capable of, gives a fraction between 0 and 1. The closer to 1, the more efficiently the algorithm uses the CPU, which translates into bang for buck on that type of CPU (i.e. the algorithm produces a solution faster, so the net cost is less).
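
A rough sketch of that fraction: the peak below is an assumed example (cores * clock * FLOPs per cycle for a hypothetical AVX2/FMA chip), and the measured figure is hypothetical too, not a number for any particular CPU or library.

    // Rough sketch of the efficiency fraction: measured FLOPS / peak FLOPS.
    // Both figures below are assumptions for illustration only.
    #include <cstdio>

    int main() {
        double cores = 8, ghz = 3.0, flops_per_cycle = 16;  // assumed hardware
        double peak = cores * ghz * 1e9 * flops_per_cycle;  // theoretical peak FLOPS
        double measured = 250e9;                            // e.g. from a GEMM benchmark
        std::printf("peak: %.0f GFLOPS, efficiency: %.2f\n", peak / 1e9, measured / peak);
    }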

The result is specific to the CPU (instruction set) and the algorithm. If an algorithm yields a low fraction on a particular CPU, it is not utilising that CPU well. That can drive the choice of a different algorithm, different compilation settings (e.g. to optimise differently, or to pick different instructions), or a server farm on which the algorithm runs more efficiently, and so on. For large calculations that are repeated regularly (say, every day), the cost benefit of using an algorithm that utilises the CPU efficiently over one that does not can be substantial.

answered Sep 23 '22 by Peter