c++ practical computational complexity of <cmath> SQRT()

Tags: c++, complexity-theory, sqrt, cpu-cycles, cpu-time

What is the difference in CPU cycles (or, in essence, in 'speed') between

 x /= y;

and

 #include <cmath>
 x = sqrt(y);

EDIT: I know the operations aren't equivalent, I'm just arbitrarily proposing x /= y as a benchmark for x = sqrt(y)

asked Jul 30 '11 16:07

Matt Munson

1 Answer

The answer to your question depends on your target platform. Assuming you are using one of the most common x86 CPUs, I can give you this link: http://instlatx64.atw.hu/ It is a collection of measured instruction latencies (how long the CPU takes to produce a result once it has the arguments) and of how the instructions are pipelined, for many x86 and x86_64 processors. If your target is not x86, you can try to measure the cost yourself or consult your CPU documentation.
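If you want to measure the cost yourself, a rough sketch could look like the following. It times a dependent chain of operations, so that each iteration has to wait for the previous result and the loop reflects latency rather than throughput. The names ChainResult, time_chain, time_sqrt_chain and time_div_chain are made up for this illustration. The sketch reports wall-clock nanoseconds, not CPU ticks, both chains carry one extra addition, and n is assumed to be greater than zero.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical helper type for this sketch.
struct ChainResult {
    double ns_per_op;    // average wall-clock nanoseconds per iteration
    double final_value;  // last value of the chain, returned so the work is not discarded
};

// Runs x = op(x) n times and reports the average time per iteration.
// Every iteration depends on the previous one, so the CPU cannot overlap them.
template <typename Op>
ChainResult time_chain(Op op, double start_value, std::size_t n) {
    volatile double seed = start_value;  // volatile read: the compiler cannot precompute the chain
    double x = seed;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        x = op(x);
    }
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = t1 - t0;
    return {elapsed.count() / static_cast<double>(n), x};
}

// Chain of square roots: x = sqrt(x + 1) settles near 1.618.
ChainResult time_sqrt_chain(std::size_t n) {
    return time_chain([](double x) { return std::sqrt(x + 1.0); }, 1.0, n);
}

// Chain of divisions: x = 1 / (x + 1) settles near 0.618.
ChainResult time_div_chain(std::size_t n) {
    return time_chain([](double x) { return 1.0 / (x + 1.0); }, 1.0, n);
}
```

Calling time_sqrt_chain(10000000).ns_per_op and time_div_chain(10000000).ns_per_op and comparing the two numbers gives a first impression. Build with optimization enabled, and treat the result as an estimate, since timer overhead and frequency scaling also affect it.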

First, you should get a disassembly of your operations (from the compiler, e.g. with gcc: gcc file.c -O3 -S -o file.asm, or by disassembling the compiled binary, e.g. with the help of a debugger). Remember that your operation also loads and stores a value, and that cost must be counted in addition.
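To make the two operations easy to find in the compiler output, one option is to put each in its own small function. This is only a sketch: the file name ops.cpp and the function names div_op and sqrt_op are assumptions made for this example.

```cpp
#include <cmath>

// extern "C" keeps the symbol names unmangled, so they are easy to spot
// in the generated assembly listing.
extern "C" double div_op(double x, double y) {
    return x / y;  // the division from the question
}

extern "C" double sqrt_op(double y) {
    return std::sqrt(y);  // the square root from the question
}
```

Compiling this with g++ -O3 -S ops.cpp -o ops.s and searching the listing for div_op and sqrt_op shows which instructions were chosen. On x86-64 you would typically expect to see divsd and sqrtsd, but that depends on the compiler, its flags and the target.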

Here are two examples from friweb.hu:

For the Core 2 Duo E6700, the latency (L) of SQRT (the x87, SSE and SSE2 versions alike) is:

29 ticks for 32-bit float; 58 ticks for 64-bit double; 69 ticks for 80-bit long double

and of DIVIDE (of floating-point numbers):

18 ticks for 32-bit; 32 ticks for 64-bit; 38 ticks for 80-bit

For newer processors, the cost is less and is almost the same for DIV and for SQRT, e.g. for Sandy Bridge Intel CPU:

Floating-point SQRT is

14 ticks for 32 bit; 21 ticks for 64 bit; 24 ticks for 80 bit

Floating-point DIVIDE is

14 ticks for 32 bit; 22 ticks for 64 bit; 24 ticks for 80 bit

By the figures above, SQRT is even a tick faster than DIVIDE for 64-bit.

So: On older CPUs, sqrt itself is noticeably slower than fdiv: on the Core 2 Duo above it takes roughly 1.6 to 1.8 times as long (equivalently, fdiv needs about 40-45% fewer ticks). On newer CPUs the cost is practically the same. On newer CPUs both operations are also cheaper than they were on older CPUs. Longer floating-point formats need more time: 64-bit takes about 1.5x to 2x as long as 32-bit, while 80-bit is only slightly more expensive than 64-bit.

Also, newer CPUs have vector operations (SSE, SSE2, AVX) that run at the same speed as the scalar (x87) ones. A vector holds 2-4 values of the same type. If you can arrange your loop so that it applies the same operation to several FP values at once, you will get more performance out of the CPU.
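As an illustration of such a loop, here is a minimal sketch in which every iteration is independent of the others, which is the shape a compiler can turn into vector instructions. The function name sqrt_all is made up for this example. Whether the loop actually gets vectorized depends on the compiler and its flags; with GCC, for instance, an option such as -fno-math-errno may be needed in addition to -O3.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise square root. No iteration reads a result produced by another
// iteration, so several elements can be processed by one vector instruction.
void sqrt_all(const std::vector<float>& in, std::vector<float>& out) {
    out.resize(in.size());
    const float* src = in.data();
    float* dst = out.data();
    for (std::size_t i = 0; i < in.size(); ++i) {
        dst[i] = std::sqrt(src[i]);
    }
}
```

Checking the disassembly for packed instructions (for example sqrtps instead of sqrtss on x86) is a simple way to confirm what the compiler did.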

answered Oct 26 '22 12:10

osgx
