Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How slow (how many cycles) is calculating a square root?

Tags:

performance

How slow (how many cycles) is calculating a square root? This came up in a molecular dynamics course where efficiency is important and taking unnecessary square roots had a noticeable impact on the running time of the algorithms.

like image 423
Anonymous Avatar asked Oct 11 '11 09:10

Anonymous


People also ask

Is Square Root slow?

sqrt() is "slow," not slow. Fundamentally, the operation is more complex than an addition or division.

Why is Square Root slow?

The square root function uses Newton's method to iteratively calculate the square root. It converges quadratically. Nothing will speed that up. The other functions, exp() and ln(x), have implementations that have their own convergence/complexity issues.


1 Answers

Based on Agner Fog's instruction tables for Core 2 65nm when comparing the SSE performance with FSQRT, FDIV, FMUL and FADD, it is about equal, but looks faster because it can't do 80bit math. SSE has a super fast approximate reciprocal and approximate reciprocal sqrt though.

On Core2 45nm, FSQRT and FDIV root got faster, while FADD and FMUL haven't changed. Once again SSE performance is about the same.

Intel Core 2 (Merom, 65nm)

Instruction Operands Latency Reciprocal
throughput
FSQRT 6 - 69
FADD(P) r 3 1
FMUL(P) r 5 2
FDIV(R)(P) r 6 - 38 d 5 - 37 d
ADDSS/D xmm, xmm 3 1
ADDPS/D xmm, xmm 3 1
MULSS xmm, xmm 4 1
MULSD xmm, xmm 5 1
MULPS xmm, xmm 4 1
MULPD xmm, xmm 5 1
DIVSS xmm, xmm 6 - 18 d 5 - 17 d
DIVSD xmm, xmm 6 - 32 d 5 - 31 d
DIVPS xmm, xmm 6 - 18 d 5 - 17 d
DIVPD xmm, xmm 6 - 32 d 5 - 31 d
SQRTSS/PS xmm, xmm 6 - 29 6 - 29
SQRTSD/PD xmm, xmm 6 - 58 6 - 58
RSQRTSS/PS xmm, xmm 3 2

Intel Core 2 (Wolfdale, 45nm)

Instruction Operands Latency Reciprocal
throughput
FSQRT 6 - 20
FADD(P) r 3 1
FMUL(P) r 5 2
FDIV(R)(P) r 6 - 21 d 5 - 20 d
ADDSS/D xmm, xmm 3 1
ADDPS/D xmm, xmm 3 1
MULSS xmm, xmm 4 1
MULSD xmm, xmm 5 1
MULPS xmm, xmm 4 1
MULPD xmm, xmm 5 1
DIVSS xmm, xmm 6 - 13 d 5 - 12 d
DIVSD xmm, xmm 6 - 21 d 5 - 20 d
DIVPS xmm, xmm 6 - 13 d 5 - 12 d
DIVPD xmm, xmm 6 - 21 d 5 - 20 d
SQRTSS/PS xmm, xmm 6 - 13 5 - 12
SQRTSD/PD xmm, xmm 6 - 20 5 - 19
RSQRTSS/PS xmm, xmm 3 2

The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower than the values published elsewhere.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

d Round divisors or low precision give low values.

like image 153
harold Avatar answered Sep 28 '22 10:09

harold