How slow (how many cycles) is calculating a square root?

Tags:

performance

How slow (how many cycles) is calculating a square root? This came up in a molecular dynamics course where efficiency is important and taking unnecessary square roots had a noticeable impact on the running time of the algorithms.

423

asked Oct 11 '11 09:10

Anonymous

1 Answers

Based on Agner Fog's instruction tables for Core 2 65nm when comparing the SSE performance with FSQRT, FDIV, FMUL and FADD, it is about equal, but looks faster because it can't do 80bit math. SSE has a super fast approximate reciprocal and approximate reciprocal sqrt though.

On Core2 45nm, FSQRT and FDIV root got faster, while FADD and FMUL haven't changed. Once again SSE performance is about the same.

Intel Core 2 (Merom, 65nm)

Instruction Operands Latency Reciprocal
throughput

FSQRT 6 - 69

FADD(P) r 3 1

FMUL(P) r 5 2

FDIV(R)(P) r 6 - 38 ^d 5 - 37 ^d

ADDSS/D xmm, xmm 3 1

ADDPS/D xmm, xmm 3 1

MULSS xmm, xmm 4 1

MULSD xmm, xmm 5 1

MULPS xmm, xmm 4 1

MULPD xmm, xmm 5 1

DIVSS xmm, xmm 6 - 18 ^d 5 - 17 ^d

DIVSD xmm, xmm 6 - 32 ^d 5 - 31 ^d

DIVPS xmm, xmm 6 - 18 ^d 5 - 17 ^d

DIVPD xmm, xmm 6 - 32 ^d 5 - 31 ^d

SQRTSS/PS xmm, xmm 6 - 29 6 - 29

SQRTSD/PD xmm, xmm 6 - 58 6 - 58

RSQRTSS/PS xmm, xmm 3 2

Intel Core 2 (Wolfdale, 45nm)

Instruction Operands Latency Reciprocal
throughput

FSQRT 6 - 20

FADD(P) r 3 1

FMUL(P) r 5 2

FDIV(R)(P) r 6 - 21 ^d 5 - 20 ^d

ADDSS/D xmm, xmm 3 1

ADDPS/D xmm, xmm 3 1

MULSS xmm, xmm 4 1

MULSD xmm, xmm 5 1

MULPS xmm, xmm 4 1

MULPD xmm, xmm 5 1

DIVSS xmm, xmm 6 - 13 ^d 5 - 12 ^d

DIVSD xmm, xmm 6 - 21 ^d 5 - 20 ^d

DIVPS xmm, xmm 6 - 13 ^d 5 - 12 ^d

DIVPD xmm, xmm 6 - 21 ^d 5 - 20 ^d

SQRTSS/PS xmm, xmm 6 - 13 5 - 12

SQRTSD/PD xmm, xmm 6 - 20 5 - 19

RSQRTSS/PS xmm, xmm 3 2

The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower than the values published elsewhere.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

^d Round divisors or low precision give low values.

Instruction	Operands	Latency	Reciprocal throughput
FSQRT		6 - 69
FADD(P)	r	3	1
FMUL(P)	r	5	2
FDIV(R)(P)	r	6 - 38 ^d	5 - 37 ^d
ADDSS/D	xmm, xmm	3	1
ADDPS/D	xmm, xmm	3	1
MULSS	xmm, xmm	4	1
MULSD	xmm, xmm	5	1
MULPS	xmm, xmm	4	1
MULPD	xmm, xmm	5	1
DIVSS	xmm, xmm	6 - 18 ^d	5 - 17 ^d
DIVSD	xmm, xmm	6 - 32 ^d	5 - 31 ^d
DIVPS	xmm, xmm	6 - 18 ^d	5 - 17 ^d
DIVPD	xmm, xmm	6 - 32 ^d	5 - 31 ^d
SQRTSS/PS	xmm, xmm	6 - 29	6 - 29
SQRTSD/PD	xmm, xmm	6 - 58	6 - 58
RSQRTSS/PS	xmm, xmm	3	2

Instruction	Operands	Latency	Reciprocal throughput
FSQRT		6 - 20
FADD(P)	r	3	1
FMUL(P)	r	5	2
FDIV(R)(P)	r	6 - 21 ^d	5 - 20 ^d
ADDSS/D	xmm, xmm	3	1
ADDPS/D	xmm, xmm	3	1
MULSS	xmm, xmm	4	1
MULSD	xmm, xmm	5	1
MULPS	xmm, xmm	4	1
MULPD	xmm, xmm	5	1
DIVSS	xmm, xmm	6 - 13 ^d	5 - 12 ^d
DIVSD	xmm, xmm	6 - 21 ^d	5 - 20 ^d
DIVPS	xmm, xmm	6 - 13 ^d	5 - 12 ^d
DIVPD	xmm, xmm	6 - 21 ^d	5 - 20 ^d
SQRTSS/PS	xmm, xmm	6 - 13	5 - 12
SQRTSD/PD	xmm, xmm	6 - 20	5 - 19
RSQRTSS/PS	xmm, xmm	3	2

153

answered Sep 28 '22 10:09

harold

Related questions
                            
                                How do I create a DESC index in MySQL?
                            
                                Aggregate vs Sum Performance in LINQ
                            
                                C# List<T>.ToArray performance is bad?
                            
                                Android Studio got very slow and constantly "updating indices"
                            
                                Python string 'join' is faster (?) than '+', but what's wrong here?
                            
                                How long does it take for you to be comfortable with Haskell?
                            
                                Can not increase max_open_files for Mysql max-connections in Ubuntu 15
                            
                                Performance of Google's Go?
                            
                                ?: Operator Vs. If Statement Performance
                            
                                What is the most performant way for dynamic styling in React-Native?
                            
                                Improve pandas (PyTables?) HDF5 table write performance
                            
                                How well does Git LFS handle small files?
                            
                                Branchless internal merge slower than internal merge with branch
                            
                                Why does JIT order affect performance?
                            
                                How to speed up Android Studio compilation process
                            
                                What is the benefit of setting java.awt.headless=true?
                            
                                Core Graphics Performance on iOS
                            
                                Are there any good Clojure benchmarks?
                            
                                Android: Use UUID as primary key in SQLite
                            
                                Is there any runtime cost for Casting in Java?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How slow (how many cycles) is calculating a square root?

Tags:

performance

Anonymous

People also ask

1 Answers

Intel Core 2 (Merom, 65nm)

Intel Core 2 (Wolfdale, 45nm)

harold

Recent Activity

Donate For Us