Just curious about how the standard sqrt() from math.h on GCC works. I coded my own sqrt() using Newton-Raphson to do it!
Let N be the number whose square root you want. One Newton-Raphson step is root = 0.5 * (X + N / X), where X is the current guess (you can start with X = N or X = 1). Each step gives a better approximation of the square root of N, so you feed root back in as the next X and repeat until the value stops changing.
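A minimal C sketch of that iteration (the function name nr_sqrt, the starting-guess argument and the stopping tolerance are my own illustrative choices, not anything from libm):

```c
#include <math.h>   /* fabs */

/* Newton-Raphson square root: repeatedly average X and N/X.
   Illustrative only -- assumes n >= 0 and a positive starting guess. */
double nr_sqrt(double n, double x)   /* x = initial guess, e.g. n or 1 */
{
    if (n == 0.0)
        return 0.0;
    double root = 0.5 * (x + n / x);
    while (fabs(root - x) > 1e-12 * root) {  /* crude relative tolerance */
        x = root;
        root = 0.5 * (x + n / x);
    }
    return root;
}
```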
For an integer floor square root: create a counter i and handle the base cases first, i.e. when the given number n is 0 or 1. Then run a loop while i*i <= n, incrementing i by 1 each iteration. When the loop ends, the floor of the square root is i - 1.
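As a C sketch (purely an illustration of the linear search described above, not how any real library does it; floor_sqrt is a made-up name):

```c
/* Floor of sqrt(n) by linear search, exactly as described above.
   O(sqrt(n)) time -- fine as an illustration, far too slow for real use. */
unsigned int floor_sqrt(unsigned int n)
{
    if (n == 0 || n == 1)                     /* base cases */
        return n;
    unsigned int i = 1;
    while ((unsigned long long)i * i <= n)    /* widen product to avoid overflow */
        i++;
    return i - 1;                             /* floor of the square root */
}
```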
Yeah, I know about fsqrt. But how does the CPU do it? I can't debug hardware.
Typical div/sqrt hardware in modern CPUs uses a power-of-2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; it looks like the general idea is to feed a result back through a multiplier and adder iteratively, basically like long division. And I think it's similar to how you'd do bit-at-a-time division in software (see the sketch below).
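To get a feel for that long-division-style structure in software, here is the classic radix-2 (one result bit per iteration) integer square root in C. The hardware units above retire 4 or 10 result bits per step instead of 1, but the shift-and-trial-subtract idea is the same; this is purely an illustration, not the actual ALU algorithm:

```c
#include <stdint.h>

/* Radix-2 digit-by-digit square root: decides one result bit per
   iteration, most significant bit first, by trial subtraction --
   structurally similar to long division. */
uint32_t isqrt32(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit = 1u << 30;      /* highest power of 4 that fits in 32 bits */

    while (bit > n)               /* skip leading zero "digits" */
        bit >>= 2;

    while (bit != 0) {
        if (n >= root + bit) {    /* does this bit fit into the result? */
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;                  /* floor(sqrt(original n)) */
}
```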
Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.
Maybe also see
But however the hardware works, IEEE 754 requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special; other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near multiples of Pi, where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)
See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication.
For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.
Unlike most instructions, sqrt performance is typically data-dependent. (Usually, more significant bits or a larger magnitude of the result takes longer.)
sqrt is defined by C, so most likely you have to look in glibc.
You did not specify which architecture you are asking for, so I think it's safe to assume x86-64. If that's the case, they are defined in:
tl;dr they are simply implemented by calling the x86-64 square root instructions sqrtss / sqrtsd:
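Roughly, such a wrapper is a one-instruction function. A sketch of the shape (not the verbatim glibc source, which also has aliasing and error-handling plumbing around the core):

```c
/* Sketch of a sqrt() that just issues the hardware instruction
   (GCC extended asm, x86-64). */
double sqrt_via_sqrtsd(double x)
{
    double res;
    asm ("sqrtsd %1, %0" : "=x" (res) : "xm" (x));
    return res;
}
```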
Furthermore, and just for the sake of discussion, if you enable fast-math (something you probably should not do if you care about result precision), you will see that most compilers will actually inline the call and directly emit the sqrtss / sqrtsd instructions:
https://godbolt.org/z/Wb4unC
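For example, compiling something like the following with gcc -O2 -ffast-math on x86-64 gives just sqrtsd %xmm0, %xmm0 followed by ret, with no call into libm (without fast-math GCC still uses sqrtsd, but keeps a guarded call to sqrt() so errno can be set for negative inputs):

```c
#include <math.h>

double f(double x)
{
    return sqrt(x);   /* -O2 -ffast-math: compiles to a single sqrtsd */
}
```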