floating point operations per cycle - intel

I have been looking for quite a while and cannot find an official or conclusive figure for the number of single-precision floating-point operations per clock cycle that an Intel Xeon quad-core can complete. I have an Intel Xeon quad-core E5530 CPU.

I'm hoping to use it to calculate the maximum theoretical FLOP/s my CPU can achieve.

Max FLOP/s = (number of cores) * (clock frequency in cycles/sec) * (FLOPs/cycle)

Anything pointing me in the right direction would be useful. I have found this question, "FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2", which gives the following figures:

Intel Core 2 and Nehalem:

4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication

8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

But I'm not sure where these figures were found. Are they assuming a fused multiply add (FMAD) operation?

EDIT: Using these figures, I calculate the same DP arithmetic throughput that Intel cites, 38.4 GFLOP/s (cited here). For SP I get double that, 76.8 GFLOP/s. I'm pretty sure 4 DP FLOP/cycle and 8 SP FLOP/cycle are correct; I just want confirmation of how the FLOPs/cycle values of 4 and 8 were derived.
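The arithmetic in the edit above can be reproduced with a short sketch. The 2.4 GHz clock is an assumption (it is taken here as the E5530 base clock and is not stated in the post), and the function names are made up for illustration.

```python
# Sketch of: peak FLOP/s = cores * clock (cycles/s) * FLOP/cycle per core.
# ASSUMPTION: 2.4 GHz is used as the E5530 base clock; the post does not state it.

SSE_REGISTER_BITS = 128  # width of one SSE register


def flops_per_cycle(element_bits, vector_instructions_per_cycle=2):
    """FLOP/cycle for one core: lanes per SSE register times the number of
    vector instructions completed per cycle (one add plus one multiply)."""
    lanes = SSE_REGISTER_BITS // element_bits
    return lanes * vector_instructions_per_cycle


def peak_gflops(cores, clock_ghz, flop_per_cycle):
    """Theoretical peak in GFLOP/s for the whole chip."""
    return cores * clock_ghz * flop_per_cycle


dp = peak_gflops(4, 2.4, flops_per_cycle(64))  # double precision, 64-bit lanes
sp = peak_gflops(4, 2.4, flops_per_cycle(32))  # single precision, 32-bit lanes
print(f"DP peak: {dp:.1f} GFLOP/s")
print(f"SP peak: {sp:.1f} GFLOP/s")
```

With these inputs the sketch gives 4 and 8 FLOP/cycle per core, matching the 38.4 and 76.8 GFLOP/s figures quoted above.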

asked Apr 21 '14 18:04

user3495341

1 Answer

Nehalem can execute 4 DP or 8 SP FLOP/cycle per core. It does this with SSE, which operates on packed floating-point values: 2 per register in DP and 4 per register in SP. To reach 4 DP FLOP/cycle or 8 SP FLOP/cycle, the core has to execute 2 SSE instructions per cycle: a MULPD and an ADDPD (or a MULPS and an ADDPS). No fused multiply-add is involved. This is possible because Nehalem has separate execution units for SSE multiply and SSE add, and these units are pipelined so that the throughput is one multiply and one add per cycle.

Multiplies spend 4 cycles in the multiplier pipeline in SP and 5 cycles in DP. Adds spend 3 cycles in the pipeline regardless of precision. The number of cycles in the pipeline is known as the latency. To compute peak FLOP/cycle, all you need to know is the throughput. With a throughput of 1 SSE vector instruction/cycle for both the multiplier and the adder (2 execution units), you have 2 x 2 = 4 FLOP/cycle in DP and 2 x 4 = 8 FLOP/cycle in SP.

To actually sustain this peak throughput you need to consider latency (so that you have at least as many independent operations in flight as the pipeline is deep), and you need to be able to feed the data fast enough. Nehalem has an integrated memory controller capable of very high memory bandwidth, which it can achieve if the data prefetcher correctly anticipates the access pattern (sequentially loading from memory is a trivial pattern that it can anticipate). Typically there isn't enough memory bandwidth to feed all cores with data at peak FLOP/cycle, so some reuse of data from the cache is necessary in order to sustain the peak.
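To get a rough feel for how much cache reuse that implies, here is a deliberately crude sketch. It assumes every FLOP would otherwise need one fresh operand from memory. The bandwidth value is a placeholder and not a figure from this answer, so substitute a measured number for your own system. The function name is hypothetical.

```python
# Crude model: if each FLOP needed one fresh operand from memory, how many
# times must each loaded operand be reused from cache to sustain peak FLOP/s?
# ASSUMPTIONS (not from the answer): the one-operand-per-FLOP model, and the
# 25.6 GB/s bandwidth, which is only a placeholder for a measured value.


def required_reuse(peak_gflops, mem_bandwidth_gbs, bytes_per_operand,
                   operands_per_flop=1.0):
    """Ratio of the bandwidth the FLOPs would demand without any reuse to the
    bandwidth the memory system can deliver."""
    demand_gbs = peak_gflops * operands_per_flop * bytes_per_operand
    return demand_gbs / mem_bandwidth_gbs


# 76.8 GFLOP/s is the SP peak from the question; SP operands are 4 bytes.
reuse = required_reuse(peak_gflops=76.8, mem_bandwidth_gbs=25.6,
                       bytes_per_operand=4)
print(f"each operand must be reused about {reuse:.0f} times")
```

A ratio well above 1 is the point of the paragraph above: streaming data once from memory cannot keep the adder and multiplier busy, so kernels that approach peak are typically blocked to work out of cache.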

Details on where to find the number of independent execution units, and their throughput and latency in cycles, follow.

See section 8.9, "Execution units" (page 105), of this document:

http://www.agner.org/optimize/microarchitecture.pdf

It says of Nehalem:

The floating point multiplier on port 0 has a latency of 4 for single precision and 5 for double and long double precision. The throughput of the floating point multiplier is 1 operation per clock cycle, except for long double precision on Core2. The floating point adder is connected to port 1. It has a latency of 3 and is fully pipelined.

To get 8 SP FLOP/cycle you need 4 SP ADD/cycle and 4 SP MUL/cycle. The adder and the multiplier are separate execution units that dispatch from separate ports, and each can operate on 4 packed SP operands simultaneously using SSE packed (vector) instructions (4 x 32 bits = 128 bits). Both have a throughput of 1 operation per clock cycle. To reach that throughput you need to consider the latency, that is, how many cycles after an instruction issues before you can use its result. You therefore have to issue several independent instructions to cover the latency. The single-precision multiplier has a latency of 4 and the adder a latency of 3.
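The relationship between latency and the number of independent instructions can be sketched as follows. The latency and throughput values are the ones quoted above. The helper names and the simple chain model (each dependency chain issues its next operation only once the previous result is ready) are assumptions made for illustration.

```python
# Latencies in cycles for Nehalem, as quoted above; throughput is 1 op/cycle.
LATENCY = {
    ("mul", "sp"): 4,
    ("mul", "dp"): 5,
    ("add", "sp"): 3,
    ("add", "dp"): 3,
}
THROUGHPUT_PER_CYCLE = 1


def independent_ops_needed(unit, precision):
    """Operations that must be in flight to keep a pipelined unit busy:
    latency (cycles) times throughput (ops per cycle)."""
    return LATENCY[(unit, precision)] * THROUGHPUT_PER_CYCLE


def utilization(chains, latency):
    """Fraction of cycles the unit does useful work when `chains` independent
    dependency chains each issue one operation every `latency` cycles."""
    return min(1.0, chains / latency)


for (unit, precision), latency in LATENCY.items():
    needed = independent_ops_needed(unit, precision)
    print(f"{unit} {precision}: {needed} independent ops, "
          f"single chain runs at {utilization(1, latency):.0%} of peak")
```

In practice this is why peak-rate kernels keep several independent accumulators per execution unit: a single dependent chain of SP multiplies would, under this model, run at a quarter of the multiplier's throughput.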

You can find the same throughput and latency numbers for Nehalem in the Intel optimization guide, table C-15a:

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

answered Sep 21 '22 19:09

amdn

Tags: cpu-architecture, cpu, intel, flops, nehalem
