I cannot find any information on agner.org about the latency or throughput of the RDRAND instruction. However, this processor exists, so the information must be out there.
Edit: Actually, the newest optimization manual mentions this instruction. It is documented as taking <200 cycles, with a total bandwidth of at least 500 MB/s on Ivy Bridge. But some more in-depth statistics on this instruction would still be great, since its latency and throughput are variable.
RDSEED outputs "true" random bits generated from entropy gathered from a sensor on the chip. RDRAND outputs bits generated from a pseudorandom number generator seeded by the true random number generator. According to Intel's documentation, RDSEED is slower, since gathering entropy is costly.
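For anyone who just wants to exercise the two instructions, here is a minimal sketch using the GCC/Clang intrinsics _rdrand64_step and _rdseed64_step from immintrin.h. The retry loops and compile flags are my additions, not taken from Intel's documentation; real code should bound the retries.

```c
/* Minimal sketch: pull one 64-bit value from RDRAND and one from RDSEED.
 * Compile with: gcc -mrdrnd -mrdseed rdrand_demo.c
 * Both intrinsics return 1 on success and 0 if no data was available,
 * so a retry loop is needed (RDSEED in particular can run dry). */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned long long prng_val = 0, seed_val = 0;

    /* RDRAND: output of the on-chip DRBG (the fast path). */
    while (!_rdrand64_step(&prng_val))
        ;   /* retry until data is returned */

    /* RDSEED: entropy-conditioned seed material (slower, fails more often). */
    while (!_rdseed64_step(&seed_val))
        ;   /* retry; production code should cap the retry count */

    printf("rdrand: %016llx\nrdseed: %016llx\n", prng_val, seed_val);
    return 0;
}
```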
I wrote librdrand. It's a very basic set of routines to use the RdRand instruction to fill buffers with random numbers.
The performance data we showed at IDF is from test software I wrote that spawns a number of threads using pthreads in Linux. Each thread fills a memory buffer with random numbers using RdRand. The program measures the average speed and can iterate while varying the number of threads.
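This is not the original IDF test code, but a minimal sketch of that kind of pthreads benchmark. The buffer size, thread cap, and timing details are assumptions of mine; the structure (each thread filling its own buffer with RdRand, with aggregate MB/s reported at the end) follows the description above.

```c
/* Sketch of a multi-threaded RdRand throughput test.
 * Compile with: gcc -O2 -mrdrnd -pthread rdrand_bench.c */
#include <immintrin.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_WORDS (1u << 20)         /* 8 MiB of 64-bit words per thread */

static void *fill_buffer(void *arg)
{
    unsigned long long *buf = arg;
    for (size_t i = 0; i < BUF_WORDS; i++)
        while (!_rdrand64_step(&buf[i]))
            ;                        /* retry on the rare underflow */
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    if (nthreads < 1 || nthreads > 64)
        nthreads = 4;

    pthread_t tid[64];
    unsigned long long *bufs[64];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++) {
        bufs[i] = malloc(BUF_WORDS * sizeof **bufs);   /* error check omitted */
        pthread_create(&tid[i], NULL, fill_buffer, bufs[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb = (double)nthreads * BUF_WORDS * sizeof(unsigned long long) / 1e6;
    printf("%d threads: %.1f MB in %.3f s = %.1f MB/s\n",
           nthreads, mb, secs, mb / secs);
    return 0;
}
```

Running it with an increasing thread count should reproduce the qualitative behaviour described below: aggregate throughput climbs with thread count until the DRNG saturates.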
Since the round-trip communication latency from each core to the shared DRNG is longer than the time the DRNG needs to generate a random number, the average performance obviously increases as you add threads, up until the maximum throughput is reached. The physical maximum throughput of the DRNG on IVB is 800 MB/s. A 4-core IVB with 8 threads manages something of the order of 780 MB/s. With fewer threads and cores, lower numbers are achieved. The 500 MB/s number is somewhat conservative, but when you're trying to make honest performance claims, you have to be.
Since the DRNG runs at a fixed frequency (800MHz) while the core frequencies may vary, the number of core clock cycles per RdRand varies, depending on the core frequency and the number of other cores simultaneously accessing the DRNG. The curves given in the IDF presentation are a realistic representation of what to expect. The total performance is affected a little by core clock frequency, but not much. The number of threads is what dominates.
One should be careful when measuring RdRand performance to actually 'use' the RdRand result. If you don't, i.e. you do this: RdRand R6, RdRand R6, ..., RdRand R6 repeated many times, the performance will read as artificially high. Since the data isn't used before it is overwritten, the CPU pipeline doesn't wait for the data to come back from the DRNG before it issues the next instruction. The tests we wrote write the resulting data to memory that will be in the on-chip cache, so the pipeline stalls waiting for the data. That is also why hyperthreading is so much more effective with RdRand than with other sorts of code.
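To illustrate that pitfall, here is a rough GCC inline-assembly sketch (mine, not taken from the original tests) contrasting a discarded result with a stored one. The carry-flag retry check that correct RdRand code needs is omitted for brevity; only the dependency difference matters here.

```c
/* The measurement pitfall: if the RdRand result is never consumed,
 * back-to-back RdRands overlap in the pipeline and the benchmark reads
 * artificially fast.  Storing each result to (cached) memory forces the
 * dependency.  GCC inline asm, x86-64 only; timing harness omitted. */
#include <stdint.h>

/* Result is discarded: successive rdrands have no data dependency. */
static inline void rdrand_discard(void)
{
    uint64_t r;
    __asm__ volatile("rdrand %0" : "=r"(r) : : "cc");
    (void)r;                       /* never read: don't benchmark this way */
}

/* Result is stored: the store depends on the rdrand completing. */
static inline void rdrand_store(uint64_t *slot)
{
    uint64_t r;
    __asm__ volatile("rdrand %0" : "=r"(r) : : "cc");
    *slot = r;                     /* consumes the value, as the real test did */
}
```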
The details of the specific platform, clock speed, Linux version and GCC version were given in the IDF slides. I don't remember the numbers off the top of my head. There are chips available that are slower and chips available that are faster. The number we gave for <200 cycles per instruction is based on measurements of about 150 core cycles per instruction.
The chips are available now, so anyone well versed in the use of rdtsc can do the same sort of test.
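As a starting point, a rough single-threaded sketch using the __rdtsc intrinsic might look like the following. Note that TSC ticks are not the same as core cycles when the core clock scales, and the serialization (cpuid/lfence) and warm-up needed for careful measurement are omitted; this is only an estimate under those assumptions.

```c
/* Rough estimate of TSC ticks per RdRand on one thread.
 * Compile with: gcc -O2 -mrdrnd rdrand_tsc.c */
#include <immintrin.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERS 1000000ULL

int main(void)
{
    unsigned long long sink = 0, val;
    unsigned long long start = __rdtsc();
    for (unsigned long long i = 0; i < ITERS; i++) {
        while (!_rdrand64_step(&val))
            ;
        sink ^= val;               /* consume the result so the loop isn't hollow */
    }
    unsigned long long end = __rdtsc();
    printf("~%.1f TSC ticks per RdRand (sink=%llx)\n",
           (double)(end - start) / ITERS, sink);
    return 0;
}
```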
You'll find some relevant information at Intel Digital Random Number Generator (DRNG) Software Implementation Guide.
A verbatim quote follows:
Measured Throughput:
- Up to 70 million RDRAND invocations per second
- 500+ million bytes of random data per second
- Throughput ceiling is insensitive to the number of contending parallel threads
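Those two figures are consistent with each other: assuming each invocation fetches a full 64-bit value, 70 million invocations per second works out to roughly 70M × 8 ≈ 560 MB/s, in line with the 500+ MB/s number.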