Internal parallelization by CPU

Question

I've been playing around with Xorshift* random number generators, and I came across this exploration of their properties. Quoting from that site (emphasis mine):

How can a xorshift64* generator be slower than a xorshift1024* generator?

Dependencies. The three xor/shifts of a xorshift64* generator must be executed sequentially, as each one is dependent on the result of the previous one. In a xorshift1024* generator two of the xor/shifts are completely independent and can be parallelized internally by the CPU. I also suspect that the larger state space makes it possible for the CPU to perform more aggressively speculative execution (indeed, a xorshift128* generator is slower than a xorshift1024* generator).

What does this parallelization internally by the CPU statement mean? I took it to mean that the CPU would use vector instructions to execute the two xor/shifts concurrently, but I haven't been able to see evidence of this in the assembly output by the compiler. Is this a deep CPU pipelining thing? Or should I be able to see something going on in the produced assembler?

unwind · Accepted Answer

Yes, it's an instruction-level parallelism thing.

Basically such a CPU will have more execution hardware available than is needed for each individual instruction, so it "spreads out" a bunch of instructions over the available resources, then merges the results back so that, to the programmer, it still looks like things are happening sequentially.

What you can see, if you're good at it, is two adjacent instructions that both do work, but that have no dependency. For instance they might be operating on non-overlapping sets of registers only. For such cases, you can guess they might be executed in parallel, resulting in a high instructions-per-cycle value for that particular bit of code.

To make this a bit more concrete, let's look at the two pieces of code you're talking about (also: learning opportunity for me).

Here's the core of xorshift64*:

x ^= x >> 12; // a
x ^= x << 25; // b
x ^= x >> 27; // c
return x * 2685821657736338717LL;

Actually, that's all the code in the function (x is an uint64_t). Clearly every line is touching the state, and modifying it, so each statement depends on the one right before it. To contrast, here's xorshift1024+:

uint64_t s0 = s[ p ];
uint64_t s1 = s[ p = ( p + 1 ) & 15 ];
s1 ^= s1 << 31; // a
s1 ^= s1 >> 11; // b
s0 ^= s0 >> 30; // c
return ( s[ p ] = s0 ^ s1 ) * 1181783497276652981LL;

Here, the global state is in the uint64_t s[16], p variables. Given that, it's perhaps not crystal clear but at least somewhat hinted at, that the line with the // c comment is not sharing any state with the line before it. Thus, it's doing both shifts and an XOR (i.e. "work") that's independent from the similar work being done right before it. So a superscalar processor might be able to run those two lines in parallel, more or less.

Internal parallelization by CPU

Tags:

c++

performance

c

prng

Theolodus

1 Answers

unwind

Recent Activity

Donate For Us

Internal parallelization by CPU

Tags:

c++

performance

c

prng

Theolodus

1 Answers

unwind

Related questions

Recent Activity

Donate For Us