Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AVX/SSE version of xorshift128+

I am trying to make the fastest possible high quality RNG. Having read http://xorshift.di.unimi.it/ , xorshift128+ seems like a good option. The C code is

#include <stdint.h>
uint64_t s[ 2 ];

uint64_t next(void) { 
    uint64_t s1 = s[ 0 ];
    const uint64_t s0 = s[ 1 ];
    s[ 0 ] = s0;
    s1 ^= s1 << 23; // a
    return ( s[ 1 ] = ( s1 ^ s0 ^ ( s1 >> 17 ) ^ ( s0 >> 26 ) ) ) + s0; // b, c
}

I am not an SSE/AVX expert sadly but my CPU supports SSE4.1 / SSE4.2 / AVX / F16C / FMA3 / XOP instructions. How could you use these to speed up this code (assuming you want to make billions of such random numbers) and what is the expected limit to this speedup in practice?

like image 491
graffe Avatar asked Jun 02 '14 19:06

graffe


2 Answers

For anyone else who might reach this question, I think this C++ code implements correctly 4 xorshift128plus generators running in parallel, using AVX2:

__m256i xorshift128plus_avx2(__m256i &state0, __m256i &state1)
{
    __m256i s1 = state0;
    const __m256i s0 = state1;
    state0 = s0;
    s1 = _mm256_xor_si256(s1, _mm256_slli_epi64(s1, 23));
    state1 = _mm256_xor_si256(_mm256_xor_si256(_mm256_xor_si256(s1, s0),
                                               _mm256_srli_epi64(s1, 18)),
                              _mm256_srli_epi64(s0, 5));
    return _mm256_add_epi64(state1, s0);
}

The scalar implementation I used is:

u64 xorshift128plus(u64 &state0, u64 &state1)
{
    u64 s1 = state0;
    const u64 s0 = state1;
    state0 = s0;
    s1 ^= s1 << 23;                              // a
    state1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
    return state1 + s0;
}

Which is the same one in the xorshiftplus paper. Note that the right-shift constants from the original question do not correspond to the ones in the paper.

like image 123
o9000 Avatar answered Nov 05 '22 00:11

o9000


XorShift is indeed a good choice. It is so good, so fast and requires so little state that I'm surprised to see so little adoption. It should be the standard generator on all platforms. I have implemented it myself 8 years ago and even then it could generate 800MB/s of random bytes.

You cannot use vector instructions to speed up generating a single random number. There is too little instruction-level parallelism in those few instructions.

But you can easily speed up generating N numbers where N is the vector size of your target instruction set. Just run N generators in parallel. Keep state for N generators and generate N numbers at the same time.

If client code demands numbers one at a time you could keep a buffer of N (or more) numbers. If the buffer is empty you fill it using vector instructions. If the buffer is not empty you just return the next number.

like image 22
usr Avatar answered Nov 04 '22 23:11

usr