Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the most efficient and portable way to generate Gaussian random numbers in cython?

I am writing a cython application where I need to generate a Gaussian random variable on-the-fly in a tight nested loop. I would like to do this without introducing any extra dependencies, e.g., on GSL.

For a minimal version of the way I am currently able to do this with uniform random numbers on-the-fly:

from libc.stdlib cimport rand, RAND_MAX
import numpy as np

cdef double random_uniform():
    cdef double r = rand()
    return r/RAND_MAX

def my_function(int n):
    cdef int i
    cdef double[:] result = np.zeros(n, dtype='f8', order='C')
    for i in range(n):
        result[i] = random_uniform()
    return result

The above code is functionally equivalent to numpy.random.rand(n), and can be compiled with the following minimal setup file:

from distutils.core import setup
from Cython.Build import cythonize
import numpy as np

setup(ext_modules=cythonize("example.pyx"), include_dirs=[np.get_include()])

# compile instructions:
# python setup.py build_ext --inplace

To answer this question, what I am looking for is the same kind of minimal solution for the functional equivalent of np.random.randn(n), again ideally with any dependency directly cimported from libc.stdlib for reasons of portability.

There is an example implementation on the Wikipedia entry for the Box-Muller algorithm, but I have had trouble implementing it due to the way that the constant epsilon is defined.

like image 566
aph Avatar asked Mar 13 '17 15:03

aph


People also ask

What function in random module should you use to generate random numbers from range?

We can use randbelow() function from the secrets module to generate random integers. The below example uses randbelow() to randomly print integers.

Which distribution is used to generate random numbers?

The uniform probability distribution is used to generate random numbers from other distributions and also is useful as a “first guess” if no information about a random variable X is known other than that it is between a and b.


2 Answers

I created a function that generates gaussian-distributed random numbers based on the polar version of the Box-Muller transformation, as described by the pseudocode here. (I originally found this on the page archived here.)

This method generates two gaussian-distributed random numbers at a time. That means to get full cython speed, we need to figure out a way to pass two numbers around without turning them into Python objects. The most straightforward way to do so (that I can think of) is to pass the buffer in for direct manipulation by the generator. That's what my_gaussian_fast does, and it beats numpy by a modest margin.

from libc.stdlib cimport rand, RAND_MAX
from libc.math cimport log, sqrt
import numpy as np
import cython

cdef double random_uniform():
    cdef double r = rand()
    return r / RAND_MAX

cdef double random_gaussian():
    cdef double x1, x2, w

    w = 2.0
    while (w >= 1.0):
        x1 = 2.0 * random_uniform() - 1.0
        x2 = 2.0 * random_uniform() - 1.0
        w = x1 * x1 + x2 * x2

    w = ((-2.0 * log(w)) / w) ** 0.5
    return x1 * w

@cython.boundscheck(False)
cdef void assign_random_gaussian_pair(double[:] out, int assign_ix):
    cdef double x1, x2, w

    w = 2.0
    while (w >= 1.0):
        x1 = 2.0 * random_uniform() - 1.0
        x2 = 2.0 * random_uniform() - 1.0
        w = x1 * x1 + x2 * x2

    w = sqrt((-2.0 * log(w)) / w)
    out[assign_ix] = x1 * w
    out[assign_ix + 1] = x2 * w

@cython.boundscheck(False)
def my_uniform(int n):
    cdef int i
    cdef double[:] result = np.zeros(n, dtype='f8', order='C')
    for i in range(n):
        result[i] = random_uniform()
    return result

@cython.boundscheck(False)
def my_gaussian(int n):
    cdef int i
    cdef double[:] result = np.zeros(n, dtype='f8', order='C')
    for i in range(n):
        result[i] = random_gaussian()
    return result

@cython.boundscheck(False)
def my_gaussian_fast(int n):
    cdef int i
    cdef double[:] result = np.zeros(n, dtype='f8', order='C')
    for i in range(n // 2):  # Int division ensures trailing index if n is odd.
        assign_random_gaussian_pair(result, i * 2)
    if n % 2 == 1:
        result[n - 1] = random_gaussian()

    return result

Tests. Here's a uniform benchmark:

In [3]: %timeit numpy.random.uniform(size=10000)
10000 loops, best of 3: 130 µs per loop

In [4]: %timeit numpy.array(example.my_uniform(10000))
10000 loops, best of 3: 85.4 µs per loop

So this is definitely faster than numpy for ordinary random numbers. And if we're smart about it, it's faster for gaussian random numbers too:

In [5]: %timeit numpy.random.normal(size=10000)
1000 loops, best of 3: 393 µs per loop

In [6]: %timeit numpy.array(example.my_gaussian(10000))
1000 loops, best of 3: 542 µs per loop

In [7]: %timeit numpy.array(example.my_gaussian_fast(10000))
1000 loops, best of 3: 266 µs per loop

As confirmed by Robert Kern, numpy uses both values generated. my_gaussian throws one away; my_gaussian_fast uses both and stores them quickly. (See this answer's history for a naive my_gaussian_pair that tries to return the pair in a slow way.)

like image 162
senderle Avatar answered Oct 14 '22 11:10

senderle


You say that you have issues implementing the Box-Muller transform because of the way they define espilon:

const double epsilon = std::numeric_limits<double>::min();

According to here this is the C equivalent:

const double lowest_double = -DBL_MAX;

So to get the correct imports in Cython:

from libc.float import DBL_MAX #it should still be portable btw.

Now the issue with epsilon should be resolved.

like image 1
Dair Avatar answered Oct 14 '22 11:10

Dair