
How to generate random number inside pyCUDA kernel?

Tags:

cuda

pycuda

I am using PyCUDA for CUDA programming and I need to use random numbers inside a kernel function. The CURAND library doesn't work inside it (PyCUDA). Since there is a lot of work to be done on the GPU, generating the random numbers on the CPU and then transferring them to the GPU won't work; it would defeat the purpose of using the GPU in the first place.

Supplementary Questions:

  1. Is there a way to allocate memory on the GPU using 1 block and 1 thread?
  2. I am using more than one kernel. Do I need to use multiple SourceModule blocks?
asked Sep 12 '17 by Bhaskar Dhariyal


1 Answer

Despite what you assert in your question, PyCUDA has pretty comprehensive support for cuRAND. The GPUArray module has a direct interface for filling device memory using the host-side API (noting that the random generators run on the GPU in this case).
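
For example, here is a minimal host-side sketch (this assumes the cuRAND-backed XORWOWRandomNumberGenerator class in pycuda.curandom; exactly which generator classes are exposed depends on your PyCUDA and CUDA versions):

import numpy as np
import pycuda.autoinit
from pycuda import curandom

# Host-side cuRAND API: the generator object launches cuRAND kernels on the
# GPU and fills a GPUArray directly -- no per-thread state management needed
gen = curandom.XORWOWRandomNumberGenerator()
data = gen.gen_uniform((10000,), dtype=np.float32)   # uniform floats, generated on the device
print(data.get()[:5])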

It is also perfectly possible to use the device-side API from cuRAND in PyCUDA kernel code. In this use case the trickiest part is allocating memory for the per-thread generator states. There are three choices -- statically in code, dynamically using a host-side memory allocation, and dynamically using a device-side memory allocation. The following (very lightly tested) example illustrates the last of these, seeing as you asked about it in your question; a short sketch of the static-allocation variant follows the example:

import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray

code = """
    #include <curand_kernel.h>

    // one generator state pointer per thread; the states themselves are
    // allocated on the device heap by initkernel
    const int nstates = %(NGENERATORS)s;
    __device__ curandState_t* states[nstates];

    // run once with at least nstates threads: allocates one state per
    // thread with device-side new and seeds it
    __global__ void initkernel(int seed)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            curandState_t* s = new curandState_t;
            if (s != 0) {
                curand_init(seed, tidx, 0, s);
            }

            states[tidx] = s;
        }
    }

    // fills values[0..N) with uniform randoms via a grid-stride loop,
    // each thread using (and updating) its own generator state
    __global__ void randfillkernel(float *values, int N)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            curandState_t s = *states[tidx];
            for(int i=tidx; i < N; i += blockDim.x * gridDim.x) {
                values[i] = curand_uniform(&s);
            }
            *states[tidx] = s;
        }
    }
"""

N = 1024
# no_extern_c leaves the kernels with C++ mangled names, so get_function
# needs those; arch should match your GPU (or be omitted so PyCUDA picks
# the current device's architecture)
mod = SourceModule(code % { "NGENERATORS" : N }, no_extern_c=True, arch="sm_52")
init_func = mod.get_function("_Z10initkerneli")
fill_func = mod.get_function("_Z14randfillkernelPfi")

seed = np.int32(123456789)
nvalues = 10 * N
# one thread per generator state
init_func(seed, block=(N,1,1), grid=(1,1,1))
gdata = gpuarray.zeros(nvalues, dtype=np.float32)
fill_func(gdata, np.int32(nvalues), block=(N,1,1), grid=(1,1,1))
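
For completeness, here is a rough sketch (untested, following the same pattern) of the static-allocation alternative mentioned above: the state array itself lives in statically allocated device memory, so initkernel no longer needs device-side new and the heap size is not a concern. This source string would simply replace the one above; it is compiled and launched in exactly the same way, and the kernel signatures (and therefore the mangled names) are unchanged:

code = """
    #include <curand_kernel.h>

    const int nstates = %(NGENERATORS)s;
    // states live in statically allocated device memory: no device-side
    // new, so the malloc heap size limit is irrelevant here
    __device__ curandState_t states[nstates];

    __global__ void initkernel(int seed)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            curand_init(seed, tidx, 0, &states[tidx]);
        }
    }

    __global__ void randfillkernel(float *values, int N)
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < nstates) {
            curandState_t s = states[tidx];
            for(int i=tidx; i < N; i += blockDim.x * gridDim.x) {
                values[i] = curand_uniform(&s);
            }
            states[tidx] = s;
        }
    }
"""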

In both versions there is an initialization kernel which needs to be run once to set up the generator states and initialize them with the seed, and then a kernel which uses those states. With the device-side allocation you will also need to be mindful of the malloc heap size limit if you want to run a lot of threads, but that can be manipulated via the PyCUDA driver API interface.
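
A minimal sketch of doing that (this assumes your PyCUDA build exposes Context.set_limit/get_limit and the limit enumeration, which wrap cuCtxSetLimit and require a reasonably recent CUDA toolkit):

import pycuda.autoinit
import pycuda.driver as drv

# each thread allocates one curandState_t (a few dozen bytes) with device-side
# new, so raise the device heap limit before running initkernel if the number
# of generator states gets large
drv.Context.set_limit(drv.limit.MALLOC_HEAP_SIZE, 64 * 1024 * 1024)  # 64 MB
print(drv.Context.get_limit(drv.limit.MALLOC_HEAP_SIZE))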

answered Oct 28 '22 (2 revs)