
CUDA C programming with 2 video cards

Tags:

c

cuda

nvidia

I am very new to CUDA programming and have been reading the 'CUDA C Programming Guide' provided by NVIDIA (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf).

On page 25 it has the following C code, which performs matrix multiplication. Can you please tell me how I can make that code run on two devices (if I have two NVIDIA CUDA-capable cards installed in my computer)? Can you please show me with an example?

// Matrices are stored in row-major order: 
// M(row, col) = *(M.elements + row * M.stride + col) 
typedef struct { 
    int width; 
    int height; 
    int stride; 
    float* elements; 
} Matrix; 

// Get a matrix element 
__device__ float GetElement(const Matrix A, int row, int col) 
{ 
    return A.elements[row * A.stride + col]; 
} 

// Set a matrix element 
__device__ void SetElement(Matrix A, int row, int col, float value) 
{ 
    A.elements[row * A.stride + col] = value; 
} 

// Thread block size 
#define BLOCK_SIZE 16 

// Get the BLOCK_SIZE x BLOCK_SIZE sub-matrix Asub of A that is 
// located col sub-matrices to the right and row sub-matrices down 
// from the upper-left corner of A 
__device__ Matrix GetSubMatrix(Matrix A, int row, int col) 
{ 
    Matrix Asub; 
    Asub.width = BLOCK_SIZE; 
    Asub.height = BLOCK_SIZE; 
    Asub.stride = A.stride; 
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col]; 
    return Asub; 
} 

// Forward declaration of the matrix multiplication kernel 
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix); 

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE 
void MatMul(const Matrix A, const Matrix B, Matrix C) 
{ 
    // Load A and B to device memory 
    Matrix d_A; 
    d_A.width = d_A.stride = A.width; d_A.height = A.height; 
    size_t size = A.width * A.height * sizeof(float); 
    cudaMalloc(&d_A.elements, size); 
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice); 
    Matrix d_B; 
    d_B.width = d_B.stride = B.width; d_B.height = B.height; 
    size = B.width * B.height * sizeof(float); 
    cudaMalloc(&d_B.elements, size); 
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice); 

    // Allocate C in device memory 
    Matrix d_C; 
    d_C.width = d_C.stride = C.width; d_C.height = C.height; 
    size = C.width * C.height * sizeof(float); 
    cudaMalloc(&d_C.elements, size); 

    // Invoke kernel 
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); 
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y); 
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); 

    // Read C from device memory 
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost); 

    // Free device memory 
    cudaFree(d_A.elements); 
    cudaFree(d_B.elements); 
    cudaFree(d_C.elements); 
} 

// Matrix multiplication kernel called by MatMul() 
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) 
{ 
    // Block row and column 
    int blockRow = blockIdx.y; 
    int blockCol = blockIdx.x; 

    // Each thread block computes one sub-matrix Csub of C 
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub 
    // by accumulating results into Cvalue 
    float Cvalue = 0; 

    // Thread row and column within Csub 
    int row = threadIdx.y; 
    int col = threadIdx.x; 

    // Loop over all the sub-matrices of A and B that are 
    // required to compute Csub 
    // Multiply each pair of sub-matrices together 
    // and accumulate the results 
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) 
    { 
        // Get sub-matrix Asub of A 
        Matrix Asub = GetSubMatrix(A, blockRow, m); 
        // Get sub-matrix Bsub of B 
        Matrix Bsub = GetSubMatrix(B, m, blockCol); 

        // Shared memory used to store Asub and Bsub respectively 
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; 
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE]; 

        // Load Asub and Bsub from device memory to shared memory 
        // Each thread loads one element of each sub-matrix 
        As[row][col] = GetElement(Asub, row, col); 
        Bs[row][col] = GetElement(Bsub, row, col); 

        // Synchronize to make sure the sub-matrices are loaded 
        // before starting the computation 
        __syncthreads(); 

        // Multiply Asub and Bsub together 
        for (int e = 0; e < BLOCK_SIZE; ++e) 
            Cvalue += As[row][e] * Bs[e][col]; 

        // Synchronize to make sure that the preceding 
        // computation is done before loading two new 
        // sub-matrices of A and B in the next iteration 
        __syncthreads(); 
    } 

    // Write Csub to device memory 
    // Each thread writes one element 
    SetElement(Csub, row, col, Cvalue); 
}
asked Jul 16 '12 by Dale B


2 Answers

There is no "automatic" way to run a CUDA kernel on multiple GPUs.

You will need to devise a way to decompose the matrix multiplication problem into independent operations that can run in parallel, one on each GPU. As a simple example:

C = A.B is equivalent to C = A.[B1|B2] = [A.B1|A.B2], where B1 and B2 are suitably sized matrices containing the columns of B and | denotes column-wise concatenation. You can compute A.B1 and A.B2 as separate matrix multiplication operations (one per GPU), and then perform the concatenation when copying the resulting sub-matrices back to host memory.

Once you have a suitable decomposition scheme, you then implement it using the standard multi-GPU facilities of the CUDA 4.x API (in particular cudaSetDevice()). For a great overview of multi-GPU programming using the CUDA APIs, I recommend watching Paulius Micikevicius' excellent talk from GTC 2012, which is available as a streaming video and PDF here.
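
To make the idea concrete, here is a minimal sketch (not from the guide, and with no error checking) of how the host function from the question could be split across two GPUs. It applies the same decomposition to the rows of A instead of the columns of B, i.e. C = [A1.B ; A2.B], because row halves of A and C are contiguous in row-major storage and can be copied with plain cudaMemcpy. MatMulTwoGPUs is a made-up name; it assumes exactly two devices and that A.height is a multiple of 2 * BLOCK_SIZE. The device code (MatMulKernel, GetSubMatrix, etc.) is reused unchanged.

// Two-GPU variant of MatMul(): the top half of A (and C) is processed on
// device 0, the bottom half on device 1, and each device gets its own copy of B.
void MatMulTwoGPUs(const Matrix A, const Matrix B, Matrix C)
{
    const int halfRows = A.height / 2;

    Matrix d_A[2], d_B[2], d_C[2];

    for (int dev = 0; dev < 2; ++dev)
    {
        cudaSetDevice(dev);   // all following runtime calls target this GPU

        // Half of A (top half on device 0, bottom half on device 1)
        d_A[dev].width  = d_A[dev].stride = A.width;
        d_A[dev].height = halfRows;
        size_t sizeA = (size_t)A.width * halfRows * sizeof(float);
        cudaMalloc(&d_A[dev].elements, sizeA);
        cudaMemcpy(d_A[dev].elements,
                   A.elements + (size_t)dev * halfRows * A.width,
                   sizeA, cudaMemcpyHostToDevice);

        // Each device needs all of B
        d_B[dev].width  = d_B[dev].stride = B.width;
        d_B[dev].height = B.height;
        size_t sizeB = (size_t)B.width * B.height * sizeof(float);
        cudaMalloc(&d_B[dev].elements, sizeB);
        cudaMemcpy(d_B[dev].elements, B.elements, sizeB, cudaMemcpyHostToDevice);

        // Half of C
        d_C[dev].width  = d_C[dev].stride = C.width;
        d_C[dev].height = halfRows;
        size_t sizeC = (size_t)C.width * halfRows * sizeof(float);
        cudaMalloc(&d_C[dev].elements, sizeC);

        // Launch the unchanged kernel on this device's half of the problem
        dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
        dim3 dimGrid(B.width / dimBlock.x, halfRows / dimBlock.y);
        MatMulKernel<<<dimGrid, dimBlock>>>(d_A[dev], d_B[dev], d_C[dev]);
    }

    // Copy the two halves of C back into place and clean up
    for (int dev = 0; dev < 2; ++dev)
    {
        cudaSetDevice(dev);
        size_t sizeC = (size_t)C.width * halfRows * sizeof(float);
        cudaMemcpy(C.elements + (size_t)dev * halfRows * C.width,
                   d_C[dev].elements, sizeC, cudaMemcpyDeviceToHost);
        cudaFree(d_A[dev].elements);
        cudaFree(d_B[dev].elements);
        cudaFree(d_C[dev].elements);
    }
}

Because kernel launches are asynchronous, the first loop queues work on both GPUs before the blocking copies in the second loop wait for each device to finish, so the two halves really do run concurrently.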

answered by talonmies


The basics are described in the CUDA C Programming Guide under section 3.2.6.

Basically, you can select which GPU the current host thread operates on by calling cudaSetDevice(). You still have to write your own code to decompose your routines so that the work is split across multiple GPUs.
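
For illustration, a rough sketch of that pattern (the names N, h_chunk, MyKernel and the launch configuration are hypothetical, and error checking is omitted): after cudaSetDevice(), subsequent runtime calls from the same host thread target the selected GPU.

// Drive each visible GPU in turn from a single host thread
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);            // e.g. 2 with two CUDA-capable cards

for (int dev = 0; dev < deviceCount; ++dev)
{
    cudaSetDevice(dev);                      // make this GPU current for the host thread

    float* d_buf;                            // allocated on the currently selected device
    cudaMalloc(&d_buf, N * sizeof(float));
    cudaMemcpy(d_buf, h_chunk[dev], N * sizeof(float), cudaMemcpyHostToDevice);

    MyKernel<<<blocks, threads>>>(d_buf, N); // launched on the currently selected device

    cudaMemcpy(h_chunk[dev], d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}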

answered by Stephan Dollberg