
multi-GPU basic usage

How can I use two devices to improve the performance of, for example, the following code (a sum of vectors)? Is it possible to use more than one device at the same time? If so, how can I manage the allocation of the vectors in the global memory of the different devices?

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>

#define NB 32
#define NT 500
#define N NB*NT

__global__ void add( double *a, double *b, double *c);

//===========================================
__global__ void add( double *a, double *b, double *c){

    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }

}

//============================================
//BEGIN
//===========================================
int main( void ) {

    double *a, *b, *c;
    double *dev_a, *dev_b, *dev_c;

    // allocate the memory on the CPU
    a=(double *)malloc(N*sizeof(double));
    b=(double *)malloc(N*sizeof(double));
    c=(double *)malloc(N*sizeof(double));

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, N * sizeof(double) );
    cudaMalloc( (void**)&dev_b, N * sizeof(double) );
    cudaMalloc( (void**)&dev_c, N * sizeof(double) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = (double)i;
        b[i] = (double)i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);

    for(int i=0;i<10000;++i)
        add<<<NB,NT>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);

    // display the results
    // for (int i=0; i<N; i++) {
    //      printf( "%g + %g = %g\n", a[i], b[i], c[i] );
    //  }
    printf("\nGPU done\n");

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    // free the memory allocated on the CPU
    free( a );
    free( b );
    free( c );

    return 0;
}
```

Thank you in advance. Michele

micheletuttafesta asked May 10 '12 08:05


1 Answer

Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about have been relatively easy. Prior to that, you would have needed a multi-threaded host application with one host thread per GPU, plus some sort of inter-thread communication system, in order to use multiple GPUs inside the same host application.

Now it is possible to do something like this for the memory allocation part of your host code:

```cuda
double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};

// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
    cudaSetDevice(dev);
    cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}
```

(disclaimer: written in browser, never compiled, never tested, use at own risk).

The basic idea here is that you use cudaSetDevice to select between devices when you are performing operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second].
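If you would rather not hard-code two GPUs, something along these lines could work. It is only a sketch: it asks the runtime how many devices are visible with cudaGetDeviceCount and prints the slice of the arrays each one would get. The even split with the remainder going to the first devices is my own choice, not something from the snippet above (which gives the odd element to the second device), and the variable names `ndev` and `count` are made up for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N (32 * 500)   // same total length as in the question

int main(void)
{
    // Ask the runtime how many CUDA devices are visible to this process.
    int ndev = 0;
    cudaError_t err = cudaGetDeviceCount(&ndev);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (ndev == 0) {
        printf("no CUDA devices found\n");
        return 1;
    }

    // Split N elements as evenly as possible: the first (N % ndev)
    // devices each take one extra element.
    int pos = 0;
    for (int dev = 0; dev < ndev; dev++) {
        int count = N / ndev + (dev < N % ndev ? 1 : 0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d (%s): elements [%d, %d)\n",
               dev, prop.name, pos, pos + count);

        pos += count;
    }
    return 0;
}
```

The `pos` offset computed here plays the same role as the host-side offset you need when copying each slice of `a` and `b` to its device.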

The transfer of data from the host to device could be as simple as:

```cuda
// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev);
    cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}
```

(disclaimer: written in browser, never compiled, never tested, use at own risk).

The kernel launching section of your code could then look something like:

```cuda
for(int i=0;i<10000;++i) {
    for(int dev=0; dev<2; dev++) {
        cudaSetDevice(dev);
        add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
    }
}
```

(disclaimer: written in browser, never compiled, never tested, use at own risk).

Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I will leave it to you to work out the modifications required. But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
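For what it is worth, one way the kernel modification might look is sketched below. The parameter name `n` is my own choice; it simply replaces the compile-time constant N as the loop bound, so each launch only touches the elements that live on its device. The `const` qualifiers on the inputs are an optional extra.

```cuda
// Element-wise sum of two vectors of length n (grid-stride loop).
// n is the number of elements held on the device this launch runs on.
__global__ void add(const double *a, const double *b, double *c, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;   // advance by the total thread count
    }
}
```

It would be launched exactly as in the loop above, e.g. `add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );`.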

You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features which can be used in recent CUDA versions and hardware to assist multiple GPU applications (like unified addressing, the peer-to-peer facilities, and more), but this should be enough to get you started. There is also a simple multi-GPU application in the CUDA SDK you can look at for more ideas.
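As a rough illustration of how the parts might fit together, including the copy back to the host and the cleanup that the snippets above do not show, here is a minimal sketch. It assumes the machine has at least two GPUs (cudaSetDevice will fail on the second one otherwise). The `CHECK` macro and `NDEV` are hypothetical helpers I have added for error reporting and readability, the kernel takes the element count as its fourth argument, and for brevity the kernel is launched once per device rather than 10000 times.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NB 32
#define NT 500
#define N (NB * NT)
#define NDEV 2   // assumption: two GPUs, as in the answer

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

__global__ void add(const double *a, const double *b, double *c, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

int main(void)
{
    const int Ns[NDEV] = {N / 2, N - (N / 2)};
    double *dev_a[NDEV], *dev_b[NDEV], *dev_c[NDEV];

    // Host arrays, filled as in the question.
    double *a = (double *)malloc(N * sizeof(double));
    double *b = (double *)malloc(N * sizeof(double));
    double *c = (double *)malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        b[i] = (double)i * 2;
    }

    // Allocate, upload and launch on each device in turn. Kernel launches
    // are asynchronous, so device 0 computes while device 1 is being set up.
    for (int dev = 0, pos = 0; dev < NDEV; pos += Ns[dev], dev++) {
        size_t bytes = Ns[dev] * sizeof(double);
        CHECK(cudaSetDevice(dev));
        CHECK(cudaMalloc((void **)&dev_a[dev], bytes));
        CHECK(cudaMalloc((void **)&dev_b[dev], bytes));
        CHECK(cudaMalloc((void **)&dev_c[dev], bytes));
        CHECK(cudaMemcpy(dev_a[dev], a + pos, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(dev_b[dev], b + pos, bytes, cudaMemcpyHostToDevice));
        add<<<NB, NT>>>(dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev]);
        CHECK(cudaGetLastError());   // catches launch configuration errors
    }

    // Copy each device's slice of the result back into the right part of
    // 'c' (the blocking cudaMemcpy waits for that device's kernel), then
    // free that device's memory.
    for (int dev = 0, pos = 0; dev < NDEV; pos += Ns[dev], dev++) {
        size_t bytes = Ns[dev] * sizeof(double);
        CHECK(cudaSetDevice(dev));
        CHECK(cudaMemcpy(c + pos, dev_c[dev], bytes, cudaMemcpyDeviceToHost));
        CHECK(cudaFree(dev_a[dev]));
        CHECK(cudaFree(dev_b[dev]));
        CHECK(cudaFree(dev_c[dev]));
    }

    // Compare against the sum computed on the host.
    int mismatches = 0;
    for (int i = 0; i < N; i++)
        if (c[i] != a[i] + b[i]) mismatches++;
    printf("GPU done, %d mismatches\n", mismatches);

    free(a);
    free(b);
    free(c);
    return mismatches != 0;
}
```

Note that the download loop uses the same `pos` bookkeeping as the upload, so each device's results land in the slice of `c` that matches its inputs.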

talonmies answered Oct 21 '22 08:10