multi-GPU basic usage

How can I use two devices to improve, for example, the performance of the following code (a sum of vectors)? Is it possible to use several devices "at the same time"? If so, how can I manage the allocation of the vectors in the global memory of the different devices?

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>

#define NB 32
#define NT 500
#define N NB*NT

__global__ void add( double *a, double *b, double *c);

//===========================================
__global__ void add( double *a, double *b, double *c){

    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }

}

//============================================
//BEGIN
//===========================================
int main( void ) {

    double *a, *b, *c;
    double *dev_a, *dev_b, *dev_c;

    // allocate the memory on the CPU
    a=(double *)malloc(N*sizeof(double));
    b=(double *)malloc(N*sizeof(double));
    c=(double *)malloc(N*sizeof(double));

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, N * sizeof(double) );
    cudaMalloc( (void**)&dev_b, N * sizeof(double) );
    cudaMalloc( (void**)&dev_c, N * sizeof(double) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = (double)i;
        b[i] = (double)i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);

    for(int i=0;i<10000;++i)
        add<<<NB,NT>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);

    // display the results
    // for (int i=0; i<N; i++) {
    //      printf( "%g + %g = %g\n", a[i], b[i], c[i] );
    //  }
    printf("\nGPU done\n");

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    // free the memory allocated on the CPU
    free( a );
    free( b );
    free( c );

    return 0;
}

Thank you in advance. Michele


asked May 10 '12 08:05

micheletuttafesta

1 Answer

Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have needed a multi-threaded host application, with one host thread per GPU and some sort of inter-thread communication system, in order to use multiple GPUs inside the same host application.

Now it is possible to do something like this for the memory allocation part of your host code:

double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};

// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
    cudaSetDevice(dev);
    cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}

(disclaimer: written in browser, never compiled, never tested, use at own risk).

The basic idea here is that you use cudaSetDevice to select between devices when you are performing operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second].
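The two-way split above generalizes to however many devices the runtime reports. The following is a minimal sketch, not part of the original answer: `chunk_size`, `chunk_offset` and `visible_device_count` are hypothetical helper names, and giving the remainder to the last device is just one choice that reproduces the `{N/2, N-(N/2)}` split when there are two devices.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of elements assigned to device `dev` when
// `n` elements are split across `num_devs` devices. Every device gets
// n / num_devs elements; the last one also takes the remainder.
static int chunk_size(int n, int num_devs, int dev)
{
    int base = n / num_devs;
    return (dev == num_devs - 1) ? n - base * (num_devs - 1) : base;
}

// Hypothetical helper: index of the first element owned by device `dev`.
static int chunk_offset(int n, int num_devs, int dev)
{
    return dev * (n / num_devs);
}

// Number of CUDA devices visible to the runtime, or 0 if the query fails.
static int visible_device_count(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    return count;
}
```

With two devices and N = 16000 (32 blocks of 500 threads, as in the question), `chunk_size` gives 8000 elements per device.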

The transfer of data from the host to device could be as simple as:

// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev);
    cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}

(disclaimer: written in browser, never compiled, never tested, use at own risk).

The kernel launching section of your code could then look something like:

for(int i=0;i<10000;++i) {
    for(int dev=0; dev<2; dev++) {
        cudaSetDevice(dev);
        add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
    }
}

(disclaimer: written in browser, never compiled, never tested, use at own risk).

Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I will leave it to you to work out the modifications required. But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
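One possible way to make that modification, offered as a sketch rather than as the answer author's own code: pass the element count as a fourth parameter (called `n` here, an assumed name) and use it in place of the compile-time `N` in the loop bound. The `const` qualifiers on the inputs are also an addition.

```cuda
#include <cuda_runtime.h>

// Grid-stride vector add over the first n elements of a and b.
// n replaces the compile-time constant N, so each GPU can be handed
// a chunk of a different length.
__global__ void add(const double *a, const double *b, double *c, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}
```

Because the loop strides by the total number of launched threads, the same `<<<NB,NT>>>` configuration still covers a chunk whether it is shorter or longer than NB*NT elements.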

You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features which can be used in recent CUDA versions and hardware to assist multiple GPU applications (like unified addressing, the peer-to-peer facilities, and more), but this should be enough to get you started. There is also a simple multi-GPU application in the CUDA SDK you can look at for more ideas.
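The snippets above stop after the kernel launches, so the copy back to the host and the cleanup still have to be adapted. A sketch of those remaining steps follows, using the same per-device arrays; `gather_results` is a hypothetical function name and the error handling is an addition. Kernel launches return control to the host immediately, so the GPUs can work concurrently, while the blocking `cudaMemcpy` on each device waits for the kernels launched earlier on that device's default stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: copy each device's chunk of the result into the host
// array c, then free that device's buffers. Returns the first CUDA error
// encountered, or cudaSuccess.
static cudaError_t gather_results(double *c,
                                  double *dev_a[], double *dev_b[],
                                  double *dev_c[], const int Ns[],
                                  int num_devs)
{
    cudaError_t first_err = cudaSuccess;
    int pos = 0;  // offset of the current device's chunk within c

    for (int dev = 0; dev < num_devs; dev++) {
        cudaError_t err = cudaSetDevice(dev);
        if (err == cudaSuccess) {
            // Blocking copy: returns once this device's earlier kernels
            // on the default stream have completed and the data has arrived.
            err = cudaMemcpy(c + pos, dev_c[dev], Ns[dev] * sizeof(double),
                             cudaMemcpyDeviceToHost);
            cudaFree(dev_a[dev]);
            cudaFree(dev_b[dev]);
            cudaFree(dev_c[dev]);
        }
        if (first_err == cudaSuccess) {
            first_err = err;
        }
        pos += Ns[dev];
    }
    return first_err;
}
```

Called as `gather_results(c, dev_a, dev_b, dev_c, Ns, 2)` after the launch loop, this would take the place of the single `cudaMemcpy` and the three `cudaFree` calls in the original program.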


answered Oct 21 '22 08:10

talonmies
