How can I use two devices to improve, for example, the performance of the following code (a sum of vectors)? Is it possible to use more devices "at the same time"? If so, how can I manage the allocation of the vectors in the global memory of the different devices?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>

#define NB 32
#define NT 500
#define N NB*NT

__global__ void add( double *a, double *b, double *c);

//===========================================
__global__ void add( double *a, double *b, double *c){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

//============================================
//BEGIN
//===========================================
int main( void ) {

    double *a, *b, *c;
    double *dev_a, *dev_b, *dev_c;

    // allocate the memory on the CPU
    a=(double *)malloc(N*sizeof(double));
    b=(double *)malloc(N*sizeof(double));
    c=(double *)malloc(N*sizeof(double));

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, N * sizeof(double) );
    cudaMalloc( (void**)&dev_b, N * sizeof(double) );
    cudaMalloc( (void**)&dev_c, N * sizeof(double) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = (double)i;
        b[i] = (double)i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);

    for(int i=0;i<10000;++i)
        add<<<NB,NT>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);

    // display the results
    // for (int i=0; i<N; i++) {
    //     printf( "%g + %g = %g\n", a[i], b[i], c[i] );
    // }

    printf("\nGPU done\n");

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );

    // free the memory allocated on the CPU
    free( a );
    free( b );
    free( c );

    return 0;
}
Thank you in advance. Michele
Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have needed to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use multiple GPUs inside the same host application.
Now it is possible to do something like this for the memory allocation part of your host code:
double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};

// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
    cudaSetDevice(dev);
    cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The basic idea here is that you use cudaSetDevice to select between devices when you are performing operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second, so that the split also works when N is odd].
The transfer of data from the host to device could be as simple as:
// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev);
    cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The kernel launching section of your code could then look something like:
for(int i=0;i<10000;++i) {
    for(int dev=0; dev<2; dev++) {
        cudaSetDevice(dev);
        add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
    }
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
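After the kernels complete, copying the partial results back to the host follows the same pattern as the host-to-device transfers above, using the same per-device offsets. A sketch under the same assumptions (and the same disclaimer: written in browser, untested):

// copy the partial results of 'c' back from each GPU to the
// matching region of the host array
for(int dev=0, pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev);
    cudaMemcpy( c+pos, dev_c[dev], Ns[dev] * sizeof(double), cudaMemcpyDeviceToHost);
}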
Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I will leave it to you to work out the modifications required, but, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
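For illustration, the required modification essentially amounts to replacing the compile-time constant N with the new runtime argument (a sketch in the same spirit, untested):

// same grid-stride loop as the original kernel, but bounded by a
// per-device element count instead of the compile-time N
__global__ void add( double *a, double *b, double *c, int n){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while(tid < n){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}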
You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features in recent CUDA versions and hardware which can assist multi-GPU applications (such as unified addressing and the peer-to-peer facilities), but this should be enough to get you started. There is also a simple multi-GPU application in the CUDA SDK which you can look at for more ideas.
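As a taste of those peer-to-peer facilities, enabling direct access between two devices (where the hardware supports it) might look something like this minimal sketch, which assumes device IDs 0 and 1 (untested, same caveats as above):

// check whether device 0 can directly address device 1's memory,
// and enable peer access if so
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0); // second argument (flags) must be 0
}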