I'm a newbie in multi-GPU programming and I have some questions about multi-GPU computing. For instance, let's take the dot-product example. I'm running a CPU thread that creates 2 large arrays A[N] and B[N]. Due to the size of these arrays I need to split the computation of their dot product across 2 GPUs, both Tesla M2050 (compute capability 2.0). The problem is that I need to compute these dot products several times inside a do-loop controlled by my CPU thread, and each dot product requires the result of the previous one. I've read about creating 2 different threads that control the 2 GPUs separately (as described in CUDA by Example), but I have no clue how to synchronize and exchange data between them. Is there another alternative? I'd really appreciate any kind of help/example. Thanks in advance!
Before CUDA 4.0, multi-GPU programming required multi-threaded CPU programming. This can be challenging, especially when you need to synchronize and/or communicate between the threads or GPUs. And if all of your parallelism is in your GPU code, having multiple CPU threads may add to the complexity of your software without improving performance beyond what the GPU code already achieves.
So, starting with CUDA 4.0, you can easily program multiple GPUs from a single-threaded host program. Here are some slides I presented last year about this.
Programming multiple GPUs can be as simple as this:
int numDevs = 0;
cudaGetDeviceCount(&numDevs);
...
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);
    kernel<<<blocks, threads>>>(args);
}
For your specific example of dot products, you could use thrust::inner_product as a starting point. I would do that for prototyping. But see my comments at the end about bandwidth bottlenecks.
Since you didn't provide enough detail about your outer loop that runs the dot products multiple times, I didn't attempt to do anything with that.
// assume the deviceIDs of the two 2050s are dev0 and dev1.
// assume the two input vectors for the dot product are on the host in h_vecA and h_vecB.
// assume that n is the number of elements in h_vecA and h_vecB.
int numDevs = 0;
cudaGetDeviceCount(&numDevs);
...
float result = 0.f;
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);
    thrust::device_vector<float> vecA(h_vecA + d*(n/numDevs), h_vecA + (d+1)*(n/numDevs));
    thrust::device_vector<float> vecB(h_vecB + d*(n/numDevs), h_vecB + (d+1)*(n/numDevs));
    result += thrust::inner_product(vecA.begin(), vecA.end(), vecB.begin(), 0.f);
}
I admit that the indexing above is not correct if n is not an even multiple of numDevs, but I'll leave fixing that as an exercise for the reader. :)
This is simple, and is a great start. Get it working first, then optimize.
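(In case it helps: one possible way to handle the uneven case, a sketch only, is to give the leftover elements to the last device.)

// Possible fix for uneven splits: each device gets n/numDevs elements,
// and the last device also takes the leftover n % numDevs elements.
int chunk = n / numDevs;
int first = d * chunk;
int count = (d == numDevs - 1) ? (n - first) : chunk;
thrust::device_vector<float> vecA(h_vecA + first, h_vecA + first + count);
thrust::device_vector<float> vecB(h_vecB + first, h_vecB + first + count);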
Once you have it working, if all you are doing on the devices is dot products, you will find you are bandwidth bound -- mostly by PCI-e -- and you also won't get concurrency between the devices, because thrust::inner_product is synchronous due to the read back to return the result. So you could use cudaMemcpyAsync (the device_vector constructor will use cudaMemcpy). But the easier and likely more efficient approach would be to use "zero copy" -- directly accessing host memory (also discussed in the multi-GPU programming presentation linked above). Since all you are doing is reading each value once and adding it to the sum (the parallel reuse happens in a shared memory copy), you might as well read it from the host directly rather than copying it from host to device and then reading it from device memory in the kernel. You would also want asynchronous launches of the kernel on each GPU, to ensure maximum concurrency.
You could do something like this:
// Assumes h_vecA, h_vecB and results are declared as float*, and that dptrsA,
// dptrsB, dresults (float*) and event (cudaEvent_t) are arrays with at least
// numDevs entries.
int bytes = sizeof(float) * n;
cudaHostAlloc((void **)&h_vecA, bytes, cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostAlloc((void **)&h_vecB, bytes, cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostAlloc((void **)&results, numDevs * sizeof(float), cudaHostAllocMapped | cudaHostAllocPortable);
// ... then fill your input arrays h_vecA and h_vecB
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);  // each device may also need cudaSetDeviceFlags(cudaDeviceMapHost) before its first CUDA call
    cudaEventCreate(&event[d]);
    cudaHostGetDevicePointer((void **)&dptrsA[d], h_vecA, 0);
    cudaHostGetDevicePointer((void **)&dptrsB[d], h_vecB, 0);
    cudaHostGetDevicePointer((void **)&dresults[d], results, 0);
}
...
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);
    int first = d * (n / numDevs);
    int last  = (d + 1) * (n / numDevs);   // one past the end of this device's slice
    my_inner_product<<<grid, block>>>(dresults[d] + d,     // this device's slot in results
                                      dptrsA[d] + first,
                                      dptrsA[d] + last,
                                      dptrsB[d] + first,
                                      0.f);
    cudaEventRecord(event[d], 0);
}
// wait for all devices, then accumulate the per-device partial results
float total = 0.0f;
for (int d = 0; d < numDevs; d++) {
    cudaEventSynchronize(event[d]);
    total += results[d];
}
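I haven't shown my_inner_product; here is a minimal sketch of what it could look like, assuming for simplicity that it is launched with a single block of 256 threads per device (my_inner_product<<<1, 256>>>), so no inter-block reduction is needed. It is not the fastest possible kernel, but since this workload is PCI-e bound anyway, it illustrates the idea:

// Sketch only: each thread accumulates a strided partial sum over this
// device's slice, the block reduces the partials in shared memory, and
// thread 0 writes the per-device result to its slot in mapped host memory.
#define BLOCK_SIZE 256

__global__ void my_inner_product(float *result, const float *a_first,
                                 const float *a_last, const float *b_first,
                                 float init)
{
    __shared__ float sdata[BLOCK_SIZE];
    int n   = a_last - a_first;   // number of elements in this device's slice
    int tid = threadIdx.x;

    float sum = 0.0f;
    for (int i = tid; i < n; i += blockDim.x)
        sum += a_first[i] * b_first[i];
    sdata[tid] = sum;
    __syncthreads();

    // standard shared-memory tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        *result = sdata[0] + init;
}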
To create several threads, you can use either OpenMP or pthreads. To do what you're talking about, it seems like you would need to make and launch two threads (an omp parallel section, or pthread_create), have each one do its part of the computation, and store its intermediate result in separate process-wide variables (recall that global variables are automatically shared among the threads of a process, so the original thread will be able to see the changes made by the two spawned threads). To get the original thread to wait for the others to complete, synchronize (using a global barrier or a thread join operation) and combine the results in the original thread after the two spawned threads finish. If you're splitting the arrays in half and computing the dot product by multiplying corresponding elements and performing a global summation reduction on the halves, it should only be necessary to add the two intermediate results from the two spawned threads.
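As a rough illustration, a sketch of the OpenMP version might look like the following; partial_dot() is a placeholder for whatever each thread runs on its GPU (a kernel launch, a thrust::inner_product call, cublasSdot, etc.):

// Sketch: one OpenMP thread per GPU. Each thread binds to its own device,
// computes a partial dot product over its slice of the arrays, and the
// partials are combined with an OpenMP reduction. partial_dot() is a
// placeholder for the per-device computation.
#include <omp.h>
#include <cuda_runtime.h>

float multi_gpu_dot(const float *h_vecA, const float *h_vecB, int n, int numDevs)
{
    float total = 0.0f;
    #pragma omp parallel for num_threads(numDevs) reduction(+:total)
    for (int d = 0; d < numDevs; d++) {
        cudaSetDevice(d);                       // bind this CPU thread to GPU d
        int first = d * (n / numDevs);
        int count = (d == numDevs - 1) ? n - first : n / numDevs;
        total += partial_dot(h_vecA + first, h_vecB + first, count);   // placeholder
    }
    return total;
}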
You can also use MPI or fork, in which case communication can be done in a way similar to network programming: pipes/sockets (for fork), or communication and synchronization via (blocking) sends and receives (for MPI).
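For example, a sketch of the MPI variant, again with partial_dot_on_gpu() as a placeholder for the per-device computation, could be:

// Sketch: one MPI rank per GPU. Each rank selects its device by rank,
// computes a partial dot product, and the partials are summed on rank 0
// with MPI_Reduce. partial_dot_on_gpu() is a placeholder.
#include <mpi.h>
#include <cstdio>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    cudaSetDevice(rank);                                 // rank 0 -> GPU 0, rank 1 -> GPU 1
    float partial = partial_dot_on_gpu(rank, nranks);    // placeholder

    float total = 0.0f;
    MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("dot product = %f\n", total);

    MPI_Finalize();
    return 0;
}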