
Multi-GPU programming strategies using CUDA

Tags:

cuda

I need some advice on a project that I am going to undertake. I plan to run simple kernels (yet to decide, but I am leaning toward embarrassingly parallel ones) on a multi-GPU node using CUDA 4.0, following the strategies listed below. The intention is to profile the node by launching kernels under the different strategies that CUDA provides in a multi-GPU environment.

  1. Single host thread - multiple devices (shared context)
  2. Single host thread - concurrent execution of kernels on a single device (shared context); a streams sketch follows this list
  3. Multiple host threads - (Equal) Multiple devices (independent contexts)
  4. Single host thread - Sequential kernel execution on one device
  5. Multiple host threads - concurrent execution of kernels on one device (independent contexts)
  6. Multiple host threads - sequential execution of kernels on one device (independent contexts)
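As an illustrative aside for strategy 2, here is a minimal sketch, assuming a trivial placeholder kernel (`dummyKernel` is hypothetical, not part of the question): kernels issued into distinct non-default streams on one device are eligible to overlap on hardware that supports concurrent kernel execution.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; any small independent kernel would do.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int nStreams = 4;
    const int n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *d_data[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_data[s], n * sizeof(float));
        cudaMemset(d_data[s], 0, n * sizeof(float));
    }

    // Kernels launched into different non-default streams are eligible
    // to run concurrently on devices that support it.
    for (int s = 0; s < nStreams; ++s)
        dummyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);

    cudaDeviceSynchronize();   // wait for all streams to drain

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(d_data[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```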

Am I missing any categories? What is your opinion of the test categories I have chosen? Any general advice with respect to multi-GPU programming is welcome.

Thanks,
Sayan

EDIT:

I thought that the previous categorization involved some redundancy, so I modified it.

asked Jul 01 '11 by Sayan


2 Answers

Most workloads are light enough on CPU work that you can juggle multiple GPUs from a single thread, but that only became easily possible starting with CUDA 4.0. Before CUDA 4.0, you would call cuCtxPopCurrent()/cuCtxPushCurrent() to change the context that is current to a given thread. But starting with CUDA 4.0, you can just call cudaSetDevice() to set the current context to correspond to a given device.

Your option 1) is a misnomer, though, because there is no "shared context" - the GPU contexts are still separate and device memory and objects such as CUDA streams and CUDA events are affiliated with the GPU context in which they were created.
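To make the CUDA 4.0 pattern described above concrete, here is a minimal sketch; `dummyKernel` and the 16-device cap are assumptions for illustration, not part of the original answer. One host thread drives every device by switching the current context with cudaSetDevice():

```cuda
#include <cuda_runtime.h>

#define MAX_DEVICES 16   // assumed cap, just to keep the sketch simple

// Hypothetical placeholder kernel.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount > MAX_DEVICES) deviceCount = MAX_DEVICES;

    float *d_data[MAX_DEVICES];

    // One host thread issues work to every device in turn; kernel
    // launches are asynchronous, so the loop does not serialize the GPUs.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);   // makes this device's context current
        cudaMalloc(&d_data[dev], n * sizeof(float));   // owned by device 'dev'
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data[dev], n);
    }

    // Synchronize and clean up each device; the contexts stay separate.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_data[dev]);
    }
    return 0;
}
```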

answered Sep 29 '22 by ArchaeaSoftware


"Multiple host threads - (equal) multiple devices, independent contexts" is a winner if you can get away with it, assuming that you can get truly independent units of work. This should be true, since your problem is embarrassingly parallel.
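A minimal sketch of that pattern, assuming CUDA 4.0 or later and a hypothetical `dummyKernel` (pthreads is just one possible threading choice): each host thread binds to its own device with cudaSetDevice() and never needs to synchronize with the others.

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

#define MAX_DEVICES 16   // assumed cap for the sketch

// Hypothetical placeholder kernel.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Each host thread binds to one GPU and runs a fully independent unit of work.
static void *worker(void *arg)
{
    int dev = *(int *)arg;
    const int n = 1 << 20;
    float *d_data;

    cudaSetDevice(dev);   // this thread's current context is now device 'dev'
    cudaMalloc(&d_data, n * sizeof(float));
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();   // no cross-thread synchronization required
    cudaFree(d_data);
    return NULL;
}

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount > MAX_DEVICES) deviceCount = MAX_DEVICES;

    pthread_t threads[MAX_DEVICES];
    int ids[MAX_DEVICES];

    for (int dev = 0; dev < deviceCount; ++dev) {
        ids[dev] = dev;
        pthread_create(&threads[dev], NULL, worker, &ids[dev]);
    }
    for (int dev = 0; dev < deviceCount; ++dev)
        pthread_join(threads[dev], NULL);
    return 0;
}
```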

Caveat emptor: I have not personally built a large-scale multi-GPU system. I have built a successful single-GPU system with three orders of magnitude acceleration relative to CPUs. Thus, the advice is a generalization of the synchronization costs I've seen, as well as discussions with colleagues who have built multi-GPU systems.

answered Sep 29 '22 by peakxu