How can I run tasks on the CPU and a GPU device concurrently?

Tags:

c++

c++-amp

I have this piece of code that is as profiled, optimised and cache-efficient as I'm likely to get it with my level of knowledge. Conceptually, it runs on the CPU like this:

#pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < numberOfTasks; ++i)
  {
    result[i] = RunTask(i); // result is some array where I store the result of RunTask.
  }

It just so happens that RunTask() is essentially a set of linear algebra operations that operate repeatedly on the same, very large dataset, which makes it well suited to a GPU. So I would like to achieve the following:

  1. Offload some of the tasks to the GPU
  2. While the GPU is busy, process the rest of the tasks on the CPU
  3. For the CPU-level operations, keep my super-duper RunTask() function without having to modify it to comply with restrict(amp). I could of course design a restrict(amp)-compliant lambda for the GPU tasks (see the sketch after this list).
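
For reference, here's roughly what I imagine RunTasksAMP looking like. This is only a sketch; the typedef, the names and the kernel body are placeholders for my real code:

#include <amp.h>
using namespace Concurrency;

typedef float ResultType; // placeholder for my actual result type

array_view<ResultType, 1> RunTasksAMP(int start, int end)
{
  // array_view not bound to host data; it holds the results until synchronized
  array_view<ResultType, 1> results(end - start);
  results.discard_data(); // write-only, so skip the copy-in

  parallel_for_each(results.extent, [=](index<1> idx) restrict(amp)
  {
    // stand-in for a restrict(amp) equivalent of RunTask(start + idx[0])
    results[idx] = static_cast<ResultType>(start + idx[0]);
  });

  return results; // at this point the kernel has only been queued
}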

Initially I thought of doing the following:

// assume we know exactly how much time the GPU/CPU needs per task, and this is the 
// most time-efficient combination:
int numberOfTasks = 1000;
int ampTasks = 800;

// RunTasksAMP(start,end) sends a restrict(amp) kernel to the GPU, and stores the result in the
// returned array_view on the GPU
Concurrency::array_view<ResultType, 1> concurrencyResult = RunTasksAMP(0,ampTasks);

// perform the rest of the tasks on the CPU while we wait
#pragma omp parallel for schedule(dynamic)
  for (int i = ampTasks; i < numberOfTasks; ++i)
  {
    result[i] = RunTask(i); // RunTask() is thread-safe
  }

// do something to wait for the parallel_for_each in RunTasksAMP to finish.
concurrencyResult.synchronize();
//... now load the concurrencyResult array into the first elements of "result"
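
For that last step, I assume something like this would do (a sketch, assuming result is a std::vector<ResultType> with numberOfTasks elements; Concurrency::copy blocks until the queued kernel has finished):

// copy the GPU results into result[0..ampTasks)
Concurrency::copy(concurrencyResult, result.begin());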

But I doubt you could do something like this, because:

A call to parallel_for_each behaves as though it's synchronous

(http://msdn.microsoft.com/en-us/library/hh305254.aspx)

So is it possible to achieve requirements 1-3, or do I have to ditch number 3? Either way, how would I implement it?

asked Oct 03 '22 by quant

1 Answer

See my answer to will array_view.synchronize_asynch wait for parallel_for_each completion? for an explanation of why parallel_for_each can be thought of as a queuing or scheduling operation rather than a synchronous one. This explains why your code should satisfy requirements 1 and 2. It should also meet requirement 3, although you might want to consider having a single function that is restrict(cpu, amp), as this gives you less code to maintain (see the sketch below).
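
For example, a function marked restrict(cpu, amp) compiles for both targets, so the same body can be called from your OpenMP loop and from inside a parallel_for_each kernel. A minimal sketch (the function and its body are placeholders; the body must stick to the language subset that both restriction specifiers allow):

// One implementation, two targets: callable from host code and
// from inside a restrict(amp) kernel.
float Compute(float x) restrict(cpu, amp)
{
  return x * x + 1.0f; // placeholder for the real per-task math
}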

However, you may want to consider some of the performance implications of your approach.

Firstly, parallel_for_each only queues work; the data copies between host and GPU memory still use host resources (assuming your GPU is discrete and/or does not support direct copy). If your work on the host saturates the resources required to keep the GPU fed, then you may actually slow down your GPU calculation.

Secondly, many calculations that are data parallel and amenable to running on a GPU are so much faster there that the additional overhead of trying to run some of the work on the CPU doesn't result in an overall speedup. That overhead includes item one (above) and the cost of coordinating work on the host (scheduling threads, merging the results, etc.).

Finally, your implementation above does not take into account any variability in the time taken to run tasks on the GPU and CPU. It assumes that 800 AMP tasks will take as long as 200 CPU tasks, which may be true on some hardware but not on others. If one set of tasks takes longer than expected, your application will block and wait for the slower set to complete. You can avoid this using a master/worker pattern to pull tasks from a queue until there are no more available tasks (see the sketch below). With this approach, in the worst case your application has to wait for the final task to complete, not a whole block of tasks, and it will run with equal efficiency regardless of the relative CPU/GPU performance.
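
A minimal sketch of the master/worker idea (the chunk size is an assumption to tune, and RunTask/RunTasksAMP/result are the hypothetical helpers from your question): CPU workers and a GPU dispatcher pull task indices from a shared atomic counter until none remain, so neither side sits idle waiting for the other.

#include <algorithm>
#include <atomic>

std::atomic<int> nextTask(0);
const int numberOfTasks = 1000;
const int gpuChunk = 100; // tasks per GPU dispatch; tune for your hardware

// GPU dispatcher thread: claim a chunk, run it, repeat until the queue is empty.
void GpuWorker()
{
  for (;;)
  {
    int start = nextTask.fetch_add(gpuChunk);
    if (start >= numberOfTasks) break;
    int end = std::min(start + gpuChunk, numberOfTasks);
    auto chunkResult = RunTasksAMP(start, end); // queues the kernel
    chunkResult.synchronize(); // wait; copying into result[start..end) omitted here
  }
}

// CPU worker threads: claim one task at a time.
void CpuWorker()
{
  for (;;)
  {
    int i = nextTask.fetch_add(1);
    if (i >= numberOfTasks) break;
    result[i] = RunTask(i);
  }
}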

My book discusses examples of scheduling work across multiple GPUs using a master/worker pattern (the n-body case study) and a parallel queue (the cartoonizer case study). You can download the source code from CodePlex. Note that, based on discussions with the C++ AMP product team, it deliberately does not cover sharing work between the CPU and GPU, for the reasons outlined above.

answered Oct 13 '22 by Ade Miller