 

Putting a for loop in a CUDA Kernel

Tags:

c++

c

cuda

Is it a bad idea to put a for loop in a CUDA kernel, or is it a common thing to do?

asked Feb 03 '23 15:02 by lin
1 Answer

It's common to put loops into kernels. It doesn't mean it's always a good idea, but it doesn't mean that it's not, either.

The general problem of deciding how to distribute your tasks and data effectively and exploit the available parallelism is hard and unsolved, especially in CUDA. Active research is being carried out to determine efficiently (i.e., without blindly exploring the parameter space) how to achieve the best results for a given kernel.

Sometimes, it can make a lot of sense to put loops into kernels. For instance, iterative computations on many elements of a large, regular data structure exhibiting strong data independence are ideally suited to kernels containing loops. Other times, you may decide to have each thread process many data points, if e.g. you'd not have enough shared memory to allocate one thread per task (this isn't uncommon when a large number of threads share a large amount of data, and by increasing the amount of work done per thread, you can fit all the threads' shared data in shared memory).
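To make the "each thread processes many data points" pattern concrete, here is a minimal sketch of a grid-stride loop, a common idiom for looping inside a kernel. The kernel name, the scaling operation, and the launch parameters are all illustrative, not taken from the question:

```cuda
// Grid-stride loop: each thread strides across the array by the total
// number of threads in the grid, so a single launch with a fixed grid
// size handles arrays of any length n.
__global__ void scaleArray(float *data, int n, float scale)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= scale;
}

// Example launch: a fixed, modest grid regardless of n.
// scaleArray<<<256, 256>>>(d_data, n, 2.0f);
```

Because consecutive threads still touch consecutive elements on each iteration, accesses stay coalesced, and the loop amortizes per-thread setup cost over several elements.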

Your best bet is to make an educated guess, test, profile, and revise as you need. There's a lot of room to play around with optimizations... launch parameters, global vs. constant vs. shared memory, keeping the number of registers cool, ensuring coalescing and avoiding memory bank conflicts, etc. If you're interested in performance, you should check out the "CUDA C Best Practices" and "CUDA Occupancy Calculator" available from NVIDIA on the CUDA 4.0 documentation page (if you haven't already).
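As one concrete starting point for picking launch parameters: toolkits newer than the CUDA 4.0 docs mentioned above added a runtime occupancy API that suggests a block size for you. A hedged sketch (the kernel here is a hypothetical placeholder):

```cuda
// Placeholder kernel used only to illustrate the launch-tuning call.
__global__ void myKernel(float *data, int n) { /* ... */ }

void launchTuned(float *d_data, int n)
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes occupancy
    // for this kernel (available in CUDA 6.5 and later).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       myKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;  // round up to cover n
    myKernel<<<gridSize, blockSize>>>(d_data, n);
}
```

Treat the suggested block size as an educated first guess, then profile and adjust as the answer describes.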

answered Feb 05 '23 06:02 by Patrick87