Why launch a multiple of 32 number of threads in CUDA?

I took a course in CUDA parallel programming and have seen many examples of CUDA thread configurations where the number of threads needed is rounded up to the nearest multiple of 32. I understand that threads are grouped into warps, and that if you launch 1000 threads the GPU will round up to 1024 anyway, so why do it explicitly?

asked Oct 28 '14 by Michael

People also ask

What can limit a program from launching the maximum number of threads on a GPU?

Hardware limits the number of blocks in a single launch to 65,535. Hardware also limits the number of threads per block with which a kernel can be launched.

How many threads are there in Nvidia CUDA warp?

NVIDIA GPUs execute warps of 32 parallel threads using SIMT, which enables each thread to access its own registers, to load and store from divergent addresses, and to follow divergent control flow paths.

How many threads can parallel CUDA have?

There are 32 threads per warp. That is constant across all CUDA cards as of now.

How many threads does a core CUDA have?

A CUDA Streaming Multiprocessor executes threads in warps of 32. There is a maximum of 1024 threads per block (for our GPU) and a maximum of 1536 threads per multiprocessor (for our GPU).
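These per-GPU limits differ between devices, which is why the answers above say "for our GPU". As a minimal sketch (assuming device 0 and the standard CUDA runtime API), you can query them at run time:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        printf("Warp size:             %d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max grid size (x):     %d\n", prop.maxGridSize[0]);
        return 0;
    }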


1 Answer

The advice is generally given in the context of situations where you might conceivably choose various threadblock sizes to solve the same problem.

Let's take vector add as an example. Suppose my vector is of length 100000. I might choose to do this by launching 100 blocks of 1000 threads each. In this case each block occupies 32 warps (1024 thread slots), so it has 1000 active threads and 24 inactive threads. My average utilization of thread resources is 1000/1024 = 97.6%.

Now, what if I chose blocks of size 1024? I only need to launch 98 blocks. The first 97 of these blocks are fully utilized in terms of thread utilization - every thread is doing something useful. The 98th block has only 672 (out of 1024) threads doing something useful; the other 352 are explicitly made inactive by a thread check (if (idx < N)) or other construct in the kernel code. But my overall average utilization is 100000/100352 = 99.6%.
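As a rough sketch of that second configuration (the kernel name vecAdd and the use of managed memory are my own choices, not from the answer): the grid size is rounded up with ceiling division, and the if (idx < n) check idles the 352 surplus threads in the last block.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)              // surplus threads in the last block do nothing
            c[idx] = a[idx] + b[idx];
    }

    int main() {
        const int N = 100000;
        const int threadsPerBlock = 1024;
        // Ceiling division: (100000 + 1023) / 1024 = 98 blocks
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

        float *a, *b, *c;
        cudaMallocManaged(&a, N * sizeof(float));
        cudaMallocManaged(&b, N * sizeof(float));
        cudaMallocManaged(&c, N * sizeof(float));
        for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, N);
        cudaDeviceSynchronize();

        printf("c[0] = %f, c[%d] = %f\n", c[0], N - 1, c[N - 1]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Compiled with nvcc, this launches exactly the 98-block grid described above.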

So in the above scenario, it's better to choose "full" threadblocks, evenly divisible by 32.

If you are doing vector add on a vector of length 1000, and you intend to do it in a single threadblock (both may be bad ideas), then it does not matter whether you choose 1000 or 1024 for your threadblock size.
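To see why the two choices are equivalent in that case, note that a block's hardware footprint is its thread count rounded up to a whole number of warps; a tiny sketch of the arithmetic:

    #include <cstdio>

    // Threads are issued in warps of 32, so a block occupies
    // ceil(threads / 32) warps regardless of the exact thread count.
    int warpsPerBlock(int threads) { return (threads + 31) / 32; }

    int main(void) {
        printf("1000-thread block occupies %d warps\n", warpsPerBlock(1000)); // 32
        printf("1024-thread block occupies %d warps\n", warpsPerBlock(1024)); // 32
        return 0;
    }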

answered Sep 17 '22 by Robert Crovella