Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
From my understanding, this implies that if there is no divergence within a warp (i.e., all threads of the warp are active), the threads should execute in lockstep.
Now, consider Listing 8 from this blog post, reproduced below:
unsigned tid = threadIdx.x;
int v = 0;
v += shmem[tid+16]; __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+8];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+4];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+2];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+1];  __syncwarp();
shmem[tid] = v;
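For reference, here is the minimal scaffolding I put around the fragment so that it compiles and runs; the 64-element shared array, the test data, and the one-warp launch configuration are my own assumptions, not from the blog post:

#include <cstdio>

__global__ void reduce_tail(int *out) {
    __shared__ int shmem[64];
    unsigned tid = threadIdx.x;   // launched with one full warp (32 threads)
    shmem[tid]      = 1;          // arbitrary test data
    shmem[tid + 32] = 1;
    __syncwarp();

    int v = 0;
    v += shmem[tid+16]; __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+8];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+4];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+2];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+1];  __syncwarp();
    shmem[tid] = v;

    if (tid == 0) *out = shmem[0];
}

int main() {
    int *out, result;
    cudaMalloc(&out, sizeof(int));
    reduce_tail<<<1, 32>>>(out);
    cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("shmem[0] = %d\n", result);
    cudaFree(out);
    return 0;
}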
Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without
any of the __syncwarp() calls.
This seems to contradict the statement I quoted above.
I would appreciate it if someone could clarify this confusion.
From my understanding, this implies that if there is no divergence within a warp (i.e., all threads of the warp are active), the threads should execute in lockstep.
If all threads in a warp are active for a particular instruction, then by definition there is no divergence. This has been true since day 1 of CUDA. In my view, it is not logical to connect your statement with the one you excerpted, because it describes a different case:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
This indicates that the active threads execute in lockstep. Divergence is still possible; the inactive threads (if any) are the ones divergent from the active threads. Note that both of these statements describe the CUDA SIMT model, and they have been correct and true since day 1 of CUDA. They are not specific to the Volta execution model.
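For example, here is a sketch of a divergent branch (a hypothetical kernel of my own, not from the whitepaper): while one path executes, the lanes on the other path are inactive, yet all active lanes still execute each instruction together:

__global__ void divergent(int *data) {
    unsigned tid = threadIdx.x;
    // Lanes 0-15 and lanes 16-31 of each warp take different paths,
    // so the warp diverges; the two paths execute one after the other,
    // with the other path's lanes inactive in the meantime.
    if (tid % 32 < 16)
        data[tid] *= 2;
    else
        data[tid] += 1;
}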
For the remainder of your question, I guess that instead of this:
I would appreciate it if someone could clarify this confusion.
you are asking:
Why is __syncwarp() needed?
Two reasons:
1. Thread synchronization: __syncwarp() synchronizes the threads in a warp and also provides a memory fence. The memory fence is needed in this case to prevent the compiler from "optimizing" shared memory locations into registers (see the first sketch after this list).
2. The Volta whitepaper is describing the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may ensure things that are not guaranteed by the programming model.
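To illustrate the memory-fence point, here is a sketch of the classic pre-Volta warp-reduction idiom (my own illustration, not a listing from the blog post). The volatile qualifier forces every access to shmem to actually touch shared memory, so the compiler cannot cache shmem[tid] in a register across steps. Note that this idiom also relied on implicit lockstep execution, which is exactly what is no longer guaranteed on Volta and later:

__device__ void warpReduce(volatile int *shmem, unsigned tid) {
    // volatile: each read and write below must go to shared memory,
    // so the compiler cannot keep shmem[tid] in a register.
    // Assumes shmem holds at least 48 valid elements, as in the
    // quoted listing, so the upper lanes' reads stay in bounds.
    shmem[tid] += shmem[tid + 16];
    shmem[tid] += shmem[tid + 8];
    shmem[tid] += shmem[tid + 4];
    shmem[tid] += shmem[tid + 2];
    shmem[tid] += shmem[tid + 1];
}

And in the spirit of the blog post's broader recommendation, the intra-warp stage can avoid shared memory (and with it both concerns) by using warp shuffles instead; a sketch, assuming a full 32-thread warp:

// Each step adds the value held by the lane `offset` positions higher;
// after the loop, lane 0 holds the sum over the whole warp.
for (int offset = 16; offset > 0; offset /= 2)
    v += __shfl_down_sync(0xffffffff, v, offset);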