I have been trying to understand how data broadcasting works. With that in mind, I designed two kernels that differ only in how they read data from shared memory, and compared the time each spends fetching data. Here are the kernels:
```cuda
/*
@d_data: data in global memory (a 1 GB array of uint32 values),
@dummy: a dummy array to prevent nvcc optimizations,
@seed: a random value generated at runtime,
@d_shared_delay_random: output array for storing delay times.
*/
__global__ void shared_latency_random(int32_t* d_data, int32_t* dummy, uint32_t seed, float* d_shared_delay_random) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int32_t shared_data[THREAD_PER_BLOCK];
    shared_data[threadIdx.x] = d_data[idx] % THREAD_PER_BLOCK;
    __syncthreads();
    curandState state;
    curand_init(seed + idx, 0, 0, &state);
    int32_t start_idx = curand(&state) % THREAD_PER_BLOCK;
    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        start_idx = shared_data[start_idx];
    }
    clock_t end = clock();
    dummy[idx] = start_idx;
    d_shared_delay_random[idx] = static_cast<float>(end - start) / ITERATIONS;
}
```
```cuda
/*
@d_data: data in global memory (a 1 GB array of uint32 values),
@dummy: a dummy array to prevent nvcc optimizations,
@d_shared_delay_broadcast: output array for storing delay times.
*/
__global__ void shared_latency_broadcast(int32_t* d_data, int32_t* dummy, float* d_shared_delay_broadcast) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int32_t shared_data[THREAD_PER_BLOCK];
    shared_data[threadIdx.x] = d_data[idx] % THREAD_PER_BLOCK;
    __syncthreads();
    int value = 0;
    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        value = shared_data[value];
    }
    clock_t end = clock();
    dummy[idx] = value;
    d_shared_delay_broadcast[idx] = static_cast<float>(end - start) / ITERATIONS;
}
```
As you can see, `shared_latency_broadcast` exercises data broadcasting between the threads in a warp. In `shared_latency_random`, on the other hand, each thread generates a random index and then starts pointer chasing from that address. The averaged delay results are as follows:
| Random Access Delay (cycles) | Broadcast Access Delay (cycles) |
|---|---|
| 22.025 | 30.022 |
Why is there a difference? Isn't the broadcast delay supposed to be lower, considering this quote from Robert:

> Two requests to the same bank and the same 32-bit location in that bank do not create a bank conflict. They invoke the broadcast rule which says that the request for that bank can be handled in a single cycle, regardless of how many threads in the warp in that cycle are requesting data from that location.
I run these kernels with 16 blocks, each containing 64 threads. Here is the Compiler Explorer output for the kernels: https://godbolt.org/z/TP73bWEY4
The difference in timings is not caused by the access pattern.
Random access does not change much here. It may even encounter fewer bank conflicts (if all the random numbers happen to be < 32, for example, there are no bank conflicts at all).
However, for your use case I doubt that bank conflicts are a factor.
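For reference, whether a bank conflict occurs depends only on which banks the threads of a warp touch (current NVIDIA GPUs have 32 banks of 4-byte words). A minimal illustrative device-code sketch of the three cases, not taken from the kernels above:

```cuda
// Illustrative fragment inside a kernel; assumes 32 banks of 4-byte words.
__shared__ int32_t s[64];

// Conflict-free: consecutive threads hit consecutive banks.
int a = s[threadIdx.x];

// 2-way bank conflict: with stride 2, threads i and i+16 map to the
// same bank (e.g. threads 0 and 16 both hit bank 0), halving throughput.
int b = s[(threadIdx.x * 2) % 64];

// Broadcast, no conflict: every thread reads the same 32-bit word,
// which is served in a single transaction.
int c = s[0];
```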
There *is* a difference in the generated code, however.
The random version generates (line numbers are arbitrary):

```
shared_latency_random:
1   CS2R R4, SR_CLOCKLO                //start = clock()
2 .L_x_0:
3   LDS.U R7, [R7.X4]                  //load from shared mem, with implicit address calc
4   IADD3 R2, R2, 0x1, RZ              //i++
5   ISETP.NE.AND P0, PT, R2, 0x14, PT  //done yet?
6   @P0 BRA `(.L_x_0)                  //for loop
7   CS2R R2, SR_CLOCKLO                //end = clock()
```
Whereas the broadcast version generates:

```
shared_latency_broadcast:
10   CS2R R4, SR_CLOCKLO                //start = clock()
11 .L_x_2:
12   SHF.L.U32 R7, R7, 0x2, RZ          //address calc, note that R7 depends on the previous load
13   IADD3 R6, R6, 0x1, RZ              //i++
14   ISETP.NE.AND P0, PT, R6, 0x14, PT  //loop done yet?
15   LDS.U R7, [R7]                     //load from shared memory
16   @P0 BRA `(.L_x_2)
17   CS2R R2, SR_CLOCKLO                //end = clock()
```
The difference between random and broadcast access comes from the fact that the generated broadcast code carries an additional dependent instruction in the loop: the shift in line 12 reads R7, which depends on the shared load into R7 in line 15 of the previous iteration. This causes a massive slowdown of (29995 - 25998) / 1000 ≈ 4 cycles per loop iteration.
To be fair, I think this is a bug in the nvcc compiler. There is no need for this extra instruction to be there.
The random code has only a single instruction dependent on itself (the load in line 3). That means the broadcast version incurs the dependency delay twice per iteration, while the random version incurs it only once.
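One possible workaround (a hypothetical sketch, not verified against the generated SASS) is to chase byte offsets instead of element indices, so that each loaded value can feed the next load address directly, without a dependent shift in the loop body. Whether nvcc actually drops the extra instruction would need to be checked on Godbolt:

```cuda
// Hypothetical fragment inside a kernel like shared_latency_broadcast:
// store byte offsets (index * 4) in shared memory so the chased value
// is already a load address offset.
__shared__ int32_t shared_data[THREAD_PER_BLOCK];
shared_data[threadIdx.x] = (d_data[idx] % THREAD_PER_BLOCK) * sizeof(int32_t);
__syncthreads();

int value = 0;
clock_t start = clock();
for (int i = 0; i < ITERATIONS; i++) {
    // Each loaded value is a byte offset, so no dependent
    // multiply-by-4 is needed before the next load.
    value = *reinterpret_cast<int32_t*>(
        reinterpret_cast<char*>(shared_data) + value);
}
clock_t end = clock();
```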
If you comment out the `#pragma unroll 1`, nvcc will unroll the loop, getting rid of the loop overhead. This shows an even bigger time difference: (29977 - 22001) / 1000 ≈ 8 cycles. It turns out a loop overhead of 4 cycles was hiding inside the dependency delay.
See my version of your Godbolt for the demo: https://godbolt.org/z/zGzhTGqvW
But why?
The difference is that in the broadcast version all threads start with the same value for `value`: `int value = 0;`.
If I change the code block to:

```cuda
auto volatile value = idx; //was: auto volatile value = 0;
const long long volatile start = clock64(); //clock64() returns long long
//#pragma unroll 1
for (int i = 0; i < ITERATIONS; i++) {
    value = shared_data[value];
}
const long long volatile end = clock64();
```
then the total time for broadcast is 21986 vs. 22001 for random, a statistically insignificant difference.
Why is there a difference?
There should be no difference: one kernel has all threads hammering a single shared int, the other spreads the memory accesses out across threads. Neither of these access patterns has any penalty associated with it. However, due to the erroneous extra instruction nvcc generates for the broadcast version, that version takes more time.
volatile all the things when timing tiny snippets
Note that if you want your timings to be accurate, you need to make all relevant variables volatile; otherwise the compiler may reorder instructions, which can throw off your timings.
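As a compact sketch of the pattern (variable names are illustrative; `clock64()` returns `long long`):

```cuda
// Minimal timing template: volatile on the chased value and on the
// timestamps keeps the compiler from caching the value in a register
// or moving work across the clock reads.
int volatile v = threadIdx.x;      // per-thread start index
long long volatile t0 = clock64();
for (int i = 0; i < ITERATIONS; i++) {
    v = shared_data[v];            // load cannot be elided or hoisted
}
long long volatile t1 = clock64();
d_delay[idx] = static_cast<float>(t1 - t0) / ITERATIONS;
```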