Suppose I have some PTX that looks like this:
```
st.global.v4.b32 ...
bar.sync
```
I know that bar.sync will block execution until the st.global.v4.b32 is visible to all threads in the block. My question is: how long does it take until global stores are visible?
In general, global memory stores are "fire and forget": after store instructions hit the store buffer, threads can continue executing while the store buffer asynchronously commits data to global memory.
At what point in the global memory store operation does bar.sync allow threads to continue execution? In other words, what is the latency of a global memory store followed by a bar.sync?
Edit: I am asking in the context of A100 and H100 specifically.
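For reference, CUDA source along these lines produces that store-plus-barrier pattern (a sketch; StoreThenSync is just an illustrative name, and the exact PTX depends on compiler version and flags):

```
__global__ void StoreThenSync(int4* out) {
    // A 16-byte vector store typically lowers to an st.global.v4 instruction.
    out[threadIdx.x] = make_int4(0, 1, 2, 3);
    // __syncthreads() lowers to bar.sync 0.
    __syncthreads();
    // ... code that relies on the stores being visible block-wide ...
}
```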
> how long does it take until global stores are visible?
How long is a piece of string?
As you already note in your question, it depends on which cache your data lands in. Unless it is evicted, the data stays in the L1 cache, and the same holds at each level further down the cache hierarchy.
bar.sync is the PTX form of __syncthreads() and works at block level, i.e. on a single multiprocessor.
If you are lucky, the __syncthreads() can be very fast; how fast depends on the writes issued between successive syncs.
A somewhat recent paper for Ampere, *Demystifying the Nvidia Ampere Architecture through Microbenchmarking*, details the time it takes to execute common instructions on Ampere. For Hopper, see *Benchmarking and Dissecting the Nvidia Hopper GPU Architecture*.
They list the following latencies (in clock cycles):
| Location | Ampere (paper1) | Ampere (paper2) | Hopper (paper2) |
|---|---|---|---|
| Global memory | 290 | 466 | 479 |
| L2 cache | 200 | 262 | 263 |
| L1 cache | 33 | 38 | 41 |
| Shared Memory (ld/st) | 23/19 | ?/29 | ?/29 |
My own experience lines up more with paper1 than with paper2.
If you have access to such a machine, you can easily time it yourself using the following code:
```
#include <cstdint>
#include <cstdio>

__device__ uint64_t GetGlobalClock() {
    // volatile + "memory" clobber prevent the compiler from reordering the
    // clock read relative to the surrounding memory operations.
    // Do not use the 32-bit %clock register; it is very slow.
    uint64_t now;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(now) :: "memory");
    return now;
}
```
```
__global__ void TimeSync(int* data, int musthave) { // parameter defeats the optimizer
    // do init
    __syncthreads(); // wait for things to stabilize
    // Always run a loop: the first iteration always takes > 500 cycles.
    for (auto r = 0; r < 5; r++) {
        const auto StartTime = GetGlobalClock();
        data[threadIdx.x] = musthave * r; // the global store being timed
        __syncthreads();
        const auto EndTime = GetGlobalClock();
        if (threadIdx.x < 32) {
            printf("data[%i] = %i, Time = %i cycles\n",
                   threadIdx.x, data[threadIdx.x], int(EndTime - StartTime));
        }
    }
}
```
This will measure L1 latency.
You have to ensure eviction from L1 if you want to measure L2 latency, etc.
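If you just want the store to skip L1 altogether, an alternative to eviction is the .cg cache hint (cache at L2, bypass L1), which CUDA exposes as the __stcg() intrinsic. Here is a minimal variation of the kernel above, reusing GetGlobalClock() (TimeSyncL2 is my name for it, and I have not verified the numbers it produces):

```
__global__ void TimeSyncL2(int* data, int musthave) {
    __syncthreads();
    for (auto r = 0; r < 5; r++) {
        const auto StartTime = GetGlobalClock();
        // __stcg() stores with the .cg hint: cached in L2, bypassing L1.
        __stcg(&data[threadIdx.x], musthave * r);
        __syncthreads();
        const auto EndTime = GetGlobalClock();
        if (threadIdx.x < 32) {
            printf("Time = %i cycles\n", int(EndTime - StartTime));
        }
    }
}
```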
With the plain L1 version on my machine, an RTX 3070, I get:

```
...
data[25] = 20, Time = 43 cycles
data[26] = 20, Time = 43 cycles
data[27] = 20, Time = 43 cycles
...
```
So 33 cycles for the L1 write plus 8 cycles for the __syncthreads() plus 2 cycles for the clock64().
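The clock overhead is easy to calibrate on your own hardware: time two back-to-back reads (a sketch reusing GetGlobalClock() from above; ClockOverhead is just an illustrative name):

```
__global__ void ClockOverhead() {
    const auto t0 = GetGlobalClock();
    const auto t1 = GetGlobalClock();
    if (threadIdx.x == 0) {
        printf("back-to-back clock reads: %i cycles apart\n", int(t1 - t0));
    }
}
```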
If you want your threads to line up, but don't want to incur waits due to memory latency, you could use a spin-wait loop instead:
```
__shared__ uint32_t arrive_barrier;
atomicAnd(&arrive_barrier, 0); // reset; idempotent, so every thread may do it
constexpr auto blockwarps = 32; // assumes a full block of 1024 threads
constexpr auto arrive_mask = -1u >> (32 - blockwarps);
// do work
const auto warpid = threadIdx.x / 32;
// atomicOr returns the old value: spin until every warp's bit has been set
while (atomicOr(&arrive_barrier, 1u << warpid) != arrive_mask) {
    __nanosleep(1);
}
```
Obviously, this will only ensure memory safety with regard to shared memory.
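Wrapped up as a helper, the idea looks like this (a sketch; warp_spin_barrier is my name, not a CUDA API, and it assumes the barrier word was cleared before any warp can arrive):

```
#include <cstdint>

// Each warp sets its bit, then spins until all warps' bits are set.
// Reusing the barrier word requires clearing it again between uses.
template <uint32_t BLOCKWARPS>
__device__ void warp_spin_barrier(uint32_t* arrive_barrier) {
    constexpr uint32_t arrive_mask = ~0u >> (32 - BLOCKWARPS);
    const uint32_t warpid = threadIdx.x / 32;
    // atomicOr returns the pre-update value; leave only once every bit is set.
    while (atomicOr(arrive_barrier, 1u << warpid) != arrive_mask) {
        __nanosleep(1);
    }
}
```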