
Interaction between global stores and `bar.sync`

Tags: cuda, ptx

Suppose I have some PTX that looks like this:

st.global.v4.b32 ...
bar.sync 0;

I know that bar.sync will block execution until the st.global.v4.b32 is visible to all threads in the block. My question is: how long does it take until global stores are visible?

In general, global memory stores are "fire and forget": after store instructions hit the store buffer, threads can continue executing while the store buffer asynchronously commits data to global memory.
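For concreteness, a CUDA C++ sketch of the pattern I mean (the kernel name is made up; nvcc typically emits st.global.v4.b32 for a 128-bit store like this):

```
__global__ void StoreThenSync(int4* out) {
    // One 128-bit store per thread; usually compiled to st.global.v4.b32
    out[threadIdx.x] = make_int4(1, 2, 3, 4);
    __syncthreads(); // compiled to bar.sync 0
}
```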

At what point in the global memory store operation does bar.sync allow threads to continue execution? Is it:

  • Once data hits the store buffer?
  • Once data hits L1 cache?
  • Once data hits L2 cache?
  • Once data hits DRAM?

In other words, what is the latency of a global memory store followed by bar.sync?

Edit: I am asking in the context of A100 and H100 specifically.

asked Dec 28 '25 by Elliot Gorokhovsky

1 Answer

how long does it take until global stores are visible?

How long is a piece of string?
As you already note in your question, it depends on which cache level the data lands in. Unless it is evicted, the data stays in the L1 cache, and so on down the cache hierarchy.

bar.sync is the PTX form of __syncthreads() and works at block level, i.e. within a single multiprocessor.
If you are lucky, the __syncthreads() itself can be very fast; the cost depends on the writes issued between successive syncs.
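For reference, the CUDA C++ intrinsic and its PTX form side by side (a sketch; bar.sync 0 is the barrier nvcc emits for __syncthreads()):

```
__syncthreads();                          // CUDA C++
asm volatile("bar.sync 0;" ::: "memory"); // hand-written PTX equivalent
```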

A somewhat recent paper for Ampere, "Demystifying the Nvidia Ampere Architecture through Microbenchmarking", details the time it takes to execute common instructions on Ampere. For Hopper, see "Benchmarking and Dissecting the Nvidia Hopper GPU Architecture".

They list the following store latencies (in clock cycles):

Location                 Ampere (paper 1)   Ampere (paper 2)   Hopper (paper 2)
Global memory            290                466                479
L2 cache                 200                262                263
L1 cache                 33                 38                 41
Shared memory (ld/st)    23/19              ?/29               ?/29

My own experience lines up more with paper1 than with paper2.

If you have access to such a machine, you can easily time it yourself using the following code:

__device__ uint64_t GetGlobalClock() {
    uint64_t start;
    // Do not use the 32-bit %clock register; reading it is much slower.
    // "volatile" plus the "memory" clobber keep the compiler from reordering
    // the clock read relative to surrounding memory operations.
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
    return start;
}

__global__ void TimeSync(int* data, int musthave) { // parameter to defeat the optimizer
    //do init
    __syncthreads(); // wait for things to stabilize
    for (auto r = 0; r < 5; r++) { // always run a loop; the first iteration takes > 500 cycles
        const auto StartTime = GetGlobalClock();
        data[threadIdx.x] = musthave * r;
        __syncthreads();
        const auto EndTime = GetGlobalClock();
        if (threadIdx.x < 32) {
            printf("data[%i] = %i, Time = %i cycles\n", threadIdx.x, data[threadIdx.x], int(EndTime - StartTime));
        }
    }
}

This will measure L1 latency.
You have to ensure eviction from L1 if you want to measure L2 latency, etc.
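One way to keep the timed store out of L1 is the .cg cache hint, which writes through to L2. A sketch of the modified loop body, using the __stcg() intrinsic (available since Kepler; it compiles to st.global.cg):

```
const auto StartTime = GetGlobalClock();
__stcg(&data[threadIdx.x], musthave * r); // st.global.cg: bypass L1, allocate in L2
__syncthreads();
const auto EndTime = GetGlobalClock();
```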

On my machine, an RTX 3070, I get:

...
data[25] = 20, Time = 43 cycles
data[26] = 20, Time = 43 cycles
data[27] = 20, Time = 43 cycles
...

So 33 cycles for the L1 write plus 8 cycles for the __syncthreads() plus 2 cycles for the clock64().

If you want your threads to line up, but do not want to stall on memory latency, you could perhaps use a spin-wait loop:

__shared__ uint32_t arrive_barrier;
if (threadIdx.x == 0) { arrive_barrier = 0; }
__syncthreads(); // ensure the barrier is zeroed before any warp sets its bit
constexpr auto blockwarps = 32;
constexpr auto arrive_mask = -1u >> (32 - blockwarps); // one bit per warp
//do work
const auto warpid = threadIdx.x / 32;
// atomicOr returns the *old* value, so the loop only exits once every warp has arrived
while (atomicOr(&arrive_barrier, 1u << warpid) != arrive_mask) {
    __nanosleep(1);
}

Obviously, this only provides memory safety with regard to shared memory; it does not by itself order global memory accesses.
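If the work between arrivals also includes global stores that other warps must then read, a block-level fence before arrival publishes those stores. A sketch, reusing the names from the snippet above:

```
// ... do work, including global stores ...
__threadfence_block(); // make prior shared and global writes visible to the block
const auto warpid = threadIdx.x / 32;
while (atomicOr(&arrive_barrier, 1u << warpid) != arrive_mask) {
    __nanosleep(1);
}
```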

answered Dec 30 '25 by Johan