Is there any relationship between blockIdx and the order in which thread blocks are executed on the GPU device? 
My motivation is that I have a kernel in which multiple blocks will read from the same location in global memory, and it would be nice if these blocks would run concurrently (because L2 cache hits are nice). In deciding how to organize these blocks into a grid, would it be safe to say that blockIdx.x=0 is more likely to run concurrently with blockIdx.x=1 than with blockIdx.x=200? And that I should try to assign consecutive indices to blocks that read from the same location in global memory?
To be clear, I'm not asking about inter-block dependencies (as in this question) and the thread blocks are completely independent from the point of view of program correctness. I'm already using shared memory to broadcast data within a block, and I can't make the blocks any larger.
EDIT: Again, I am well aware that
Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series.
and the blocks are fully independent---they can run in any order and produce the same output. I am just asking if the order in which I arrange the blocks into a grid will influence which blocks end up running concurrently, because that does affect performance via L2 cache hit rate.
I found a writeup in which a CS researcher used micro-benchmarking to reverse engineer the block scheduler on a Fermi device:
http://cs.rochester.edu/~sree/fermi-tbs/fermi-tbs.html
I adapted his code to run on my GPU device (GTX 1080, with the Pascal GP104 GPU) and to randomize the runtimes.
Each block contains only 1 thread, and is launched with enough shared memory that only 2 blocks can be resident per SM. The kernel records its start time (obtained via clock64()) and then runs for a random amount of time (the task, appropriately enough, is generating random numbers using the multiply-with-carry algorithm).
The GTX 1080 is comprised of 4 Graphics Processing Clusters (GPCs) with 5 streaming multiprocessors (SM) each. Each GPC has its own clock, so I used the same method described in the link to determine which SMs belonged to which GPCs and then subtract a fixed offset to convert all of the clock values to the same time zone.
For a 1-D block grid, I found that the blocks were indeed launched in consecutive order:

We have 40 blocks starting immediately (2 blocks per SM * 20 SMs) and the subsequent blocks start when the previous blocks end.
For 2-D grids, I found the same linear-sequential order, with blockIdx.x being the fast dimension and blockIdx.y the slow dimension:

NB: I made a terrible typo when labeling these plots. All instances of "threadIdx" should be replaced with "blockIdx".
And for a 3-d block grid:

For a 1-D grid, these results match what Dr. Pai reported in the linked writeup. For 2-D grids, however, I did not find any evidence for a space-filling curve in block execution order, so this may have changed somewhere between Fermi and Pascal.
And of course, the usual caveats with benchmarking apply, and there's no guarantee that this isn't specific to a particular processor model.
For reference, here's a plot showing the results for random vs. fixed runtimes:

The fact that we see this trend with randomized runtimes gives me more confidence that this is a real result and not just a quirk of the benchmarking task.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With