
CUDA __threadfence

I'm writing code that must perform an inter-block synchronization (the sum of N dimensions and other memory transfer operations). When I increase the dimension of the problem, the result is wrong.

I am doing the synchronization with __threadfence(), and for the first dimensions (N < 192) it works, but if I insert another __threadfence() in the same piece of code, the result is correct for more dimensions.

Is one __threadfence() not sufficient to synchronize? Additionally, the resulting data is used in the same block.

The Programming Guide indicates that __threadfence() waits until all memory spaces (shared and global) are ready.

MariaDavila asked Dec 20 '22

2 Answers

There is no nice way to perform synchronization between blocks. You can either use a hacky approach with spin-waiting that eats up your GPU memory bandwidth, or you can terminate your kernel and start a new one.
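For the second option, the implicit barrier between two kernel launches is often all you need: launches in the same stream serialize, so a follow-up kernel sees everything the previous one wrote. A minimal sketch, with hypothetical kernels phaseA and phaseB:

// Hypothetical two-phase pattern: the boundary between the launches
// acts as a grid-wide barrier, because same-stream launches serialize.
phaseA<<<numBlocks, threadsPerBlock>>>(d_data, N); // every block finishes here...
phaseB<<<numBlocks, threadsPerBlock>>>(d_data, N); // ...before any of this starts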

__threadfence() is not for synchronization between blocks. __threadfence() is used to halt the current thread until all of its previous writes to shared and global memory are visible to other threads. It does not halt other threads, though, nor does it affect their progress!
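To make the distinction concrete, here is a minimal sketch of the spin-waiting pattern mentioned above (the names result and done are hypothetical): __threadfence() orders the producer's writes so the data is visible before the flag flips, but the actual waiting is done by an explicit spin loop, not by the fence.

__device__ float result;      // hypothetical value produced by block 0
__device__ int   done = 0;    // hypothetical "published" flag

__global__ void handoff(void)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        result = 42.0f;        // produce the value
        __threadfence();       // make the write visible device-wide...
        atomicExch(&done, 1);  // ...before publishing the flag
    } else if (threadIdx.x == 0) {
        // The fence never pauses these threads; they must spin on
        // their own (and can deadlock if block 0 is never resident).
        while (atomicAdd(&done, 0) == 0) { }
        // result is now guaranteed to be visible here
    }
}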

You can check these questions:

  • cuda block synchronization
  • CUDA __threadfence()
CygnusX1 answered Dec 29 '22


Cooperative groups will allow for synchronization between different blocks in the same kernel. It's really easy to use now, too.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// and then in your code
cg::grid_group grid = cg::this_grid();
grid.sync(); // All threads in all blocks must run this line

Every thread in every block in the entire kernel must run that grid.sync(). Execution of the next line will continue only after all threads have run that one.
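One caveat: grid.sync() is only valid if the kernel is launched through the cooperative-launch API and all of its blocks can be resident on the device at once; the usual <<<...>>> syntax will not do. A minimal sketch, with a hypothetical kernel scaleThenUse (compile with -rdc=true for a compute capability 6.0+ device):

#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

// Every thread must reach grid.sync(), even threads with no work,
// or the barrier deadlocks.
__global__ void scaleThenUse(float *data, int n)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
    grid.sync();               // grid-wide barrier
    // all writes above are now visible to every thread below
}

void launch(float *d_data, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // must fit on the device
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)scaleThenUse,
                                dim3(blocks), dim3(threads), args, 0, 0);
}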

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups

Eyal answered Dec 29 '22