CUDA, mutex and atomicCAS()

Tags:

Recently I started to develop on CUDA and faced with the problem with atomicCAS(). To do some manipulations with memory in device code I have to create a mutex, so that only one thread could work with memory in critical section of code.

The device code below runs on 1 block and several threads.

__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
    int i = threadIdx.x;
    ...

    do 
    {
        atomicCAS(mutex, 0, 1 + i);
    }
    while (*mutex != i + 1);

    //critical section
    //do some manipulations with objects in device memory

    *mutex = 0;

    ...
}

When first thread executes

atomicCAS(mutex, 0, 1 + i);

mutex is 1. After that first thread changes its status from Active to Inactive, and line

*mutex = 0;

is not executed. Other threads stays forever in loop. I have tried many variants of this cycle like while(){};, do{}while();, with temp variable = *mutex inside loop, even variant with if(){} and goto. But result is the same.

The host part of code:

...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));
cudaMemset(mutex, 0, sizeof(int));
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);

I use Visual Studio 2012 with CUDA 5.5.

The device is NVidia GeForce GT 240 with compute capability 1.2.

Thanks in advance.

UPD: After some time working on my diploma project this spring, I found a solution for critical section on cuda. This is a combination of lock-free and mutex mechanisms. Here is working code. Used it to impelment atomic dynamic-resizable array.

// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex) 
{
    bool isSet = false; 
    do 
    {
        if (isSet = atomicCAS(mutex, 0, 1) == 0) 
        {
            // critical section goes here
        }
        if (isSet) 
        {
            mutex = 0;
        }
    } 
    while (!isSet);
}

241

asked Jan 24 '14 20:01

Alex

1 Answers

The loop in question

do 
{
    atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);

would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.

However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp will execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleeps) for the divergent threads to finish. [1]

Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loops forever.

A potential solution is to make a local copy of the shared resource in question, let 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it will be shared by the threads belonging to the same block but not other blocks. We can use __syncthreads() to fine-control the access of this variable by the member threads.

[1] CUDA Best Practices Guide - Branching and Divergence

Avoid different execution paths within the same warp.

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

127

answered Sep 21 '22 02:09

Philip Cho

Related questions
                            
                                Is 'this' guaranteed to point to the start of an object in C++?
                            
                                C++ primary expressions - Is it primary expression or not?
                            
                                The ref- and cv-stripping property of `auto`.
                            
                                why (0+0i)^{0} == (nan, nan) in c++
                            
                                Why does conversion to reference interfere with conversion to bool?
                            
                                How to write VLC plugin that can interact with the operating system
                            
                                Converting glm::lookat matrix to quaternion and back
                            
                                data member 'vec' cannot be a member template
                            
                                Why are C++ exception specifications not checked at compile-time?
                            
                                How does getInstance() work?
                            
                                Proper way to generate a random float given a binary random number generator?
                            
                                How to friend a specific template specialization?
                            
                                Android only game in OpenGL: performance in C++ (NDK) vs Java (Dalvik) [closed]
                            
                                c++ wrapping types for semantic
                            
                                Difference between T[N] and T[] in template specializations?
                            
                                How do I capture a smart pointer in a lambda?
                            
                                Why does member `float x` get initialized with `0.` for the objects `a` and `b` in main()? [duplicate]
                            
                                Is there any way to get the caller of the CallExpr* in VisitCallExpr method with clang?
                            
                                Why do template template parameters with constraints require stricter arguments?
                            
                                LCOV branches at the end of a function

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

CUDA, mutex and atomicCAS()

Tags:

c++

atomic

mutex

cuda

Alex

People also ask

1 Answers

Philip Cho

Recent Activity

Donate For Us