
Does cudaDeviceSynchronize() wait for kernels to finish only in the current CUDA context, or in all contexts?

I use CUDA 6.5 and 4 Kepler GPUs.

I use multithreading with the CUDA runtime API and access the CUDA contexts from different CPU threads (using OpenMP, but that does not really matter).

  1. When I call cudaDeviceSynchronize(), will it wait for the kernel(s) to finish only in the current CUDA context, i.e. the one selected by the latest call to cudaSetDevice(), or in all CUDA contexts?

  2. If it waits for kernel(s) to finish in all CUDA contexts, will it wait only in the CUDA contexts used by the current CPU thread (for example, CPU thread_0 would wait for GPUs 0 and 1), or in all CUDA contexts in general (so CPU thread_0 would wait for GPUs 0, 1, 2 and 3)?

Here is the code:

// Using OpenMP requires the compiler flag:
// MSVS option: -Xcompiler "/openmp"
// GCC option: -Xcompiler -fopenmp
#include <omp.h>
#include <cuda_runtime.h>

int main() {

    // execute two threads with different: omp_get_thread_num() = 0 and 1
    #pragma omp parallel num_threads(2)
    {
        int omp_threadId = omp_get_thread_num();

        // CPU thread 0
        if(omp_threadId == 0) {

            cudaSetDevice(0);
            kernel_0<<<...>>>(...);
            cudaSetDevice(1);
            kernel_1<<<...>>>(...);

            cudaDeviceSynchronize(); // which kernel(s) will this wait for?

        // CPU thread 1
        } else if(omp_threadId == 1) {

            cudaSetDevice(2);
            kernel_2<<<...>>>(...);
            cudaSetDevice(3);
            kernel_3<<<...>>>(...);

            cudaDeviceSynchronize(); // which kernel(s) will this wait for?

        }
    }

    return 0;
}
Asked Nov 10 '14 by Alex

People also ask

What is context in CUDA?

Most CUDA functions require a context. A CUDA context is analogous to a CPU process - it's an isolated container for all runtime state, including configuration settings and the device/unified/page-locked memory allocations. Each context has a separate memory space, and pointers from one context do not work in another.
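
As a small illustration (assuming at least two GPUs are present; the buffer names and sizes are made up), each cudaSetDevice() call makes a different primary context current, each allocation belongs to the context it was made in, and data has to be moved between contexts explicitly:

#include <cuda_runtime.h>

int main() {
    float *d_buf0 = nullptr, *d_buf1 = nullptr;
    size_t bytes = 1024 * sizeof(float);

    cudaSetDevice(0);            // device 0's primary context becomes current
    cudaMalloc(&d_buf0, bytes);  // this allocation lives in device 0's context

    cudaSetDevice(1);            // switch to device 1's primary context
    cudaMalloc(&d_buf1, bytes);  // a separate allocation in device 1's context

    // d_buf0 cannot simply be dereferenced by a kernel running on device 1;
    // copying between the two contexts needs an explicit API call:
    cudaMemcpyPeer(d_buf1, 1, d_buf0, 0, bytes);

    cudaFree(d_buf1);            // free in the current (device 1) context
    cudaSetDevice(0);
    cudaFree(d_buf0);            // free in device 0's context
    return 0;
}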

Which is the correct way to launch a CUDA kernel?

To launch a CUDA kernel, you specify the grid dimension and the block dimension from the host code. They are passed between the triple angle brackets of the launch syntax; for a minimal Hello World! kernel, two 1's between the angle brackets launch a single block containing a single thread.
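
A minimal sketch of such a launch (the kernel name and its printf are just an example):

#include <cstdio>

// __global__ marks a function that runs on the GPU and is launched from the host
__global__ void hello_kernel() {
    printf("Hello World! from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello_kernel<<<1, 1>>>();   // <<<grid dimension, block dimension>>>
    cudaDeviceSynchronize();    // wait for the kernel (and its printf) to complete
    return 0;
}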

What is the function of the __global__ qualifier in a CUDA program?

__global__ is a qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).

What are the three general sections of a CUDA program?

There are three key language extensions CUDA programmers can use: CUDA blocks, shared memory, and synchronization barriers. CUDA blocks contain a collection of threads. A block of threads can share memory, and multiple threads can pause until all threads reach a specified point of execution.
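
As a hedged sketch of those three extensions (the kernel and the block size of 256 are illustrative, not from the question): a block of threads stages data in shared memory and uses __syncthreads() as the barrier before reading elements written by other threads:

__global__ void reverse_block(int *data) {
    __shared__ int tile[256];            // shared memory visible to the whole block
    int t = threadIdx.x;

    tile[t] = data[t];                   // each thread of the block writes one element
    __syncthreads();                     // barrier: wait until every thread has written

    data[t] = tile[blockDim.x - 1 - t];  // now safe to read another thread's element
}

// launched as, for example: reverse_block<<<1, 256>>>(d_data);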


1 Answer

When I call cudaDeviceSynchronize(), will it wait for the kernel(s) to finish only in the current CUDA context, i.e. the one selected by the latest call to cudaSetDevice(), or in all CUDA contexts?

cudaDeviceSynchronize() syncs all streams in the current CUDA context only.


Note: cudaDeviceSynchronize() will only synchronize the host with the currently set GPU. If multiple GPUs are in use and all of them need to be synchronized, cudaDeviceSynchronize() has to be called separately for each one.

Here is a minimal example:

cudaSetDevice(0); cudaDeviceSynchronize();
cudaSetDevice(1); cudaDeviceSynchronize();
...

Source: Pawel Pomorski, slides from "CUDA on multiple GPUs", linked here.
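
Applied to the code in the question, a sketch of what CPU thread 0 would have to do so that both of its kernels are finished before it continues (thread 1 does the same for devices 2 and 3; the trivial kernels and <<<1, 1>>> configuration below just stand in for the ones elided in the question):

__global__ void kernel_0() { /* ... work on GPU 0 ... */ }
__global__ void kernel_1() { /* ... work on GPU 1 ... */ }

void thread_0_work() {
    cudaSetDevice(0);
    kernel_0<<<1, 1>>>();
    cudaSetDevice(1);
    kernel_1<<<1, 1>>>();

    // cudaDeviceSynchronize() only waits on the currently set device,
    // so each device has to be synchronized explicitly:
    cudaSetDevice(0);
    cudaDeviceSynchronize(); // waits for kernel_0 on GPU 0
    cudaSetDevice(1);
    cudaDeviceSynchronize(); // waits for kernel_1 on GPU 1
}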

Answered Oct 14 '22 by srodrb