CUDA streams and context

I am currently using an application that spawns a bunch of pthreads (Linux), each of which creates its own CUDA context. (I'm on CUDA 3.2 right now.)

The problem I am having is that each thread having its own context seems to cost a lot of memory on the GPU, something like 200 MB per thread, so this is really limiting me.

Can I simply create streams in the host thread, pass the stream references to the worker threads, and have each worker pass its stream handle to my CUDA library, so that they all work out of the same context?

Does a worker thread automatically share the CUDA context of its parent thread?

Thanks

asked Jul 25 '11 by Derek
People also ask

What is a CUDA context?

The context holds all the management data to control and use the device. For instance, it holds the list of allocated memory, the loaded modules that contain device code, the mapping between CPU and GPU memory for zero copy, etc.
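For illustration, a minimal driver-API sketch of a context being created and owning an allocation (error checks omitted for brevity):

    // Everything allocated while the context is current (memory, modules,
    // streams) belongs to it and is released when the context is destroyed.
    #include <cuda.h>

    int main(void) {
        CUdevice dev;
        CUcontext ctx;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);   // ctx is now current on this CPU thread

        CUdeviceptr p;
        cuMemAlloc(&p, 1 << 20);     // this 1 MB allocation is owned by ctx

        cuCtxDestroy(ctx);           // frees everything ctx still owns
        return 0;
    }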

What are streams in CUDA?

A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.
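A minimal runtime-API sketch of this (the scale kernel and sizes are made up): two streams whose launches may overlap, while work within each stream stays ordered.

    #include <cuda_runtime.h>

    __global__ void scale(float *p, int n) {           // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        scale<<<(n + 255) / 256, 256, 0, s1>>>(a, n);  // may run concurrently...
        scale<<<(n + 255) / 256, 256, 0, s2>>>(b, n);  // ...with this launch

        cudaStreamSynchronize(s1);                     // ordering is guaranteed
        cudaStreamSynchronize(s2);                     // only within each stream
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }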

Is cudaMemcpy blocked?

Most CUDA calls are synchronous (often called “blocking”). An example of a blocking call is cudaMemcpy().
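A sketch of the difference, assuming a pinned host buffer (required for cudaMemcpyAsync() to actually overlap with other work):

    #include <cuda_runtime.h>

    int main(void) {
        const size_t bytes = 1 << 20;
        float *h, *d;
        cudaMallocHost(&h, bytes);   // pinned host memory
        cudaMalloc(&d, bytes);

        // Blocking: the host waits here until the copy finishes.
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

        // Non-blocking: returns immediately; synchronize when the data is needed.
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
        cudaStreamSynchronize(s);

        cudaStreamDestroy(s);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }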

What are the three general sections of a CUDA program?

To execute any CUDA program, there are three main steps: Copy the input data from host memory to device memory, also known as host-to-device transfer. Load the GPU program and execute, caching data on-chip for performance. Copy the results from device memory to host memory, also called device-to-host transfer.
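A minimal end-to-end sketch of those three steps (the add1 kernel is made up for illustration):

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void add1(int *p, int n) {              // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;
    }

    int main(void) {
        const int n = 256;
        int h[256], *d;
        for (int i = 0; i < n; ++i) h[i] = i;

        cudaMalloc(&d, n * sizeof(int));
        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice); // 1. H2D copy
        add1<<<1, n>>>(d, n);                                      // 2. execute
        cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost); // 3. D2H copy

        printf("h[0] = %d\n", h[0]);   // prints 1
        cudaFree(d);
        return 0;
    }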


1 Answer

Each CUDA context does cost quite a bit of device memory, and their resources are strictly partitioned from one another. For example, device memory allocated in context A cannot be accessed by context B. Streams, too, are valid only in the context in which they were created.

The best practice is to create one CUDA context per device. By default, that context can be accessed only from the CPU thread that created it. If you want to access the context from other threads, call cuCtxPopCurrent() to pop it off the creating thread's context stack. The context can then be pushed onto any other CPU thread's current-context stack, and subsequent CUDA calls from that thread will reference it.

Context push/pop are lightweight operations, and as of CUDA 3.2 they can be done in CUDA runtime apps. So my suggestion would be to initialize the CUDA context, then call cuCtxPopCurrent() to make the context "floating" until some thread wants to operate on it. Consider the "floating" state to be the natural one: whenever a thread wants to manipulate the context, it should bracket its usage with cuCtxPushCurrent()/cuCtxPopCurrent().
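A sketch of that floating-context pattern, assuming a single device and a driver-API init path (the function names here are mine, not from the answer; error checks omitted):

    #include <cuda.h>

    static CUcontext g_ctx;            // one shared context for the device

    void init_context(void) {          // call once, e.g. from the main thread
        CUdevice dev;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&g_ctx, 0, dev);   // current on this thread...
        cuCtxPopCurrent(NULL);         // ...now "floating": owned by no thread
    }

    void worker_body(CUstream stream) {
        cuCtxPushCurrent(g_ctx);       // attach the shared context here
        /* launch kernels into `stream`, allocate memory, etc. */
        cuCtxPopCurrent(NULL);         // detach; the context floats again
    }

Note that in this era of CUDA a context can be current to only one CPU thread at a time, so concurrent workers would have to serialize their push/pop brackets, for example with a mutex.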

answered Oct 03 '22 by ArchaeaSoftware