 

How does CUDA Thrust compare to a raw kernel?

Tags:

cuda

gpu

thrust

I'm new to GPU programming and am unsure what would lead to the most efficient code. What are the pros/cons of using Thrust vs writing a custom kernel and managing memory yourself?

If it would help to elaborate on what my goal is: I have a large matrix where for each value I need to perform a few vector operations. I know I need dynamic parallelism for this task and currently have a custom kernel to go through the matrix that will call other kernels. I'm considering whether the kernel should be replaced with a Thrust call (e.g. thrust::for_each) and/or whether I should use Thrust inside the kernel for the vector operations.
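For reference, a rough sketch of the thrust::for_each variant I'm considering; PerElementOp and the per-column weighting below are placeholder stand-ins for my actual vector operations (and I'm assuming row-major storage):

    #include <thrust/device_vector.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>

    // Placeholder per-element operation: scale each matrix entry by a
    // per-column weight (stand-in for the real "few vector operations").
    struct PerElementOp {
        float* matrix;
        const float* weights;
        int cols;

        __device__ void operator()(int idx) const {
            int col = idx % cols;          // column of this element (row-major)
            matrix[idx] *= weights[col];   // the per-element work
        }
    };

    int main() {
        const int rows = 1024, cols = 1024;
        thrust::device_vector<float> matrix(rows * cols, 1.0f);
        thrust::device_vector<float> weights(cols, 2.0f);

        // One unit of work per matrix element; Thrust picks the launch config.
        thrust::for_each(thrust::counting_iterator<int>(0),
                         thrust::counting_iterator<int>(rows * cols),
                         PerElementOp{thrust::raw_pointer_cast(matrix.data()),
                                      thrust::raw_pointer_cast(weights.data()),
                                      cols});
        return 0;
    }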

asked Jul 09 '19 by Reg

People also ask

What is CUDA thrust?

Thrust is a C++ template library for CUDA based on the Standard Template Library (STL). Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface that is fully interoperable with CUDA C.
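As a concrete (if contrived) illustration of that high-level interface, a minimal sketch that sorts and sums a million integers on the GPU without writing any kernel code:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <cstdlib>

    int main() {
        // Fill a host vector with random data.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

        thrust::device_vector<int> d = h;  // copy host -> device in one assignment

        thrust::sort(d.begin(), d.end());  // parallel sort on the GPU
        long long sum = thrust::reduce(d.begin(), d.end(), 0LL);  // parallel reduction
        return static_cast<int>(sum & 1);
    }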

What are the CUDA kernel limitations?

There is a maximum number of instructions per kernel: 2 million before compute capability 2.0, and 512 million from CC 2.0 onward.

What are CUDA kernels?

A CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only once like a regular C/C++ function.
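For comparison with the Thrust example above, a minimal raw-kernel sketch; the add_one kernel and its launch configuration are illustrative, not from any particular source:

    #include <cuda_runtime.h>

    // Each of the K launched threads handles one element, in parallel.
    __global__ void add_one(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;  // guard: the grid may be larger than n
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));

        int block = 256;
        int grid = (n + block - 1) / block;  // enough blocks to cover all n elements
        add_one<<<grid, block>>>(d, n);      // K = grid * block threads
        cudaDeviceSynchronize();

        cudaFree(d);
        return 0;
    }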

Does CUDA support STD vector?

Not in device code. (There's an exception for constexpr code; in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, and that constexpr-ness is effectively limited.)


1 Answer

Over the last ~12 months I've gone from writing predominantly CUDA kernels to predominantly using Thrust, and then back to writing predominantly CUDA kernels. In general, writing your own CUDA kernels should provide better raw performance, but in simpler test cases the difference should be negligible.

Thrust mimics the C++ STL, so it carries many of the same upsides and downsides as the STL. Namely, it's designed to operate on vectors of data in a very generalized way. From that perspective, Thrust is better at some things than raw CUDA is, but it shouldn't be seen as a one-size-fits-all solution.

Thrust's main advantages are abstraction and portability: you don't have to think about block sizes, and it's easy to write functors that are equally applicable to data on the device or on the host, whereas a CUDA kernel can obviously only operate on device memory. It also ships a number of very useful algorithms; it's nice not having to write your own reduction or sort, as Thrust provides very efficient implementations of these.

But under the hood, your data access patterns might not easily match what Thrust was designed for, and Thrust tends to perform a lot of temporary memory allocations, which is often bad in a performance context. You can hack its memory management model to cache those temporary allocations, but I don't recommend actually doing that; just write kernels instead and take full control of your memory usage yourself.
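To make the "useful algorithms" point concrete, a rough sketch using thrust::transform_reduce, which fuses a transform with a reduction so you need neither a hand-written kernel nor an intermediate buffer of transformed values; AbsVal and the max-norm computation here are illustrative choices, not anything specific to the question:

    #include <thrust/device_vector.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>
    #include <cstdio>

    // A functor usable on either the host or the device.
    struct AbsVal {
        __host__ __device__ float operator()(float x) const {
            return x < 0.0f ? -x : x;
        }
    };

    int main() {
        thrust::device_vector<float> v(1 << 20, -3.0f);

        // Fused transform + reduction: no hand-written kernel and no explicit
        // temporary buffer for the transformed values.
        float max_abs = thrust::transform_reduce(v.begin(), v.end(),
                                                 AbsVal{}, 0.0f,
                                                 thrust::maximum<float>());
        std::printf("max |x| = %f\n", max_abs);
        return 0;
    }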

My preferred working mode right now is to use CUDA for almost everything, dipping into Thrust for specific algorithms (e.g. sort), for prototype code, or for code where I'd like the implementation to work equally well on the host and the device.
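As a sketch of that last case: a hypothetical Square functor marked __host__ __device__ can be dispatched to either processor via Thrust's execution policies, with no other changes to the code:

    #include <thrust/execution_policy.h>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // One functor, compiled for both sides.
    struct Square {
        __host__ __device__ float operator()(float x) const { return x * x; }
    };

    int main() {
        thrust::host_vector<float>   h(1000, 2.0f);
        thrust::device_vector<float> d(1000, 2.0f);

        // Same algorithm and same functor; the execution policy decides
        // whether it runs on the CPU or the GPU.
        thrust::transform(thrust::host,   h.begin(), h.end(), h.begin(), Square{});
        thrust::transform(thrust::device, d.begin(), d.end(), d.begin(), Square{});
        return 0;
    }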

answered Oct 18 '22 by Michael