 

How does CUDA Thrust compare to a raw kernel?

Tags:

cuda

gpu

thrust

I'm new to GPU programming and am unsure what would lead to the most efficient code. What are the pros/cons of using Thrust vs writing a custom kernel and managing memory yourself?

If it would help to elaborate on what my goal is: I have a large matrix where for each value I need to perform a few vector operations. I know I need dynamic parallelism for this task and currently have a custom kernel to go through the matrix that will call other kernels. I'm considering whether the kernel should be replaced with a Thrust call (e.g. thrust::for_each) and/or whether I should use Thrust inside the kernel for the vector operations.
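For reference, a rough sketch of the thrust::for_each variant I'm considering; PerElementOp and the per-column weighting below are placeholder stand-ins for my actual vector operations (and I'm assuming row-major storage):

    #include <thrust/device_vector.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>

    // Placeholder per-element operation: scale each matrix entry by a
    // per-column weight (stand-in for the real "few vector operations").
    struct PerElementOp {
        float* matrix;
        const float* weights;
        int cols;

        __device__ void operator()(int idx) const {
            int col = idx % cols;          // column of this element (row-major)
            matrix[idx] *= weights[col];   // the per-element work
        }
    };

    int main() {
        const int rows = 1024, cols = 1024;
        thrust::device_vector<float> matrix(rows * cols, 1.0f);
        thrust::device_vector<float> weights(cols, 2.0f);

        // One unit of work per matrix element; Thrust picks the launch config.
        thrust::for_each(thrust::counting_iterator<int>(0),
                         thrust::counting_iterator<int>(rows * cols),
                         PerElementOp{thrust::raw_pointer_cast(matrix.data()),
                                      thrust::raw_pointer_cast(weights.data()),
                                      cols});
        return 0;
    }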

asked Jul 09 '19 by Reg

People also ask

What is CUDA thrust?

Thrust is a C++ template library for CUDA based on the Standard Template Library (STL). Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface that is fully interoperable with CUDA C.
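As a concrete (if contrived) illustration of that high-level interface, a minimal sketch that sorts and sums a million integers on the GPU without writing any kernel code:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <cstdlib>

    int main() {
        // Fill a host vector with random data.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

        thrust::device_vector<int> d = h;  // copy host -> device in one assignment

        thrust::sort(d.begin(), d.end());  // parallel sort on the GPU
        long long sum = thrust::reduce(d.begin(), d.end(), 0LL);  // parallel reduction
        return static_cast<int>(sum & 1);
    }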

What are the CUDA kernel limitations?

There is a maximum number of instructions per kernel: 2 million before compute capability 2.0, and 512 million from CC 2.0 onward.

What are CUDA kernels?

A CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only once like a regular C/C++ function.
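For comparison with the Thrust example above, a minimal raw-kernel sketch; the add_one kernel and its launch configuration are illustrative, not from any particular source:

    #include <cuda_runtime.h>

    // Each of the K launched threads handles one element, in parallel.
    __global__ void add_one(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;  // guard: the grid may be larger than n
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));

        int block = 256;
        int grid = (n + block - 1) / block;  // enough blocks to cover all n elements
        add_one<<<grid, block>>>(d, n);      // K = grid * block threads
        cudaDeviceSynchronize();

        cudaFree(d);
        return 0;
    }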

Does CUDA support STD vector?

Not in device code. (There's an exception for constexpr code; in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, and that constexpr-ness is effectively limited.)


1 Answer

Over the last ~12 months I've gone from writing predominantly CUDA kernels to predominantly using Thrust, and then back to writing predominantly CUDA kernels. In general, writing your own CUDA kernels should provide better raw performance, but in simpler test cases the difference should be negligible.

Thrust mimics the C++ STL, so it carries many of the same upsides and downsides as the STL. Namely, it's designed to operate on vectors of data in a very generalized way. From that perspective, Thrust is better at some things than raw CUDA is, but it shouldn't be seen as a one-size-fits-all solution.

Thrust's main advantages are abstraction and portability: you don't have to think about block sizes, and it's easy to write functors that are equally applicable to data on the device or on the host, whereas a CUDA kernel can obviously only operate on device memory. It also ships a number of very useful algorithms; it's nice not having to write your own reduction or sort, as Thrust provides very efficient implementations of these.

But under the hood, your data access patterns might not easily match what Thrust was designed for, and Thrust tends to perform a lot of temporary memory allocations, which is often bad in a performance context. You can hack its memory management model to cache those temporary allocations, but I don't recommend actually doing that; just write kernels instead and take full control of your memory usage yourself.
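To make the "useful algorithms" point concrete, a rough sketch using thrust::transform_reduce, which fuses a transform with a reduction so you need neither a hand-written kernel nor an intermediate buffer of transformed values; AbsVal and the max-norm computation here are illustrative choices, not anything specific to the question:

    #include <thrust/device_vector.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>
    #include <cstdio>

    // A functor usable on either the host or the device.
    struct AbsVal {
        __host__ __device__ float operator()(float x) const {
            return x < 0.0f ? -x : x;
        }
    };

    int main() {
        thrust::device_vector<float> v(1 << 20, -3.0f);

        // Fused transform + reduction: no hand-written kernel and no explicit
        // temporary buffer for the transformed values.
        float max_abs = thrust::transform_reduce(v.begin(), v.end(),
                                                 AbsVal{}, 0.0f,
                                                 thrust::maximum<float>());
        std::printf("max |x| = %f\n", max_abs);
        return 0;
    }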

My preferred working mode right now is to use CUDA for almost everything, dipping into Thrust for specific algorithms (e.g. sort), for prototype code, or for code where I'd like the implementation to work equally well on the host and the device.
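As a sketch of that last case: a hypothetical Square functor marked __host__ __device__ can be dispatched to either processor via Thrust's execution policies, with no other changes to the code:

    #include <thrust/execution_policy.h>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // One functor, compiled for both sides.
    struct Square {
        __host__ __device__ float operator()(float x) const { return x * x; }
    };

    int main() {
        thrust::host_vector<float>   h(1000, 2.0f);
        thrust::device_vector<float> d(1000, 2.0f);

        // Same algorithm and same functor; the execution policy decides
        // whether it runs on the CPU or the GPU.
        thrust::transform(thrust::host,   h.begin(), h.end(), h.begin(), Square{});
        thrust::transform(thrust::device, d.begin(), d.end(), d.begin(), Square{});
        return 0;
    }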

answered Oct 18 '22 by Michael