
High-performance prefix sum / scan function in CUDA: looking for an alternative to the Thrust and cuDPP libraries [closed]

Tags: cuda, thrust, cudpp

I'm looking for a high-performance multiscan / multi prefix-sum function (one that scans many rows in a single kernel execution) for my CUDA project.

I tried the one from the Thrust library, but it is far too slow. Thrust also crashes after being compiled with the nvcc debug flags (-g -G).

After my failure with Thrust I turned to the cuDPP library, which used to be part of the CUDA toolkit. cuDPP's performance is really good, but the library is not up to date with the latest CUDA 5.5, and the memory checker reports global memory violations in the cudppMultiScan() function while debugging. (CUDA 5.5, Nsight 3.1, Visual Studio 2010, GTX 260, cc 1.3)

Does anybody have any idea what to use instead of these two libraries?

R.

user1946472 asked Sep 01 '13 16:09


2 Answers

These libraries, Thrust especially, try to be as generic as possible, whereas optimization often requires specialization. For example, a specialized version of an algorithm can use shared memory for fundamental types (such as int or float), while the generic version cannot. Sometimes the specialization you need for a particular situation is simply missing.

It is a good idea to use these well-tested generic libraries as much as possible, but for performance-critical sections your own implementation is an option worth considering.

In your situation you want many scans to run in parallel, one per row. A good implementation would not launch a separate scan for each row: a single kernel call would process all elements of all rows simultaneously. From its index, each thread can determine which row it is processing and ignore any data outside that row.
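
A minimal sketch of that idea, under some assumptions of mine: the data is a row-major matrix of int, one thread block handles one row, and the names multiRowInclusiveScan and scanRows are hypothetical. It uses a simple Hillis-Steele scan in shared memory rather than a tuned, work-efficient one, so treat it as a starting point, not as a replacement for cudppMultiScan().

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed tile width)

// Inclusive prefix sum of every row of a row-major matrix in ONE launch.
// blockIdx.x selects the row; the block walks its row in tiles of
// blockDim.x elements and carries the running total from tile to tile.
// Requires blockDim.x <= BLOCK.
__global__ void multiRowInclusiveScan(const int* in, int* out, int rowLen)
{
    __shared__ int tile[BLOCK];
    __shared__ int carry;

    const int* rowIn  = in  + (size_t)blockIdx.x * rowLen;
    int*       rowOut = out + (size_t)blockIdx.x * rowLen;

    if (threadIdx.x == 0) carry = 0;
    __syncthreads();

    for (int base = 0; base < rowLen; base += blockDim.x) {
        int col = base + threadIdx.x;
        // Threads past the end of the row contribute the identity (0).
        tile[threadIdx.x] = (col < rowLen) ? rowIn[col] : 0;
        __syncthreads();

        // Hillis-Steele inclusive scan of the tile in shared memory.
        for (unsigned int offset = 1; offset < blockDim.x; offset <<= 1) {
            int addend = (threadIdx.x >= offset) ? tile[threadIdx.x - offset] : 0;
            __syncthreads();
            tile[threadIdx.x] += addend;
            __syncthreads();
        }

        if (col < rowLen) rowOut[col] = tile[threadIdx.x] + carry;
        __syncthreads();  // every thread has read carry before it changes
        if (threadIdx.x == blockDim.x - 1) carry += tile[threadIdx.x];
        __syncthreads();
    }
}

// Hypothetical host helper: copies the matrix to the device, scans all rows
// with a single kernel launch, and copies the result back.
// numRows must not exceed the grid x-dimension limit of the device.
void scanRows(const int* hIn, int* hOut, int numRows, int rowLen)
{
    size_t bytes = (size_t)numRows * rowLen * sizeof(int);
    int* dIn = 0;
    int* dOut = 0;
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    multiRowInclusiveScan<<<numRows, BLOCK>>>(dIn, dOut, rowLen);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}
```

Calling scanRows(hostIn, hostOut, numRows, rowLen) scans every row independently; rows never mix because each block only ever indexes its own row. The Hillis-Steele step does more additions than a work-efficient scan, so a tuned version would probably replace that inner loop first.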

Built on top of a generic scan, this specialization requires a functor that returns an absorbing value at row boundaries so that rows do not mix. Even so, a carefully written implementation of your own would likely be much faster.
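
If you want to stay with a generic library for a correctness baseline, one way to keep rows from mixing is a segmented scan keyed on the row index. The sketch below uses thrust::inclusive_scan_by_key, which restarts the sum whenever the key changes. This is my reading of the functor idea, not necessarily the exact design meant above; RowOfIndex and scanRowsByKey are hypothetical names, and since Thrust was reported as too slow in the question, this is meant as a reference result to check a hand-written kernel against.

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Maps a flat element index to its row index. Consecutive elements with
// the same key form one segment, so the scan restarts at each new row.
struct RowOfIndex
{
    typedef int result_type;
    int rowLen;
    explicit RowOfIndex(int len = 1) : rowLen(len) {}
    __host__ __device__ int operator()(int i) const { return i / rowLen; }
};

// Hypothetical helper: in-place inclusive scan of each row of a
// row-major matrix stored in a device_vector.
void scanRowsByKey(thrust::device_vector<int>& data, int rowLen)
{
    typedef thrust::counting_iterator<int> Counter;
    typedef thrust::transform_iterator<RowOfIndex, Counter> KeyIter;

    const int n = static_cast<int>(data.size());
    // Keys are computed on the fly, so no extra key array is stored.
    KeyIter keys(Counter(0), RowOfIndex(rowLen));

    thrust::inclusive_scan_by_key(keys, keys + n, data.begin(), data.begin());
}
```

For a 2 x 3 matrix filled with ones, each row becomes 1, 2, 3 after the call.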

a.lasram answered Sep 18 '22 04:09


To write your own prefix scan, you may refer to

  1. The scan example of the CUDA SDK;
  2. Chapter 13 of N. Wilt, "The CUDA Handbook";
  3. Chapter 6 of S. Cook, "CUDA Programming, A Developer's Guide to Parallel Computing with GPUs";
  4. Parallel Prefix Sum (Scan) with CUDA.

To do a multi prefix-sum you can launch the same kernel once per row, process all rows in a single launch (as suggested by a.lasram), or try to achieve concurrency between per-row launches with CUDA streams, although I do not know if this will work effectively on your card.
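
A minimal sketch of the streams variant, with hypothetical names (scanOneRow, scanRowsWithStreams) and a deliberately simple single-block scan kernel standing in for a real one. Rows are assigned to streams round-robin. As far as I know, concurrent kernel execution needs compute capability 2.0 or higher, so on a cc 1.3 card such as the GTX 260 these launches would most likely run one after another.

```cuda
#include <cuda_runtime.h>

#define SCAN_THREADS 256  // threads per block (assumed)

// Single-block inclusive scan of one row of n ints. Stand-in kernel:
// Hillis-Steele scan per tile plus a running carry between tiles.
__global__ void scanOneRow(const int* in, int* out, int n)
{
    __shared__ int tile[SCAN_THREADS];
    __shared__ int carry;

    if (threadIdx.x == 0) carry = 0;
    __syncthreads();

    for (int base = 0; base < n; base += blockDim.x) {
        int i = base + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0;
        __syncthreads();

        for (unsigned int offset = 1; offset < blockDim.x; offset <<= 1) {
            int addend = (threadIdx.x >= offset) ? tile[threadIdx.x - offset] : 0;
            __syncthreads();
            tile[threadIdx.x] += addend;
            __syncthreads();
        }

        if (i < n) out[i] = tile[threadIdx.x] + carry;
        __syncthreads();
        if (threadIdx.x == blockDim.x - 1) carry += tile[threadIdx.x];
        __syncthreads();
    }
}

// Hypothetical helper: one kernel launch per row, spread round-robin over
// numStreams streams. dIn and dOut are device pointers to row-major data.
void scanRowsWithStreams(const int* dIn, int* dOut,
                         int numRows, int rowLen, int numStreams)
{
    cudaStream_t* streams = new cudaStream_t[numStreams];
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int r = 0; r < numRows; ++r) {
        size_t off = (size_t)r * rowLen;
        scanOneRow<<<1, SCAN_THREADS, 0, streams[r % numStreams]>>>(
            dIn + off, dOut + off, rowLen);
    }

    // Wait for all rows to finish, then release the streams.
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
    delete[] streams;
}
```

With many short rows, the per-launch overhead of this approach can easily dominate, which is one reason a single launch covering all rows is usually the more attractive design.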

Vitality answered Sep 20 '22 04:09
