
Efficient bucket-sort on GPU

For a current OpenCL GPGPU project, I need to sort the elements of an array according to a key with 64 possible values. I need the final array to have all elements with the same key contiguous. It's sufficient to have an associative array new_index[old_index] as the output of this task.

I split the task into two parts. First, I count for each possible key (bucket) the number of elements with this key (that go into that bucket). Then I scan this array (generate a prefix sum), which gives me, for each bucket, the index range its elements will occupy, i.e. a "start" index per bucket.
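
A minimal sketch of this first step (assuming the keys have been precomputed into a keys buffer and bucket_count is zero-initialized; both names are mine):

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    // counting pass: one atomic increment per element
    __kernel void countBuckets(__global const uint *keys,
                               __global uint *bucket_count)
    {
        atomic_inc(&bucket_count[keys[get_global_id(0)]]);
    }

The scan over only 64 counters is cheap, so I can do it on the host:

    // host side (plain C): exclusive prefix sum over the counters
    uint sum = 0;
    for (int b = 0; b < NUM_BUCKETS; ++b) {
        bucket_start[b] = sum;
        sum += bucket_count[b];
    }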

The second step will then have to assign each element a new index. If I were to implement this on a CPU, the algorithm would be something like this:

    for (size_t e = 0; e < num_elements; ++e)
        new_index[e] = bucket_start[bucket(e)]++;

Of course, this doesn't work on the GPU: every work item would need read-write access to the bucket_start array, which effectively means synchronizing all work items, the worst thing we could do.

One idea is to do some of the computation within work groups. But I'm not sure how exactly this should be done, since I'm not experienced in GPGPU computing.

In global memory, we have the bucket start array initialized with the prefix sum as above. Access to this array is "mutexed" with an atomic int. (I'm new to this, so maybe I'm mixing up some terms here.)

Every work group is implicitly assigned a part of the input element array. It uses a local bucket array containing the new indices, relative to a (global) bucket start which we do not know yet. Once one of these "local buffers" is full, the work group has to write the local buffers into the global array. For this, it locks access to the global bucket start array, increments these values by the current local bucket sizes, unlocks, and can then write the result into the global new_index array (by adding the corresponding offset). This process is repeated until all assigned elements are processed.

Two questions arise:

  1. Is this a good approach? I know that reading from and writing to global memory is most likely the bottleneck here, especially since I'm trying to acquire synchronized access to (at least only a small part of) the global memory. But maybe there is a much better approach, perhaps using kernel decomposition. Note that I try to avoid reading data back from GPU to CPU between kernels (to avoid flushing the OpenCL command queue, which is also bad, as I was taught).

  2. In the algorithm design above, how do I implement the locking mechanism? Will something like the following code work? In particular, I expect problems when the hardware executes work items "truly parallel" in SIMD groups, like Nvidia warps. In my current code, all items of a work group would try to acquire the lock in a SIMD fashion. Should I restrict this to the first work item only, and use barriers to keep the rest locally in sync? (A sketch of that variant follows the code below.)

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    
    __kernel void putInBuckets(__global uint *mutex,
                               __global uint *bucket_start,
                               __global uint *new_index)
    {
        __local uint bucket_size[NUM_BUCKETS];
        __local uint bucket[NUM_BUCKETS][LOCAL_MAX_BUCKET_SIZE]; // local "new_index"
        __local uint l_bucket_start[NUM_BUCKETS]; // declared here because __local
                                                  // variables must live at kernel scope
    
        while (...)
        {
            // process a couple of elements locally until a local bucket is full
            ...
    
            // "lock"
            while(atomic_xchg(mutex, 1)) {
            }
    
            // "critical section"
            for (int b = 0; b < NUM_BUCKETS; ++b) {
                l_bucket_start[b] = bucket_start[b]; // where should we write?
                bucket_start[b] += bucket_size[b];   // update global offset
            }
    
            // "unlock"
            atomic_xchg(mutex, 0);
    
            // write to global memory by adding the offset
            for (...)
                new_index[...] = ... + l_bucket_start[b];
        }
    }
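
Regarding question 2, the "first work item only" variant I have in mind would look roughly like this, replacing the lock and critical section inside the while loop above (I'm still unsure whether this avoids the SIMD problem):

    // only the first work item of the group takes the lock
    if (get_local_id(0) == 0)
    {
        while (atomic_xchg(mutex, 1)) {}          // spin until acquired

        for (int b = 0; b < NUM_BUCKETS; ++b) {
            l_bucket_start[b] = bucket_start[b];  // where should we write?
            bucket_start[b] += bucket_size[b];    // update global offset
        }

        atomic_xchg(mutex, 0);                    // unlock
    }
    // the rest of the group waits here and then sees l_bucket_start
    barrier(CLK_LOCAL_MEM_FENCE);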
    
asked May 27 '13 by leemes


1 Answer

Firstly, never try to implement a locking algorithm on a GPU. It will deadlock and stall. This is because a GPU is a SIMD device and its threads don't execute independently as on a CPU. A GPU executes a set of threads called a warp/wavefront in lockstep, so if one thread in the wavefront stalls, it stalls all the other threads in the wavefront. If the unlocking thread is in a stalled wavefront, it will NOT execute and will never unlock the mutex.

Atomic operations are OK.
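
For only 64 buckets, the simplest lock-free variant of the questioner's CPU loop is one global atomic per element; the order within a bucket becomes non-deterministic, which is fine since only contiguity is required. A minimal sketch (assuming a keys buffer and bucket_start pre-filled with the prefix sums, as in the question):

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    __kernel void assignIndices(__global const uint *keys,
                                __global uint *bucket_start,
                                __global uint *new_index)
    {
        uint i = get_global_id(0);
        // atomic_inc returns the old value: a unique slot within the bucket
        new_index[i] = atomic_inc(&bucket_start[keys[i]]);
    }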

What you should consider is a lock-free approach. See this paper for an explanation and sample CUDA code: http://www.cse.iitk.ac.in/users/mainakc/pub/icpads2012.pdf

It describes lock-free hash tables, linked lists and skip lists, with some sample CUDA code.

A suggested approach is to create a two-level data structure.

The first level is a lock-free skip list. Each skip-list entry has, as its second level, a lock-free linked list for duplicate values, plus an atomic count of the number of entries.
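
Purely as an illustration (this layout is my sketch, not taken from the paper; MAX_LEVEL is a made-up tuning constant):

    // second level: lock-free list of duplicates, linked via CAS on 'next'
    typedef struct Entry {
        uint value;                               // payload, e.g. the old index
        volatile __global struct Entry *next;
    } Entry;

    // first level: lock-free skip list keyed by bucket
    typedef struct SkipNode {
        uint key;                                 // one of the 64 bucket keys
        volatile int count;                       // atomic count of entries
        volatile __global struct Entry *head;     // duplicates for this key
        volatile __global struct SkipNode *next[MAX_LEVEL]; // skip-list links
    } SkipNode;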

The insert method:

  1. Generate the bucket key (one of the 64 values).
  2. Find the key in the skip list.
  3. If not found, insert it into the skip list.
  4. Insert the data into the linked list.
  5. Increment the atomic counter for this bucket.

After insertion, prefix-sum all the counters of the skip-list buckets to find the offsets of the output.

answered by Tim Child