
OpenCL AES Parallelization

Tags: c++, aes, opencl

I am trying to write some code that does AES decryption for an SSL server. To speed it up I am trying to combine multiple packets together to be decrypted on the GPU at one time.

If I loop over each packet, I submit its kernel to the GPU, followed by a read that uses that kernel's event for its wait. I then collect the events for all of the reads and wait on them all at once, but it seems to just run one block at a time and then do the next. This is not what I would expect: if I queue all of the kernels, I would hope the driver tries to do as much work as possible in parallel.
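For reference, this is roughly what that loop looks like on the host (a simplified sketch: buffer creation and clSetKernelArg calls are omitted, and names like queue, kernels[i], resultBufs[i], and hostResults[i] are placeholders):

// Simplified sketch of the per-packet enqueue loop described above.
// Buffer setup is omitted; queue, kernels[i], resultBufs[i], and
// hostResults[i] are placeholder names.
std::vector<cl_event> readEvents(numPackets);

for (size_t i = 0; i < numPackets; ++i) {
    cl_event kernelDone;
    size_t globalSize = packetBlocks[i];  // one work-item per AES block

    clEnqueueNDRangeKernel(queue, kernels[i], 1, nullptr,
                           &globalSize, nullptr, 0, nullptr, &kernelDone);

    // Each read waits only on its own kernel's completion event.
    clEnqueueReadBuffer(queue, resultBufs[i], CL_FALSE, 0,
                        packetBlocks[i] * BlockSize, hostResults[i],
                        1, &kernelDone, &readEvents[i]);
}

// Wait for every read (and therefore every kernel) at once.
clWaitForEvents((cl_uint)numPackets, readEvents.data());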

Am I missing something? Do I have to specify the global work size to be the total number of blocks across all packets, and the kernel's local work size to be the number of blocks in each packet?

This is the code for my OpenCL kernel.

__kernel void decryptCBC( __global const uchar *rkey, const uint rounds, 
    __global const uchar* prev, __global const uchar *data, 
    __global uchar *result, const uint blocks ) {

    const size_t id = get_global_id( 0 );
    if( id >= blocks ) return; // valid ids run 0..blocks-1

    const size_t startPos = BlockSize * id;

    // Create Block
    uchar block[BlockSize];
    for( uint i = 0; i < BlockSize; i++) block[i] = data[startPos+i];

    // Calculate Result
    AddRoundKey( rkey, block, rounds );

    for( uint j = 1; j < rounds; ++j ){
        const uint round = rounds - j;
        InverseShiftRows( block );
        InverseSubBytes( block );
        AddRoundKey( rkey, block, round );
        InverseMixColumns( block );
    }

    InverseSubBytes( block );
    InverseShiftRows( block );
    AddRoundKey( rkey, block, 0 );

    // Store Result
    for( uint i = 0; i < BlockSize; i++ ) {
        result[startPos+i] = block[i] ^ prev[startPos+i];
    }
}

With this kernel, I can beat an 8-core CPU with 125 blocks of data in a single packet. To speed up multiple packets, I tried combining all of the input data into a single vector. The complication is that each kernel instance then needs to know where to look within the key material, which led to two extra arrays holding the number of rounds and the round-key offset for each packet. This turned out to be even slower than executing a separate kernel per packet.
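For illustration, the combined variant might have looked something like this (an assumed sketch, not the actual code: packetOfBlock, keyOffsets, and packetRounds are hypothetical host-built lookup tables):

// Assumed sketch of the combined-packet kernel described above.
// packetOfBlock maps each global block index to its packet, while
// keyOffsets and packetRounds are per-packet tables built on the host.
__kernel void decryptCBCCombined( __global const uchar *rkeys,
    __global const uint *keyOffsets, __global const uint *packetRounds,
    __global const uint *packetOfBlock,
    __global const uchar *prev, __global const uchar *data,
    __global uchar *result, const uint totalBlocks ) {

    const size_t id = get_global_id( 0 );
    if( id >= totalBlocks ) return;

    const uint p = packetOfBlock[id];           // which packet owns this block
    __global const uchar *rkey = rkeys + keyOffsets[p];
    const uint rounds = packetRounds[p];

    // ... body identical to decryptCBC, using rkey and rounds ...
    // The extra indirections above are the overhead that made this
    // version slower than one kernel launch per packet.
}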

asked Apr 30 '26 by James Sweet

2 Answers

Consider your kernel as a function doing the CBC work. As you've found, its chained nature means the CBC task itself is fundamentally serial. In addition, a GPU prefers to run groups of 16 threads with identical workloads. That's roughly the size of a single task within a multiprocessor core, of which you tend to have dozens; but the management system can only feed them a few of these tasks overall, and the memory system can rarely keep up. On top of that, loops are among the worst constructs to use in a kernel, because GPUs are not designed to do much control flow.

So, looking at AES: it operates on 16-byte blocks, but only in bytewise operations. That suggests the following structure (see the sketch after this list):

- Make the block your first dimension: every block should be worked over by 16 threads (probably the local work size, in OpenCL parlance), with get_local_id(0) telling each thread which byte it operates on.
- Transfer the block to local memory, where all threads can run in lockstep doing random accesses with very low latency, and unroll everything within an AES block operation.
- Synchronize with barrier(CLK_LOCAL_MEM_FENCE) in case a workgroup runs on a processor that could fall out of lockstep.
- The key can probably go into constant memory, as this can be cached.
- The block chaining might be an appropriate level to have a loop, if only to avoid reloading the previous block's ciphertext from global memory.
- Asynchronous storing of completed ciphertext using async_work_group_copy() may also help.

It's possible you could make each thread do more work by using vectors, but that probably won't help because of steps like ShiftRows.
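A skeleton of that layout might look like the following (a hedged sketch: the per-byte Inverse* round steps are presumed rewritten to one-byte-per-thread form and are elided here):

// Hedged skeleton of the 16-threads-per-block layout described above.
// One workgroup of 16 work-items walks one packet's CBC chain; the
// per-byte inverse round steps are presumed and not shown.
__kernel void decryptCBC16( __constant uchar *rkey, const uint rounds,
    __global const uchar *prev, __global const uchar *data,
    __global uchar *result, const uint blocks ) {

    const size_t byte = get_local_id( 0 );    // 0..15: this thread's byte
    __local uchar block[16];

    for( uint b = 0; b < blocks; ++b ) {      // loop over the CBC chain
        const size_t pos = 16 * b + byte;

        block[byte] = data[pos];              // one byte loaded per thread
        barrier( CLK_LOCAL_MEM_FENCE );

        // ... per-byte InverseShiftRows / InverseSubBytes / AddRoundKey /
        // InverseMixColumns rounds go here, with a barrier wherever a
        // step reads bytes written by other threads ...

        barrier( CLK_LOCAL_MEM_FENCE );
        result[pos] = block[byte] ^ prev[pos];
    }
}

Launched with a local work size of 16 and a global work size of 16 * numPackets, this gives one workgroup per packet.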

Basically, if any thread within a group of 16 threads (the exact width varies with architecture) takes a different control-flow path, your GPU stalls. And if there aren't enough such groups to fill the pipelines and multiprocessors, your GPU sits idle. Until you've very carefully optimized the memory accesses, it won't come close to CPU speeds, and even after that you'll need dozens of packets to process at once to avoid handing the GPU workgroups that are too small. The catch is that although the GPU can run thousands of threads, its control structure only handles a few workgroups at any time.

One other thing to beware of: when you use barriers in a workgroup, every thread in the workgroup must execute the same barrier calls. That means that even threads left idle (for instance, those that finished decrypting a shorter packet within a combined workgroup) must keep going through the loop, even if they make no memory accesses.
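In code, that means the loop bound must be uniform across the workgroup, so an idle thread skips the work but still reaches the barrier (a hedged illustration; maxBlocks and myBlocks are assumed per-workgroup values):

// Hedged illustration: every thread loops to the workgroup-wide
// maximum so that all threads execute the same barrier calls.
for( uint b = 0; b < maxBlocks; ++b ) {   // maxBlocks: longest packet here
    if( b < myBlocks ) {
        // ... this thread's share of the block work ...
    }
    barrier( CLK_LOCAL_MEM_FENCE );       // reached by every thread, always
}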

answered May 03 '26 by Yann Vernier


It's not entirely clear from your description, but I think there's some conceptual confusion.

Don't loop over each packet and start a new kernel for each one. You don't need to tell OpenCL to start a bunch of kernels. Instead, upload as many packets as you can to the GPU, then run the kernel just once. When you specify the global work size, that's how many instances of the kernel the GPU tries to run simultaneously.

You will need to program your kernels to each look in a different location in data you uploaded to find their packet. For example, if you were going to add two arrays into a third array, your kernel would look like this:

__kernel void vectorAdd(__global const int* a,
                        __global const int* b,
                        __global int* c) {
  int idx = get_global_id(0);
  c[idx] = a[idx] + b[idx];
}

The important part is that each kernel instance knows its index into the array by using its global ID. You'll want to do something similar.
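Host-side, that amounts to a single launch whose global work size covers every block of every packet (a sketch assuming all packets share one key schedule, as with a single connection; queue, kernel, and the buffer names are placeholders, and the argument order matches the decryptCBC kernel from the question):

// Sketch of one launch covering all packets; queue, kernel, and the
// cl_mem buffers are assumed to be created and filled already.
cl_uint totalBlocks = /* sum of blocks over all packets */;
size_t globalSize = totalBlocks;   // one work-item per AES block

clSetKernelArg(kernel, 0, sizeof(cl_mem),  &rkeyBuf);
clSetKernelArg(kernel, 1, sizeof(cl_uint), &rounds);
clSetKernelArg(kernel, 2, sizeof(cl_mem),  &prevBuf);
clSetKernelArg(kernel, 3, sizeof(cl_mem),  &dataBuf);
clSetKernelArg(kernel, 4, sizeof(cl_mem),  &resultBuf);
clSetKernelArg(kernel, 5, sizeof(cl_uint), &totalBlocks);

// One enqueue; the driver schedules the work-items itself.
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &globalSize, nullptr,
                       0, nullptr, nullptr);
clEnqueueReadBuffer(queue, resultBuf, CL_TRUE, 0,
                    totalBlocks * 16, hostResult, 0, nullptr, nullptr);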

answered May 03 '26 by Steve Blackwell


