We have a database encoded with Huffman coding. The aim here is to copy on the GPU it with its associated decoder; then on the GPU, decod the database and do stuff on this decoded database without copying back it on the CPU.
I am far to be a Huffman specialist, but the few I know shows that it seems to be an algorithm essentially based on control structures. With the basic algorithm, I am afraid that there will be a lot of serialized operations.
My 2 questions are:
I see other constraints, but they are not critical: - GPU could not be very efficient to handle tree: binary tree can be stored in a classical array - workload could be difficult to balance: we'll see after
The problem with Huffman coding is that you can't fast-forward. ie: you have to decode bit by bit, linearly.
As such it's not ideal for parallelism.
If you can decide on the encoding, you could perfectly encode chunk by chunk so as to be able to decode each chunk independently.
Yes you can do huffman decoding in parallel and so you can get advantages in a GPU - provided memory is not an issue.
For the discussion below I'll talk about the huffman tree, and the huffman output - the output are the compressed symbols that need to be looked up in the huffman tree to be decoded.
The huffman algorithm requires that you have a huffman tree for decoding - that tree can be large. You can get around this by using a small huffman tree that fits on local memory in a GPU - but this will affect the compression efficiency of the algorithm. E.g. you could limit the tree to the best 2^n nodes for as much as your gpu processors allow. ( e.g. use a tree limited to say 1024 nodes.
If you don't limit the huffman tree such that you can fit one copy in local storage on each gpu then you won't really get the parallelism you expect because all the gpu processors will be blocked accessing memory all reading the same shared tree.
The huffman output the symbols are packed in a variable number of bits. There is no way if you start in the middle of the output to know if you are on a symbol boudary. But you can create your own boundaries. For example in the output you could just force the alignment of symbols every x words to be word aligned. Then you know that you can start decoding on any multiple of x word in the output, and send that block to a GPU processing node, along with the appropriate tree.
You don't have to use just one tree- but one tree per block may be overkill also. That is if you have one tree per block you'll severly cut into the compression efficiency if the blocks are small.
So you can try looking at the similarity of blocks and encode similar blocks with the same tree, and store a tree index per block. E.g. you may have 10000 blocks in the output but just 50 1024-node trees. Then you send one block, and one tree to each GPU processing node to decode in parallel.
The key to making it fast is that each GPU processing node works on local memory only.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With