optimizing byte-pair encoding

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.

The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.

The runtime of my trivial implementation was less than stellar, however.

How can this be optimized? Is it possible to do it in a single pass?

asked Jan 19 '10 by Will

2 Answers

There is an O(n) version of byte-pair encoding which I describe here. I am getting a compression speed of ~200kB/second in Java.

answered Nov 15 '22 by Stefan Reich

This is a summary of my progress so far:

Googling found this little report that links to the original code and cites the source:

Philip Gage's article 'A New Algorithm for Data Compression', which appeared in 'The C Users Journal', February 1994 edition.

The links to the code on the Dr Dobb's site are broken, but that webpage mirrors them.

That code uses a hash table to track the used digraphs and their counts across each pass over the buffer, so as to avoid recomputing them from scratch each pass.
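For illustration, here is a minimal sketch of that bookkeeping. It is not Gage's actual code: a flat 64K array stands in for his hash table, which only works while the symbol values fit in a byte.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch only (not Gage's code): count every digraph in a block.
     * A flat 256*256 array stands in for the hash table; this works
     * while symbols fit in one byte. */
    static uint32_t pair_count[256 * 256];

    static void count_digraphs(const unsigned char *buf, size_t len)
    {
        memset(pair_count, 0, sizeof pair_count);
        for (size_t i = 0; i + 1 < len; i++)
            pair_count[buf[i] * 256 + buf[i + 1]]++;
    }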

My test data is enwik8 from the Hutter Prize.

|----------------|----------------|--------------------------------------------------|
| Implementation | Time (min.sec) | Notes                                            |
|----------------|----------------|--------------------------------------------------|
| bpev2          | 1.24           | The current version in the large text benchmark  |
| bpe_c          | 1.07           | The original version by Gage, using a hashtable  |
| bpev3          | 0.25           | Uses a list, custom sort, less memcpy            |
|----------------|----------------|--------------------------------------------------|

bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (of 4, the smallest count at which we gain a byte by compressing); this list is sorted and the first substitution is made.
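A sketch of what such a candidate list might look like, with hypothetical names of my own, building on the counting sketch above:

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical record for the candidate list: a digraph plus its
     * count within the current block. */
    typedef struct {
        unsigned char a, b;
        uint32_t count;
    } digraph_t;

    /* Sort most-frequent-first, so the best substitution is at the front. */
    static int by_count_desc(const void *x, const void *y)
    {
        const digraph_t *p = x, *q = y;
        return (q->count > p->count) - (q->count < p->count);
    }

    /* Collect every digraph occurring more than 'threshold' times
     * (4 in the answer: any less and the replacement cannot pay for
     * itself), then sort the candidates by count. */
    static size_t collect_digraphs(const uint32_t counts[256 * 256],
                                   uint32_t threshold,
                                   digraph_t *out, size_t max_out)
    {
        size_t n = 0;
        for (unsigned p = 0; p < 256 * 256 && n < max_out; p++) {
            if (counts[p] > threshold) {
                out[n].a = (unsigned char)(p >> 8);
                out[n].b = (unsigned char)(p & 0xFF);
                out[n].count = counts[p];
                n++;
            }
        }
        qsort(out, n, sizeof out[0], by_count_desc);
        return n;
    }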

As the substitutions are made, the statistics are updated; typically only around 10 or 20 digraphs change each pass; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than always re-sorting the whole digraph list each pass, since the list is nearly sorted.
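My reading of that incremental update, as a hedged sketch rather than the answer's actual code: when one occurrence of "ab" is replaced by a new code z, only the digraphs touching that position change, so the counts can be adjusted in place instead of being recounted.

    #include <stdint.h>

    /* Sketch of the incremental bookkeeping (my reading, not the
     * author's code): replacing one occurrence of "ab" with the new
     * code z only changes the digraphs that touch that position.
     * x and y are the bytes just before and after the pair, when
     * they exist. */
    static void update_counts(uint32_t counts[256 * 256], unsigned char z,
                              int has_left, unsigned char x,
                              unsigned char a, unsigned char b,
                              int has_right, unsigned char y)
    {
        counts[a * 256 + b]--;           /* one "ab" is gone      */
        if (has_left) {
            counts[x * 256 + a]--;       /* "xa" disappears...    */
            counts[x * 256 + z]++;       /* ...and becomes "xz"   */
        }
        if (has_right) {
            counts[b * 256 + y]--;       /* "by" disappears...    */
            counts[z * 256 + y]++;       /* ...and becomes "zy"   */
        }
    }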

The original code copied data back and forth between 'tmp' and 'buf' byte buffers; bpev3 just swaps the buffer pointers, which alone is worth about 10 seconds of runtime.
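A sketch of that swap, with hypothetical names:

    /* Instead of memcpy'ing the pass output from tmp back into buf,
     * just exchange the pointers: the next pass reads from what was
     * tmp and writes into what was buf. */
    static void swap_buffers(unsigned char **buf, unsigned char **tmp)
    {
        unsigned char *t = *buf;
        *buf = *tmp;
        *tmp = t;
    }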

Given that the buffer-swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version, I think the hashtable is of arguable value, and that a list is a better structure for this problem.

It's still multi-pass though, and so it's not a generally competitive algorithm.

If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of its larger block sizes, it performs better than my bpe on enwik9. Also, the performance gap between the hash tables and my lists is much closer there - I put that down to the march=PentiumPro that the LTCB uses.

There are of course occasions where it is suitable and used; Symbian use it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.

answered Nov 15 '22 by Will