I have a low-resource embedded system with a graphical user interface. The interface requires font data. To conserve read-only memory (flash), the font data needs to be compressed. I am looking for an algorithm for this purpose.
Properties of the data to be compressed
- transparency data for a rectangular pixel map with 8 bits per pixel
- there are typically around 200..300 glyphs in a font (a typeface sampled at a certain size)
- each glyph is typically from 6x9 to 15x20 pixels in size
- there are a lot of zeros ("no ink") and somewhat fewer 255s ("completely inked"); otherwise the distribution of octets is quite even due to the nature of anti-aliasing
Requirements for the compression algorithm
- The important metric is the size of the compressed data plus the size of the decompression code (as they will reside in the same limited memory).
- There is very little RAM available for the decompression; it is possible to decompress the data for a single glyph into RAM but not much more.
- To make things more difficult, the algorithm has to be very fast on a 32-bit microcontroller (ARM Cortex-M core), as the glyphs need to be decompressed while they are being drawn onto the display. Ten or twenty machine cycles per octet is OK; a hundred is certainly too much.
- To make things easier, the complete corpus of data is known a priori, and there is a lot of processing power and memory available during the compression phase.
Conclusions and thoughts
- The naïve approach of just packing each octet by some variable-length encoding does not give good results due to the relatively high entropy.
- Any algorithm taking advantage of data decompressed earlier seems to be out of the question, as it is not possible to store the decompressed data of other glyphs. This makes LZ algorithms less efficient, as they can only reference a small amount of previously decompressed data.
- Constraints on the processing power seem to rule out most bitwise operations, i.e. decompression should handle the data octet by octet. This makes Huffman coding difficult and arithmetic coding impossible.
- The problem seems to be a good candidate for static dictionary coding, as all data is known beforehand and the data is somewhat repetitive in nature (different glyphs share the same shapes); a sketch of what such a decoder could look like follows this list.
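Roughly, the sort of decoder I mean (the table layout and names are only illustrative, not a finished format): every compressed byte indexes a table of short pixel strings chosen offline, and a literal is simply a one-byte entry, so decoding never touches individual bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative static-dictionary decoder: each compressed byte indexes a
 * table of short pixel strings chosen offline.  A literal is just a
 * one-byte entry, so the decoder works strictly octet by octet. */
typedef struct {
    const uint8_t  *data;  /* all entry bytes, concatenated            */
    const uint16_t *off;   /* offset of each of the 256 entries in data */
    const uint8_t  *len;   /* length of each entry, in octets           */
} static_dict;

static size_t dict_decode(const static_dict *d,
                          const uint8_t *in, size_t in_len, uint8_t *out)
{
    uint8_t *p = out;
    for (size_t i = 0; i < in_len; i++) {
        const uint8_t *entry = d->data + d->off[in[i]];
        for (uint8_t k = 0; k < d->len[in[i]]; k++)
            *p++ = entry[k];          /* copy the entry into the glyph buffer */
    }
    return (size_t)(p - out);         /* pixels produced */
}
```

Per compressed octet this is one table lookup and a short copy, which should stay well inside the cycle budget; the whole question is how to choose the 256 entries.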
Questions
- How can a good dictionary be constructed? I know that finding an optimal dictionary for given data is an NP-complete problem, but are there any reasonably good approximations? I have tried zstandard's dictionary builder, but the results were not very good; a naive greedy baseline is sketched after these questions for reference.
- Is there something in my conclusions that I've gotten wrong? (Am I on the wrong track and omitting something obvious?)
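For the first question, the naive greedy baseline could look something like the sketch below. It is brute force and roughly quadratic per pick, so it is strictly an offline tool (or a reference for a suffix-array based one), and the scoring ignores the cost of storing the entries themselves; all names and constants are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN   8     /* longest dictionary entry considered */
#define DICT_SIZE 64    /* how many multi-byte entries to pick  */

/* 1 if none of corpus[i..i+len) has been claimed by an earlier entry. */
static int span_free(const uint8_t *used, size_t i, size_t len)
{
    for (size_t k = 0; k < len; k++)
        if (used[i + k]) return 0;
    return 1;
}

/* Count non-overlapping, still-unclaimed occurrences of pat[0..len). */
static size_t count_occ(const uint8_t *corpus, const uint8_t *used,
                        size_t n, const uint8_t *pat, size_t len)
{
    size_t count = 0, i = 0;
    while (i + len <= n) {
        if (span_free(used, i, len) && memcmp(corpus + i, pat, len) == 0) {
            count++; i += len;
        } else {
            i++;
        }
    }
    return count;
}

/* Greedy builder: repeatedly pick the substring that saves the most bytes
 * if every occurrence were replaced by a one-byte code ((len-1)*count),
 * claim those occurrences, and repeat. */
void build_dictionary(const uint8_t *corpus, size_t n,
                      uint8_t dict[DICT_SIZE][MAX_LEN],
                      size_t dict_len[DICT_SIZE])
{
    uint8_t *used = calloc(n, 1);
    if (!used) return;

    for (size_t e = 0; e < DICT_SIZE; e++) {
        size_t best_gain = 0, best_pos = 0, best_len = 0;

        for (size_t len = 2; len <= MAX_LEN; len++)
            for (size_t i = 0; i + len <= n; i++) {
                if (!span_free(used, i, len)) continue;
                size_t gain = count_occ(corpus, used, n, corpus + i, len)
                              * (len - 1);
                if (gain > best_gain) {
                    best_gain = gain; best_pos = i; best_len = len;
                }
            }

        if (best_gain == 0) { dict_len[e] = 0; continue; }
        memcpy(dict[e], corpus + best_pos, best_len);
        dict_len[e] = best_len;

        /* Claim every occurrence so later picks don't count it again. */
        for (size_t i = 0; i + best_len <= n; ) {
            if (span_free(used, i, best_len) &&
                memcmp(corpus + i, dict[e], best_len) == 0) {
                memset(used + i, 1, best_len);
                i += best_len;
            } else {
                i++;
            }
        }
    }
    free(used);
}
```

A real tool would also charge each entry for its own storage and use a suffix array or suffix automaton for the counting, but the shape of the search stays the same.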
Best algorithm so far
Just to give some background information, the best useful algorithm I have been able to figure out is as follows:
- All samples in the font data for a single glyph are concatenated (flattened) into a one-dimensional array (vector, table).
- Each sample has three possible states: 0, 255, and "something else".
- This information is packed five consecutive samples at a time into a five-digit base-3 number (0..242, since 3^5 = 243).
- As an octet has some values to spare (2^8 = 256 vs. 3^5 = 243), the remaining 13 values are used to signify longer runs of 0s and 255s.
- For each "something else" value the actual value (1..254) is stored in a separate vector.
This data is fast to decompress, as the base-3 values can be decoded into base-4 values by a smallish (243 x 3 = 729 octets) lookup table. The compression ratios are highly dependent on the font size, but with my typical data I can get around 1:2. As this is significantly worse than LZ variants (which get around 1:3), I would like to try the static dictionary approach.
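For reference, the decode loop could look roughly like this. It is my reading of the scheme: instead of the packed 243 x 3 base-4 table I use a plain 243 x 5 digit table to keep it readable, the digit order inside a code byte is assumed, and the run codes 243..255 are only stubbed out.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the base-3 scheme described above.  Each code byte below 243
 * expands to five samples: digit 0 -> 0x00, digit 2 -> 0xFF, digit 1 ->
 * the next octet from the side vector of "something else" values. */
static uint8_t digits_lut[243][5];    /* built once at start-up */

static void build_base3_lut(void)
{
    for (int code = 0; code < 243; code++) {
        int v = code;
        for (int d = 0; d < 5; d++) {  /* least-significant digit first (assumed order) */
            digits_lut[code][d] = (uint8_t)(v % 3);
            v /= 3;
        }
    }
}

static void decode_glyph(const uint8_t *codes, size_t n_codes,
                         const uint8_t *extra,   /* the 1..254 values  */
                         uint8_t *out)           /* glyph pixel buffer */
{
    for (size_t i = 0; i < n_codes; i++) {
        if (codes[i] >= 243)          /* run-of-0 / run-of-255 codes: not shown */
            continue;
        const uint8_t *d = digits_lut[codes[i]];
        for (int k = 0; k < 5; k++) {
            switch (d[k]) {
            case 0:  *out++ = 0x00;      break;
            case 2:  *out++ = 0xFF;      break;
            default: *out++ = *extra++;  break;   /* "something else" */
            }
        }
    }
}
```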
Of course, the usual LZ variants use Huffman or arithmetic coding, which naturally makes the compressed data smaller. On the other hand, I have all the data available, and the compression speed is not an issue. This should make it possible to find much better dictionaries.
Due to the nature of the data I might be able to use a lossy algorithm, but in that case the most likely candidate would be reducing the number of quantization levels in the pixel data. That won't change the underlying compression problem much, and I would like to avoid the resulting bit-alignment hassle.
You could consider using something already developed for a scenario similar to yours:
https://github.com/atomicobject/heatshrink
https://spin.atomicobject.com/2013/03/14/heatshrink-embedded-data-compression/
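heatshrink is LZSS with a small, configurable window and can run with statically allocated state, which fits the RAM constraint. A rough sketch of driving its streaming decoder for one glyph is below, assuming the library is built with HEATSHRINK_DYNAMIC_ALLOC=0; the function and enum names follow my reading of heatshrink_decoder.h and should be double-checked against the headers.

```c
#include <stddef.h>
#include <stdint.h>
#include "heatshrink_decoder.h"

/* Decoder state as a plain static struct - no heap needed when the
 * library is built with HEATSHRINK_DYNAMIC_ALLOC=0. */
static heatshrink_decoder hsd;

static size_t decompress_glyph(const uint8_t *in, size_t in_len,
                               uint8_t *out, size_t out_cap)
{
    size_t in_pos = 0, out_pos = 0;

    heatshrink_decoder_reset(&hsd);

    while (in_pos < in_len) {
        size_t used = 0;
        /* Feed as much compressed input as the decoder will accept... */
        heatshrink_decoder_sink(&hsd, (uint8_t *)&in[in_pos],
                                in_len - in_pos, &used);
        in_pos += used;

        /* ...then drain all the output it can produce right now. */
        HSD_poll_res pres;
        do {
            size_t produced = 0;
            pres = heatshrink_decoder_poll(&hsd, &out[out_pos],
                                           out_cap - out_pos, &produced);
            out_pos += produced;
        } while (pres == HSDR_POLL_MORE);
    }

    /* Flush whatever is still buffered inside the decoder. */
    while (heatshrink_decoder_finish(&hsd) == HSDR_FINISH_MORE) {
        size_t produced = 0;
        heatshrink_decoder_poll(&hsd, &out[out_pos], out_cap - out_pos, &produced);
        out_pos += produced;
        if (produced == 0)
            break;              /* output buffer full - stop rather than spin */
    }
    return out_pos;             /* number of pixel octets recovered */
}
```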