One of the neat characteristics of UTF-8 is that if you compare two strings (with <) byte-by-byte, you get the same answer as if you had compared them codepoint-by-codepoint. I was wondering if there was a similar encoding that was optimal in size (e.g. UTF-8 "wastes" space by tagging bytes with 10xxxxxx if they are not the first byte representing a codepoint).
The assumption for optimality here is that a non-negative number n is more frequent than a number m if n < m.
I am most interested in knowing if there is a (byte-comparable) encoding that works for integers, with n more frequent than m if |n| < |m|.
Have you considered a variant of Huffman coding? Traditionally one recursively merges the two least frequent symbols, but to preserve order one could instead merge the two adjacent symbols having the least sum.
Looks like this problem has been well-studied (and the greedy algorithm is not optimal). The optimal algorithm was given by Hu and Tucker, which is described here and more detail in this thesis.
This paper discussing order-preserving dictionary-based compression also looks interesting.
There are very few standard encodings and the answer is no. Any further optimization beyond UTF-8 should not be referred to as "encoding" but a "compression" - and lexicographically-comparable compression is a different department.
If you are solving a real-world (non-purely-academic) problem, I'd just stick with the most standard UTF8. You can learn about its efficiency compared to other standard encodings on utf8everywhere.org.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With