Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimal integer encoding that still sorts

One of the neat characteristics of UTF-8 is that if you compare two strings (with <) byte-by-byte, you get the same answer as if you had compared them codepoint-by-codepoint. I was wondering if there was a similar encoding that was optimal in size (e.g. UTF-8 "wastes" space by tagging bytes with 10xxxxxx if they are not the first byte representing a codepoint).

The assumption for optimality here is that a non-negative number n is more frequent than a number m if n < m.

I am most interested in knowing if there is a (byte-comparable) encoding that works for integers, with n more frequent than m if |n| < |m|.

like image 535
U2EF1 Avatar asked May 20 '12 04:05

U2EF1


2 Answers

Have you considered a variant of Huffman coding? Traditionally one recursively merges the two least frequent symbols, but to preserve order one could instead merge the two adjacent symbols having the least sum.

Looks like this problem has been well-studied (and the greedy algorithm is not optimal). The optimal algorithm was given by Hu and Tucker, which is described here and more detail in this thesis.

This paper discussing order-preserving dictionary-based compression also looks interesting.

like image 123
Erick Wong Avatar answered Sep 27 '22 18:09

Erick Wong


There are very few standard encodings and the answer is no. Any further optimization beyond UTF-8 should not be referred to as "encoding" but a "compression" - and lexicographically-comparable compression is a different department.

If you are solving a real-world (non-purely-academic) problem, I'd just stick with the most standard UTF8. You can learn about its efficiency compared to other standard encodings on utf8everywhere.org.

like image 45
Pavel Radzivilovsky Avatar answered Sep 27 '22 16:09

Pavel Radzivilovsky