Optimal integer encoding that still sorts

Question

One of the neat characteristics of UTF-8 is that if you compare two strings (with <) byte-by-byte, you get the same answer as if you had compared them codepoint-by-codepoint. I was wondering if there was a similar encoding that was optimal in size (e.g. UTF-8 "wastes" space by tagging bytes with 10xxxxxx if they are not the first byte representing a codepoint).

The assumption for optimality here is that a non-negative number n is more frequent than a number m if n < m.

I am most interested in knowing if there is a (byte-comparable) encoding that works for integers, with n more frequent than m if |n| < |m|.

Erick Wong · Accepted Answer

Have you considered a variant of Huffman coding? Traditionally one recursively merges the two least frequent symbols, but to preserve order one could instead merge the two adjacent symbols having the least sum.

Looks like this problem has been well-studied (and the greedy algorithm is not optimal). The optimal algorithm was given by Hu and Tucker, which is described here and more detail in this thesis.

This paper discussing order-preserving dictionary-based compression also looks interesting.

Pavel Radzivilovsky · Answer

There are very few standard encodings and the answer is no. Any further optimization beyond UTF-8 should not be referred to as "encoding" but a "compression" - and lexicographically-comparable compression is a different department.

If you are solving a real-world (non-purely-academic) problem, I'd just stick with the most standard UTF8. You can learn about its efficiency compared to other standard encodings on utf8everywhere.org.

Optimal integer encoding that still sorts

Tags:

comparison

math

compression

string-comparison

U2EF1

2 Answers

Erick Wong

Pavel Radzivilovsky

Recent Activity

Donate For Us

Optimal integer encoding that still sorts

Tags:

comparison

math

compression

string-comparison

U2EF1

2 Answers

Erick Wong

Pavel Radzivilovsky

Related questions

Recent Activity

Donate For Us