Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Space-efficient way to encode numbers as sortable strings

Starting with a list of integers the task is to convert each integer into a string such that the resulting list of strings will be in numeric order when sorted lexicographically.

This is needed so that a particular system that is only capable of sorting strings will produce an output that is in numeric order.

Example:

Given the integers

1, 23, 3

we could convert the to strings like this:

"01", "23", "03"

so that when sorted they become:

"01", "03", "23"

which is correct. A wrong result would be:

"1", "23", "3"

because that list is sorted in "string order", not in numeric order.

I'm looking for something more efficient than the simple zero-padding scheme. In order to cover all possible 32 bit integers we'd need to pad to 10 digits which is inefficient.

like image 316
usr Avatar asked Oct 25 '25 22:10

usr


1 Answers

For integers, prefix each number with the length. To make it more readable, use 'a' for length 1, and 'b' for length 2. Example:

non-encoded  encoded
1            "a1"
3            "a3"
23           "b23"

This scheme is a bit simpler than prefixing each digit, but only works with numbers, not numbers mixed with text. It can be made to work for negative numbers as well, and even BigDecimal numbers, using some tricks. I wrote an implementation in Apache Jackrabbit 2.x, to make BigDecimal indexable (sortable) as text. For that, I used a format that only uses the characters '0' to '9' and consists of:

  • one character for: signum(value) + 2
  • one character for: signum(exponent) + 2
  • one character for: length(exponent) - 1
  • multiple characters for: exponent
  • multiple characters for: value (-1 if inverted)

Only the signum is encoded if the value is zero. The exponent is not encoded if zero. Negative values are "inverted" character by character (0 => 9, 1 => 8, and so on). The same applies to the exponent.

Examples:

non-encoded  encoded
0            "2"
2            "322" (signum 1; exponent 0; value 2)
120          "330212" (signum 1; exponent signum 1, length 1, value 2; value 12)
-1           "179" (signum -1, rest inverted; exponent 0; value 1 (-1, inverted))

Values between BigDecimal(BigInteger.ONE, Integer.MIN_VALUE) and BigDecimal(BigInteger.ONE, Integer.MAX_VALUE) are supported.

like image 158
Thomas Mueller Avatar answered Oct 28 '25 03:10

Thomas Mueller