Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the best way to compress or encode a list of numbers into a single alphanumeric string?

What's the best way to compress or encode a list of numbers of arbitrary length and sizes into a single alphanumeric string?

The goal is to be able to convert something like 1,5,8,3,20,212,42 into something like a8D1jN to be used in a URL, and then back to 1,5,8,3,20,212,42.

For the resulting string I'm fine with any number and any ascii character, lowercase and uppercase, so: 0-9a-zA-Z. I prefer not to have any punctuation whatsoever.

UPDATE: added note about which characters are fine.

like image 877
pupeno Avatar asked Oct 04 '10 18:10

pupeno


1 Answers

If you consider your list as a string, then you have 11 different characters to encode (0-9 and comma). This can be expressed in 4 bits. If you were willing to add, say $ and ! to your list of acceptable characters, then you would have 64 different output characters, and thus be able to encode 6 bits per character.

This would mean that you could map the string to an encoded string that would be about 30% shorter than the original one, and fairly obfuscated and random looking.

This way you could transcode the number series [1,5,8,3,20,212,42] to the string "gLQfoIcIeQqq".

UPDATE: I felt inspired and wrote a python solution for this solution (not fast but functional enough...)

ZERO = ord('0')
OUTPUT_CHARACTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789$!"

def encode(numberlist):

    # convert to string -> '1,5,8,3,20,212,42'
    s = str(numberlist).replace(' ','')[1:-1]

    # convert to four bit values -> ['0010', '1011', '0110', ... ]
    # (add 1 to avoid the '0000' series used for padding later)
    four_bit_ints = [0 <= (ord(ch) - ZERO) <= 9 and (ord(ch) - ZERO) + 1 or 11 for ch in s]
    four_bits = [bin(x).lstrip('-0b').zfill(4) for x in four_bit_ints]

    # make binary string and pad with 0 to align to 6 -> '00101011011010111001101101...'
    bin_str = "".join(four_bits)
    bin_str = bin_str + '0' * (6 - len(bin_str) % 6)

    # split to 6bit blocks and map those to ints
    six_bits = [bin_str[x * 6 : x * 6 + 6] for x in range(0, len(bin_str) / 6)]
    six_bit_ints = [int(x, 2) for x in six_bits]

    # map the 6bit integers to characters
    output = "".join([OUTPUT_CHARACTERS[x] for x in six_bit_ints])

    return output

def decode(input_str):

    # map the input string from characters to 6bit integers, and convert those to bitstrings
    six_bit_ints = [OUTPUT_CHARACTERS.index(x) for x in input_str]
    six_bits = [bin(x).lstrip('-0b').zfill(6) for x in six_bit_ints]

    # join to a single binarystring
    bin_str = "".join(six_bits)

    # split to four bits groups, and convert those to integers
    four_bits = [bin_str[x * 4 : x * 4 + 4] for x in range(0, len(bin_str) / 4)]
    four_bit_ints = [int(x, 2) for x in four_bits]

    # filter out 0 values (padding)
    four_bit_ints = [x for x in four_bit_ints if x > 0]

    # convert back to the original characters -> '1',',','5',',','8',',','3',',','2','0',',','2','1','2',',','4','2'
    chars = [x < 11 and str(x - 1) or ',' for x in four_bit_ints]

    # join, split on ',' convert to int
    output = [int(x) for x in "".join(chars).split(',') if x]

    return output


if __name__ == "__main__":

    # test
    for i in range(100):
        numbers = range(i)
        out = decode(encode(numbers))
        assert out == numbers

    # test with original series
    numbers = [1,5,8,3,20,212,42]
    encoded = encode(numbers)
    print encoded         # prints 'k2UBsZgZi7uW'
    print decode(encoded) # prints [1, 5, 8, 3, 20, 212, 42]
like image 66
AHM Avatar answered Oct 15 '22 20:10

AHM