Algorithm Challenge: Arbitrary in-place base conversion for lossless string compression

Question

It might help to start with a real-world example. Say I'm writing a web app that's backed by MongoDB, so my records have a long hex primary key, making my URL to view a record look like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long. URLs can use many more characters than that. While there are just under 8 x 10^28 (16^24) possible values in a 24-digit hex number, limiting yourself to just the 62 characters matched by an [a-zA-Z0-9] regex class (a YouTube video id uses more), you can get past 8 x 10^28 in only 17 characters.
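
The 17-character figure can be checked with a short calculation. This is only a sketch: minLength is a hypothetical helper name, not part of the question, and it relies on floating-point logarithms, so it may be off by one when the ratio happens to be an exact integer.

```javascript
// Smallest number of destination digits that can hold every value that fits
// in `sourceLength` digits of base `sourceBase`:
//   sourceBase^sourceLength <= destBase^n
//   <=>  n >= sourceLength * log(sourceBase) / log(destBase)
// Uses floating point, so treat results at exact boundaries with care.
function minLength(sourceLength, sourceBase, destBase) {
  return Math.ceil(sourceLength * Math.log(sourceBase) / Math.log(destBase));
}

console.log(minLength(24, 16, 62)); // 17
```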

I want an algorithm that will convert any string that is limited to a specific alphabet of characters to any other string with another alphabet of characters, where the value of each character c could be thought of as alphabet.indexOf(c).

Something of the form:

convert(value, sourceAlphabet, destinationAlphabet)

Assumptions

all parameters are strings
every character in value exists in sourceAlphabet
every character in sourceAlphabet and destinationAlphabet is unique
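
For illustration, the three assumptions above could be checked up front roughly like this. checkAssumptions is a hypothetical helper that is not part of the question; it is a sketch of one way to fail early, nothing more.

```javascript
// Hypothetical helper: throws if one of the three stated assumptions is violated.
function checkAssumptions(value, sourceAlphabet, destinationAlphabet) {
  // 1. all parameters are strings
  for (const arg of [value, sourceAlphabet, destinationAlphabet]) {
    if (typeof arg !== "string") throw new TypeError("all parameters must be strings");
  }
  // 3. every character in each alphabet is unique
  for (const alphabet of [sourceAlphabet, destinationAlphabet]) {
    const symbols = Array.from(alphabet); // iterates by code point
    if (new Set(symbols).size !== symbols.length) {
      throw new Error("alphabet contains a repeated character: " + alphabet);
    }
  }
  // 2. every character in value exists in sourceAlphabet
  const allowed = new Set(sourceAlphabet);
  for (const c of value) {
    if (!allowed.has(c)) throw new Error("character not in sourceAlphabet: " + c);
  }
}
```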

Simplest example

var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";

But I also want it to work to convert War & Peace from the Russian alphabet plus some punctuation to the entire unicode charset and back again losslessly.

Is this possible?

The only way I was ever taught to do base conversions in Comp Sci 101 was to first convert to a base-ten integer by summing digit * base^position and then do the reverse to convert to the target base. Such a method is insufficient for very long strings, because the integers get too big.
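
As a sketch of that textbook method and where it breaks down: the version below accumulates the value in a single native Number, so it is only exact up to Number.MAX_SAFE_INTEGER (2^53 - 1). convertSmall is a hypothetical name, and the code assumes both alphabets contain only characters that fit in one UTF-16 code unit.

```javascript
// Naive conversion through one native Number.
// Assumes single-code-unit characters in both alphabets.
function convertSmall(value, sourceAlphabet, destinationAlphabet) {
  const sourceBase = sourceAlphabet.length;
  const destBase = destinationAlphabet.length;

  // Step 1: sum digit * base^position (Horner's rule, most significant first).
  let n = 0;
  for (const c of value) {
    n = n * sourceBase + sourceAlphabet.indexOf(c);
    if (n > Number.MAX_SAFE_INTEGER) {
      throw new RangeError("value too large for a native Number");
    }
  }

  // Step 2: peel off destination digits, least significant first.
  let result = "";
  do {
    result = destinationAlphabet[n % destBase] + result;
    n = Math.floor(n / destBase);
  } while (n > 0);
  return result;
}

console.log(convertSmall("12245589", "0123456789", "0123456789abcdef")); // "bada55"
```

The 24-digit hex key from the example already exceeds 2^53, so this version throws a RangeError for it.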

Intuitively, it feels like a base conversion could be done in place as you step through the string (probably backwards, to maintain standard significant-digit order), keeping track of a remainder somehow, but I'm not smart enough to work out how.
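
One well-known way to carry a remainder through the digits without any big integer type is schoolbook long division on the digit array: divide the whole number by the destination base, record the remainder as the next output digit, and repeat on the quotient. The sketch below is not the accepted answer's method. It is not strictly in place (it builds a new quotient array on each pass), it takes quadratic time in the input length, and it assumes single-code-unit characters in both alphabets.

```javascript
// Repeated long division of a digit array by the destination base.
// No intermediate value reaches sourceBase * destBase, so native Numbers suffice.
function convert(value, sourceAlphabet, destinationAlphabet) {
  const sourceBase = sourceAlphabet.length;
  const destBase = destinationAlphabet.length;
  // Digits of the number, most significant first.
  let digits = Array.from(value, (c) => sourceAlphabet.indexOf(c));
  let result = "";
  do {
    let remainder = 0;
    const quotient = [];
    for (const d of digits) {
      const acc = remainder * sourceBase + d;
      const q = Math.floor(acc / destBase);
      remainder = acc % destBase;
      if (quotient.length > 0 || q > 0) quotient.push(q); // skip leading zeros
    }
    // The remainder of this pass is the next digit, least significant first.
    result = destinationAlphabet[remainder] + result;
    digits = quotient;
  } while (digits.length > 0);
  return result;
}

console.log(convert("12245589", "0123456789", "0123456789abcdef")); // "bada55"
```

Because the input is treated as a number, leading zero-valued characters are dropped ("007" converts like "7"), so a lossless string round trip would also need the original length or some marker for those characters.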

That's where you come in, StackOverflow. Are you smart enough?

Perhaps this is a solved problem, done on paper by some 18th century mathematician, implemented in LISP on punch cards in 1970 and the first homework assignment in Cryptography 101, but my searches have borne no fruit.

I'd prefer a solution in JavaScript with a functional style, but any language or style will do, as long as you're not cheating with some big integer library. Bonus points for efficiency, of course.

Please refrain from criticizing the original example. The general nerd cred of solving the problem is more important than any application of the solution.

Mark Adler · Accepted Answer

Here is a solution in C that is very fast, using bit shift operations. It assumes that you know what the length of the decoded string should be. The strings are vectors of integers in the range 0..maximum for each alphabet. It is up to the user to convert to and from strings with restricted ranges of characters. As for the "in-place" in the question title, the source and destination vectors can overlap, but only if the source alphabet is not larger than the destination alphabet.

/*
  recode version 1.0, 22 August 2015

  Copyright (C) 2015 Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
*/

/* Recode a vector from one alphabet to another using intermediate
   variable-length bit codes. */

/* The approach is to use a Huffman code over equiprobable alphabets in two
   directions.  First to encode the source alphabet to a string of bits, and
   second to encode the string of bits to the destination alphabet. This will
   be reasonably close to the efficiency of base-encoding with arbitrary
   precision arithmetic. */

#include <stddef.h>     // size_t
#include <limits.h>     // UINT_MAX, ULLONG_MAX

#if UINT_MAX == ULLONG_MAX
#  error recode() assumes that long long has more bits than int
#endif

/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
   code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
   *dlen returns the length of the result, which will not exceed the value of
   *dlen when called.  If the original *dlen is not large enough to hold the
   full result, then recode() will return non-zero to indicate failure.
   Otherwise recode() will return 0.  recode() will also return non-zero if
   either of the smax or dmax parameters are less than one.  The non-zero
   return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
   and 3 if any of the elements of source are greater than smax.

   Using this same operation on the result with smax and dmax reversed reverses
   the operation, restoring the original vector.  However there may be more
   symbols returned than the original, so the number of symbols expected needs
   to be known for decoding.  (An end symbol could be appended to the source
   alphabet to include the length in the coding, but then encoding and decoding
   would no longer be symmetric, and the coding efficiency would be reduced.
   This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
           const unsigned *source, size_t slen, unsigned smax)
{
    // compute sbits and scut, with which we will recode the source with
    // sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
    if (smax < 1)
        return 2;
    unsigned sbits = 0;
    unsigned scut = 1;          // 2**sbits
    while (scut && scut <= smax) {
        scut <<= 1;
        sbits++;
    }
    scut -= smax + 1;

    // same thing for dbits and dcut
    if (dmax < 1)
        return 2;
    unsigned dbits = 0;
    unsigned dcut = 1;          // 2**dbits
    while (dcut && dcut <= dmax) {
        dcut <<= 1;
        dbits++;
    }
    dcut -= dmax + 1;

    // recode a base smax+1 vector to a base dmax+1 vector using an
    // intermediate bit vector (a sliding window of that bit vector is kept in
    // a bit buffer)
    unsigned long long buf = 0;     // bit buffer
    unsigned have = 0;              // number of bits in bit buffer
    size_t i = 0, n = 0;            // source and dest indices
    unsigned sym;                   // symbol being encoded
    for (;;) {
        // encode enough of source into bits to encode that to dest
        while (have < dbits && i < slen) {
            sym = source[i++];
            if (sym > smax) {
                *dlen = n;
                return 3;
            }
            if (sym < scut) {
                buf = (buf << (sbits - 1)) + sym;
                have += sbits - 1;
            }
            else {
                buf = (buf << sbits) + sym + scut;
                have += sbits;
            }
        }

        // if not enough bits to assure one symbol, then break out to a special
        // case for coding the final symbol
        if (have < dbits)
            break;

        // encode one symbol to dest
        if (n == *dlen)
            return 1;
        sym = buf >> (have - dbits + 1);
        if (sym < dcut) {
            dest[n++] = sym;
            have -= dbits - 1;
        }
        else {
            sym = buf >> (have - dbits);
            dest[n++] = sym - dcut;
            have -= dbits;
        }
        buf &= ((unsigned long long)1 << have) - 1;
    }

    // if any bits are left in the bit buffer, encode one last symbol to dest
    if (have) {
        if (n == *dlen)
            return 1;
        sym = buf;
        sym <<= dbits - 1 - have;
        if (sym >= dcut)
            sym = (sym << 1) - dcut;
        dest[n++] = sym;
    }

    // return recoded vector
    *dlen = n;
    return 0;
}

/* Test recode(). */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>

// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
    unsigned bits = 0;
    unsigned long long mask = 1;
    while (mask <= max) {
        mask <<= 1;
        bits++;
    }
    mask--;
    unsigned long long ran = 0;
    unsigned have = 0;
    size_t n = 0;
    while (n < len) {
        while (have < bits) {
            ran = (ran << 31) + random();
            have += 31;
        }
        if ((ran & mask) <= max)
            vec[n++] = ran & mask;
        ran >>= bits;
        have -= bits;
    }
}

// Get a valid number from str and assign it to var
#define NUM(var, str) \
    do { \
        char *end; \
        unsigned long val = strtoul(str, &end, 0); \
        var = val; \
        if (*end || var != val) { \
            fprintf(stderr, \
                    "invalid or out of range numeric argument: %s\n", str); \
            return 1; \
        } \
    } while (0)

/* "bet n m len count" generates count test vectors of length len, where each
   entry is in the range 0..n.  Each vector is recoded to another vector using
   only symbols in the range 0..m.  That vector is recoded back to a vector
   using only symbols in 0..n, and that result is compared with the original
   random vector.  Report on the average ratio of input and output symbols, as
   compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
    // get sizes of alphabets and length of test vector, compute maximum sizes
    // of recoded vectors
    unsigned smax, dmax, runs;
    size_t slen, dsize, bsize;
    if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
    NUM(smax, argv[1]);
    NUM(dmax, argv[2]);
    NUM(slen, argv[3]);
    NUM(runs, argv[4]);
    dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
    bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));

    // generate random test vectors, encode, decode, and compare
    srandomdev();
    unsigned source[slen], dest[dsize], back[bsize];
    unsigned mis = 0, i;
    unsigned long long dtot = 0;
    int ret;
    for (i = 0; i < runs; i++) {
        ranvec(source, slen, smax);
        size_t dlen = dsize;
        ret = recode(dest, &dlen, dmax, source, slen, smax);
        if (ret) {
            fprintf(stderr, "encode error %d\n", ret);
            break;
        }
        dtot += dlen;
        size_t blen = bsize;
        ret = recode(back, &blen, smax, dest, dlen, dmax);
        if (ret) {
            fprintf(stderr, "decode error %d\n", ret);
            break;
        }
        // memcmp() counts bytes, so scale the element count by the element size
        if (blen < slen || memcmp(source, back, slen * sizeof(unsigned)))  // blen > slen is ok
            mis++;
    }
    if (mis)
        fprintf(stderr, "%u/%u mismatches!\n", mis, i);
    if (ret == 0)
        printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
               dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
    return 0;
}
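
To make the variable-length bit codes concrete, here is a small sketch in JavaScript (the question's preferred language) of the code that recode() assigns to each symbol of one alphabet. It mirrors how sbits and scut are computed above; codeParams, bitCode and meanBits are hypothetical names used only for this illustration, and the unsigned overflow guard of the C code is omitted.

```javascript
// For an alphabet of `size` symbols (size = smax + 1 in recode()'s terms):
// symbols below `cut` are coded with bits - 1 bits, the rest with `bits` bits.
function codeParams(size) {
  let bits = 0;
  let pow = 1; // 2**bits
  while (pow < size) {
    pow *= 2;
    bits++;
  }
  return { bits, cut: pow - size };
}

// Bit string for one symbol, following the two branches in recode().
function bitCode(sym, size) {
  const { bits, cut } = codeParams(size);
  return sym < cut
    ? sym.toString(2).padStart(bits - 1, "0")
    : (sym + cut).toString(2).padStart(bits, "0");
}

// Average code length when all symbols are equally likely.
function meanBits(size) {
  const { bits, cut } = codeParams(size);
  return (cut * (bits - 1) + (size - cut) * bits) / size;
}

console.log(codeParams(10)); // { bits: 4, cut: 6 }
console.log(bitCode(5, 10)); // "101"
console.log(bitCode(6, 10)); // "1100"
console.log(meanBits(10));   // 3.4
```

For a ten-symbol alphabet the average is 3.4 bits per symbol against an ideal of log2(10), about 3.32, which is the small loss relative to arbitrary-precision base encoding that the header comment mentions.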

גלעד ברקן · Answer

As has been pointed out in other StackOverflow answers, try not to think of summing digit * base^position as converting it to base ten; rather, think of it as directing the computer to generate a representation of the quantity represented by the number in its own terms (for most computers probably closer to our concept of base 2). Once the computer has its own representation of the quantity, we can direct it to output the number in any way we like.

By rejecting "big integer" implementations and asking for letter-by-letter conversion you are at the same time arguing that the numerical/alphabetical representation of quantity is not actually what it is, namely that each position represents a quantity of digit * base^position. If the nine-millionth character of War and Peace does represent what you are asking to convert it from, then the computer at some point will need to generate a representation for Д * 33^9000000.
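
To illustrate the point, here is a sketch that lets the runtime build its own representation of the quantity and then print it in the target alphabet. It uses JavaScript's built-in BigInt, which the question explicitly rules out, so it is shown only to make the argument concrete. convertBig is a hypothetical name, and single-code-unit alphabet characters are assumed.

```javascript
// Conversion through the runtime's own arbitrary-precision integer.
function convertBig(value, sourceAlphabet, destinationAlphabet) {
  const sourceBase = BigInt(sourceAlphabet.length);
  const destBase = BigInt(destinationAlphabet.length);

  // Build the quantity: each step multiplies in one more power of the base.
  let n = 0n;
  for (const c of value) {
    n = n * sourceBase + BigInt(sourceAlphabet.indexOf(c));
  }

  // Output the quantity in the destination alphabet, least significant first.
  let result = "";
  do {
    result = destinationAlphabet[Number(n % destBase)] + result;
    n /= destBase; // BigInt division truncates
  } while (n > 0n);
  return result;
}

console.log(convertBig("12245589", "0123456789", "0123456789abcdef")); // "bada55"
```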

Tags: algorithm, math, encoding, cryptography, compression

Question asked by Erik R.
