Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bit compression in Javascript

Is there a way to compress a JavaScript array of 250+ 1s and 0s into something a little more manageable (say a shorter string) and then manageably decompress the same? Sort of like the way Google did its image encodings...

Thanks!

like image 363
Rio Avatar asked Oct 06 '22 23:10

Rio


2 Answers

I can give you almost 1:5 compression by encoding as base 32. I chose to include a simple length value to make it allow variable-length. Please see this fiddle demonstrating the technique with two functions that allow you to round-trip the value. (Or you can see an earlier, more naive hexadecimal version I created before @slebetman reminded me of the native number base conversion that exists in javascript.)

Here's sample output for one set of 250 1s and 0s. The number of characters does not count the leading "250|":

base 32, 50 chars: 250|qgl6alf1q2lbl1aclau3k5ana2kpals78alek59ilboeglajgu
base 16, 63 chars: 250|D42A6555E1D0AABA854CAABC3A155750A995578742AAEA1532AAF0E85553878

You can use a base 64 encoding to get it down to 42 characters, but be aware that with both the base 32 and base 64 versions you can end up with words in your final result that may be objectionable (please see the fiddle above for an example). The hex version can also have objectionable content, but much less so (a bad face bade a dad be a cad?)

Please let me know if you need to save 8 more characters and I will work up additional script for you. Avoiding vowels could be one way to deal with the objectionable word problem. Let me know if you need to do this as well.

If your bit strings will always be 250 characters, then the functions can be simplified a bit, but I didn't want to make this assumption.

For reference here's the bits-to-base-32 function.

function bitstringEncode(bitstring) {
    var i, l = bitstring.length,
        retval = l.toString() + '|';
    for (i = 0; i < l; i += 5) {
        retval += parseInt((bitstring.substr(i, 5) + '0000').substr(0, 5), 2).toString(32);
    }
    return retval;
}

This function will pad to the nearest 5 bits, and may generate a spurious extra character at the end for the length you are providing. I included a second version of each conversion function that pads to the nearest 10 bits, which may generate up to two spurious extra characters. I included them because if speed is important, they may (or may not) be faster, as they take larger chunks from the inputs.

like image 85
ErikE Avatar answered Oct 10 '22 02:10

ErikE


(There has not been much explanation in the other answers, so in addition to presenting my approach, I would like to discuss the approaches presented so far in my answer. Please bear with me.)

As other answers indicate, an array of bits can be treated as a stream of bits, which is essentially a rather large number written in base 2. The same number can be written in another number base. Because single characters other than decimal digits can be used for higher value digits in greater number bases (such as "F" or "f" for 15 in hexadecimal), the greater the number base, the fewer digits (characters) are required to display it.

As suggested in those answers, you can use base64 encoding and even greater bases (the Unicode Base Multilingual Plane has 65536 code points, and conforming ECMAScript implementations support that, so base 65536 is a distinct possibility although you would have to percent-encode much again for URIs), but in ECMAScript that would require a user-defined function, perhaps a library containing it; at least it requires a non-native implementation of the conversion algorithm which is necessarily slower than a native one.

Fortunately, ECMAScript implementations have built-in methods that allow you to convert numbers from one base to another, from base 2 to 36 inclusive. There is parseInt(string, radix) with which you can convert a numeric String value string written in base radix to a value of the Number type, and there is number.toString(radix) with which you can convert the Number value number into a numeric String written in base radix.

However, because the ECMAScript Number type is an implementation of IEEE-754 double-precision floating-point numbers, there are several limits to integer precision. AIUI one is that, for a bit array full of ones, unless your array does not contain more than 53 bit-elements (or your string does not contain more than 53 "1"s), you cannot convert the entire bit string forth and back without loss of precision. (The significand of IEEE-754 doubles has a precision of 53 bits.)

But you can view a large (binary) number as a concatenation of smaller (binary) number strings, split up the original bit stream in small enough chunks and convert each chunk to the greater base. In any case, information on consecutive high bits that are 0 is lost for each chunk. So when restoring the bit stream from the conversion result, you need to pad each chunk on the left with zeros so that each decoded chunk is as long as the original chunk. The chunk size needs to be weighed against the number of steps it takes to encode the stream and the number of zeros that need to be padded when decoding it.

AIUI, if you process the bit stream from left to right, the number encoded by each chunk will potentially be greater, therefore the encoded string will be potentially longer, even with a greater base, because high bits in chunks might be set (for example, compare the right-bound 11|001|001 – 3|1|1 – with the left-bound 110|010|016|2|1 –, both have chunk size 3). The reason for encoding your data in the first place was a short URI. So as the stream is finished before encoding, you should process the stream from right to left instead. (This approach also removes the necessity of including in the encoded string the number of original bits if that number is a multiple of the chunk size.)

These considerations lead to the following general (for readability, not fully optimized) functions:

/*
 * @param bitArray : Array[Number|String]
 * @param chunkSize : optional Number = 53
 * @param chunkBase: optional Number = 36
 * @param delim : optional String = ","
 *   Delimiter to use.
 * @return string
 */
function bitEncode (bitArray, chunkSize, chunkBase, delim)
{
  var chunkArray = [];
  if (!chunkSize || chunkSize < 2 || chunkSize > 53)
  {
    chunkSize = 53;
  }

  if (!chunkBase)
  {
    chunkBase = 36;
  }

  for (var i = bitArray.length; i > 0; i -= chunkSize)
  {
    var index = i - chunkSize;
    if (index < 0)
    {
      index = 0;
    }

    var slice = bitArray.slice(index, i);
    var chunk = parseInt(slice.join(""), 2).toString(chunkBase);
    chunkArray.unshift(chunk);
  }

  return chunkArray.join(delim);
}

/*
 * @param input : String
 * @param length : Number > 1
 *   Target length of input after left-padded with zeros
 * @return string
 */
function leadingZero (input, length)
{
  input = String(input);

  var inputLength = input.length;
  if (inputLength >= length)
  {
    return input;
  }

  var padding = [];
  padding.length = length + 1 - inputLength;

  return padding.join("0") + input;
}

/*
 * @param s : String
 * @param chunkSize : optional Number = 53
 * @param chunkBase : optional Number = 36
 * @param delim : optional String = ","
 * @return Array[string]
 */
function bitDecode (s, chunkSize, chunkBase, delim)
{
  var chunkArray = s.split(delim || ",");
  var bitArray = [];
  if (!chunkSize || chunkSize > 53)
  {
    chunkSize = 53;
  }

  if (!chunkBase)
  {
    chunkBase = 36;
  }

  for (var i = 0, len = chunkArray.length; i < len; ++i)
  {
    bitArray = bitArray.concat(
      leadingZero(
        parseInt(chunkArray[i], chunkBase).toString(2),
        chunkSize)
      .split(""));
  }

  return bitArray;
}

As you can see, the default chunk size here is 53 bits, and the default base is 36. Thus, an array of 250 random bits –

var a = [];
for (var i = 250; i--;)
{
  a[i] = +(Math.random() < 0.5);
}

– which could be (in right-bound chunks of 53 bits)

/*
              "11111110110011110011000011001010101010\
11010011111010010010100110100100010011001011001010111\
00100100010000101110011010000011100010010101011100011\
11100010110110111001101110000100011101101111101111100\
10001110110100010101110010011100110110100101110010011"
*/
a.join("")

would be encoded by default as

/* "3hou1lt6,21ewvahkfvb,ck8t6olnmr,26lbvliu2rg,1dh74lghy8j" (55 characters) */
var s = bitEncode(a)

and can be decoded like so:

var a = bitDecode(s);

These general functions should allow you to vary the chunk size and base so as to optimize the encoded string for your use-case. (Any potentially offensive words might have been split in two because of the delimiter.)

However, note that the decoded array will contain extra leading zeros if the original array length is not a multiple of the chunk size. If that possibility exists, and poses a problem, you can fix that problem by passing along the original length, as suggested by ErikE, and then using that value:

var originalLength = …;

a = a.slice(a.length - originalLength);

or (in all major implementations except JavaScript before version 1.6 and Opera ECMAScript before version 9.52)

a = a.slice(-originalLength);
like image 40
PointedEars Avatar answered Oct 10 '22 02:10

PointedEars