Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Variable length encoding of an integer

Tags:

variables

c#

uint

Whats the best way of doing variable length encoding of an unsigned integer value in C# ?


"The actual intent is to append a variable length encoded integer (bytes) to a file header."

For ex: "Content-Length" - Http Header

Can this be achieved with some changes in the logic below.


I have written some code which does that ....

like image 862
Kreez Avatar asked Aug 25 '10 06:08

Kreez


2 Answers

A method I have used, which makes smaller values use fewer bytes, is to encode 7 bits of data + 1 bit of overhead pr. byte.

The encoding works only for positive values starting with zero, but can be modified if necessary to handle negative values as well.

The way the encoding works is like this:

  • Grab the lowest 7 bits of your value and store them in a byte, this is what you're going to output
  • Shift the value 7 bits to the right, getting rid of those 7 bits you just grabbed
  • If the value is non-zero (ie. after you shifted away 7 bits from it), set the high bit of the byte you're going to output before you output it
  • Output the byte
  • If the value is non-zero (ie. same check that resulted in setting the high bit), go back and repeat the steps from the start

To decode:

  • Start at bit-position 0
  • Read one byte from the file
  • Store whether the high bit is set, and mask it away
  • OR in the rest of the byte into your final value, at the bit-position you're at
  • If the high bit was set, increase the bit-position by 7, and repeat the steps, skipping the first one (don't reset the bit-position)
          39    32 31    24 23    16 15     8 7      0
value:            |DDDDDDDD|CCCCCCCC|BBBBBBBB|AAAAAAAA|
encoded: |0000DDDD|xDDDDCCC|xCCCCCBB|xBBBBBBA|xAAAAAAA| (note, stored in reverse order)

As you can see, the encoded value might occupy one additional byte that is just half-way used, due to the overhead of the control bits. If you expand this to a 64-bit value, the additional byte will be completely used, so there will still only be one byte of extra overhead.

Note: Since the encoding stores values one byte at a time, always in the same order, big- or little-endian systems will not change the layout of this. The least significant byte is always stored first, etc.

Ranges and their encoded size:

          0 -         127 : 1 byte
        128 -      16.383 : 2 bytes
     16.384 -   2.097.151 : 3 bytes
  2.097.152 - 268.435.455 : 4 bytes
268.435.456 -   max-int32 : 5 bytes

Here's C# implementations for both:

void Main()
{
    using (FileStream stream = new FileStream(@"c:\temp\test.dat", FileMode.Create))
    using (BinaryWriter writer = new BinaryWriter(stream))
        writer.EncodeInt32(123456789);

    using (FileStream stream = new FileStream(@"c:\temp\test.dat", FileMode.Open))
    using (BinaryReader reader = new BinaryReader(stream))
        reader.DecodeInt32().Dump();
}

// Define other methods and classes here

public static class Extensions
{
    /// <summary>
    /// Encodes the specified <see cref="Int32"/> value with a variable number of
    /// bytes, and writes the encoded bytes to the specified writer.
    /// </summary>
    /// <param name="writer">
    /// The <see cref="BinaryWriter"/> to write the encoded value to.
    /// </param>
    /// <param name="value">
    /// The <see cref="Int32"/> value to encode and write to the <paramref name="writer"/>.
    /// </param>
    /// <exception cref="ArgumentNullException">
    /// <para><paramref name="writer"/> is <c>null</c>.</para>
    /// </exception>
    /// <exception cref="ArgumentOutOfRangeException">
    /// <para><paramref name="value"/> is less than 0.</para>
    /// </exception>
    /// <remarks>
    /// See <see cref="DecodeInt32"/> for how to decode the value back from
    /// a <see cref="BinaryReader"/>.
    /// </remarks>
    public static void EncodeInt32(this BinaryWriter writer, int value)
    {
        if (writer == null)
            throw new ArgumentNullException("writer");
        if (value < 0)
            throw new ArgumentOutOfRangeException("value", value, "value must be 0 or greater");

        do
        {
            byte lower7bits = (byte)(value & 0x7f);
            value >>= 7;
            if (value > 0)
                lower7bits |= 128;
            writer.Write(lower7bits);
        } while (value > 0);
    }

    /// <summary>
    /// Decodes a <see cref="Int32"/> value from a variable number of
    /// bytes, originally encoded with <see cref="EncodeInt32"/> from the specified reader.
    /// </summary>
    /// <param name="reader">
    /// The <see cref="BinaryReader"/> to read the encoded value from.
    /// </param>
    /// <returns>
    /// The decoded <see cref="Int32"/> value.
    /// </returns>
    /// <exception cref="ArgumentNullException">
    /// <para><paramref name="reader"/> is <c>null</c>.</para>
    /// </exception>
    public static int DecodeInt32(this BinaryReader reader)
    {
        if (reader == null)
            throw new ArgumentNullException("reader");

        bool more = true;
        int value = 0;
        int shift = 0;
        while (more)
        {
            byte lower7bits = reader.ReadByte();
            more = (lower7bits & 128) != 0;
            value |= (lower7bits & 0x7f) << shift;
            shift += 7;
        }
        return value;
    }
}
like image 150
Lasse V. Karlsen Avatar answered Oct 13 '22 21:10

Lasse V. Karlsen


You should first make an histogram of your value. If the distribution is random (that is, every bin of your histogram's count is close to the other), then you'll not be able encode more efficiently than the binary representation for this number.

If your histogram is unbalanced (that is, if some values are more present than others), then it might make sense to choose an encoding that's using less bits for these values, while using more bits for the other -unlikely- values.

For example, if the number you need to encode are 2x more likely to be smaller than 15 bits than larger, you can use the 16-th bit to tell so and only store/send 16 bits (if it's zero, then the upcoming byte will form a 16-bits numbers that can fit in a 32 bits number). If it's 1, then the upcoming 25 bits will form a 32 bits numbers. You loose one bit here but because it's unlikely, in the end, for a lot of number, you win more bits.

Obviously, this is a trivial case, and the extension of this to more than 2 cases is the Huffman algorithm that affect a "code word" that close-to optimum based on the probability of the numbers to appear.

There's also the arithmetic coding algorithm that does this too (and probably other).

In all cases, there is no solution that can store random value more efficiently than what's being done currently in computer memory.

You have to think about how long and how hard will be the implementation of such solution compared to the saving you'll get in the end to know if it's worth it. The language itself is not relevant here.

like image 33
xryl669 Avatar answered Oct 13 '22 21:10

xryl669