Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Go: Mechanics of binary encoding

Tags:

go

I'm trying to make a new binary encoding package because the standard Go encoding/binary package does not do exactly what I want.

What I do not understand is why encoding/binary is using x >>= 7 in binary.PutUvarint instead of x >>= 8. If I understand correctly this is deliberately shifting the bits by 7 instead of by 8, which results in an overall size of 80 bits to store a uint64 instead of the 64 bits that is clearly the number of bits required. Why? What is the reason for this? This must be something to do with the fact that it is generating variable length slices of bytes, but why would >>7 help this?

Here is the binary encoding function for your reference:

func PutUvarint(buf []byte, x uint64) int {
    i := 0
    for x >= 0x80 {
        buf[i] = byte(x) | 0x80
        x >>= 7
        i++
    }
    buf[i] = byte(x)
    return i + 1
}
like image 466
Alasdair Avatar asked Aug 19 '14 07:08

Alasdair


People also ask

What is a Golang binary?

Go Binaries is an open-source server allowing non-Go users to quickly install tools written in Golang, without installing the Go compiler or a package manager — all you need is curl .

What is encoding in binary?

Binary encoding uses the binary digit, or bit, as the fundamental unit of information, and a bit may only be a '0' or a '1' (only two possibilities since it is a binary-encoded system). By combining bits, numbers larger than 0 or 1 may be represented, and these bit collections are called words.


1 Answers

encoding/binary/varint.go

package binary

// This file implements "varint" encoding of 64-bit integers.
// The encoding is:
// - unsigned integers are serialized 7 bits at a time, starting with the
//   least significant bits
// - the most significant bit (msb) in each output byte indicates if there
//   is a continuation byte (msb = 1)
// - signed integers are mapped to unsigned integers using "zig-zag"
//   encoding: Positive values x are written as 2*x + 0, negative values
//   are written as 2*(^x) + 1; that is, negative numbers are complemented
//   and whether to complement is encoded in bit 0.
//
// Design note:
// At most 10 bytes are needed for 64-bit values. The encoding could
// be more dense: a full 64-bit value needs an extra byte just to hold bit 63.
// Instead, the msb of the previous byte could be used to hold bit 63 since we
// know there can't be more than 64 bits. This is a trivial improvement and
// would reduce the maximum encoding length to 9 bytes. However, it breaks the
// invariant that the msb is always the "continuation bit" and thus makes the
// format incompatible with a varint encoding for larger numbers (say 128-bit).

A classic technique for lossless data compression is Huffman coding where the more common symbols are generally represented using fewer bits than less common symbols. In practice, smaller numbers occur most often. Therefore, if we can represent small numbers efficiently, even if larger numbers are represented less efficiently, we will get lossless data compression.

Uvarints are a method of serializing unsigned integers using one or more bytes. Smaller numbers take a smaller number of bytes. Each byte in a uvarint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the number in groups of 7 bits, least significant group first. We can do this for unsigned integers with any number of bits.

For example,

uint bits : uvarint bytes
7 7f : 1 7f 
14 3fff : 2 ff7f 
21 1fffff : 3 ffff7f 
28 fffffff : 4 ffffff7f 
35 7ffffffff : 5 ffffffff7f 
42 3ffffffffff : 6 ffffffffff7f 
49 1ffffffffffff : 7 ffffffffffff7f 
56 ffffffffffffff : 8 ffffffffffffff7f 
63 7fffffffffffffff : 9 ffffffffffffffff7f 
64 ffffffffffffffff : 10 ffffffffffffffffff01 

And so on, for as many uint bits as we want.

If we have a lots of numbers that are represented as 1 to 49 bits of an unsigned 64-bit integer serialized to uvarints of 1 to 7 bytes, we won't care if we have a few numbers that are represented as 57 to 64 bits of an unsigned 64-bit integer serialized to uvarints of 9 to 10 bytes.

References:

Huffman coding

Variable-length quantity

like image 128
peterSO Avatar answered Sep 30 '22 19:09

peterSO