Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the reason behind ZigZag encoding in Protocol Buffers and Avro?

ZigZag requires a lot of overhead to write/read numbers. Actually I was stunned to see that it doesn't just write int/long values as they are, but does a lot of additional scrambling. There's even a loop involved: https://github.com/mardambey/mypipe/blob/master/avro/lang/java/avro/src/main/java/org/apache/avro/io/DirectBinaryEncoder.java#L90

I don't seem to be able to find in Protocol Buffers docs or in Avro docs, or reason myself, what's the advantage of scrambling numbers like that? Why is it better to have positive and negative numbers alternated after encoding?

Why they're not just written in little-endian, big-endian, network order which would only require reading them into memory and possibly reverse bit endianness? What do we buy paying with performance?

like image 307
Endrju Avatar asked Nov 26 '15 09:11

Endrju


People also ask

What is ZigZag encoding?

ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too.

What encoding is protobuf?

Protobuf strings are always valid UTF-8 strings. See the Language Guide: A string must always contain UTF-8 encoded or 7-bit ASCII text. (And ASCII is always also valid UTF-8.)

How does Protobuf serialize string?

The Protobuf serialization mechanism is given through the protoc application, this compiler will parse the . proto file and will generate as output, source files according to the configured language by its arguments, in this case, C++. You can also obtain more information about, reading the section compiler invocation.


1 Answers

It is a variable length 7-bit encoding. The first byte of the encoded value has it high bit set to 0, subsequent bytes have it at 1. Which is the way the decoder can tell how many bytes were used to encode the value. Byte order is always little-endian, regardless of the machine architecture.

It is an encoding trick that permits writing as few bytes as needed to encode the value. So an 8 byte long with a value between -64 and 63 takes only one byte. Which is common, the range provided by long is very rarely used in practice.

Packing the data tightly without the overhead of a gzip-style compression method was the design goal. Also used in the .NET Framework. The processor overhead needed to en/decode the value is inconsequential. Already much lower than a compression scheme, it is a very small fraction of the I/O cost.

like image 114
Hans Passant Avatar answered Sep 24 '22 08:09

Hans Passant