
Compact binary representation of JSON

Are there any compact binary representations of JSON out there? I know there is BSON, but even that webpage says "in many cases is not much more efficient than JSON. In some cases BSON uses even more space than JSON".

I'm looking for a format that's as compact as possible, preferably some kind of open standard.

Shezan Baig asked Feb 03 '11 23:02





1 Answer

You could take a look at the Universal Binary JSON specification. It won't be as compact as Smile because it doesn't do name references, but it is 100% compatible with JSON (whereas BSON and BJSON define data structures that don't exist in JSON, so there is no standard conversion to or from them).

It is also (intentionally) criminally simple to read and write with a standard format of:

[type, 1-byte char]([length, 4-byte int32])([data])

So simple data types begin with an ASCII marker code like 'I' for a 32-bit int, 'T' for true, 'Z' for null, 'S' for string and so on.
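As a rough illustration of that layout, here is a minimal Python sketch of a writer that handles only the four markers named above ('I', 'T', 'Z', 'S'). The function name `encode_value` is hypothetical. The big-endian byte order follows what the answer says about the spec further down. Other types are deliberately left out because their markers are not given here.

```python
import struct

def encode_value(value):
    """Encode one value using only the markers named in the answer."""
    if value is None:
        return b"Z"  # null: marker byte only, no payload
    if value is True:
        return b"T"  # true: marker byte only, no payload
    if isinstance(value, int) and not isinstance(value, bool):
        # 'I' followed by a 4-byte big-endian signed int32.
        # struct.pack raises struct.error if the value does not fit.
        return b"I" + struct.pack(">i", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        # 'S', then the length in bytes (not characters), then the data.
        return b"S" + struct.pack(">i", len(data)) + data
    raise TypeError(f"no marker shown here for {type(value).__name__}")

print(encode_value(None))   # b'Z'
print(encode_value(7))      # b'I\x00\x00\x00\x07'
print(encode_value("hi"))   # b'S\x00\x00\x00\x02hi'
```

Note that the string length counts encoded bytes, so a non-ASCII string has a length prefix larger than its character count.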

The format is engineered to be fast to read: all data structures are prefixed with their size, so there is no scanning for null-terminated sequences.

For example, consider reading a string laid out like this (the [] characters are just for illustration; they are not written in the format):

[S][512][this is a really long 512-byte UTF-8 string....]

You would see the 'S', switch on it to process a string, read the 4-byte integer "512" that follows it, and know that you can grab the next 512 bytes in one chunk and decode them back to a string.
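The read path just described can be sketched in Python as follows. This is a minimal sketch based only on the layout in this answer (1-byte marker, 4-byte big-endian int32 length, then the data), not the reference implementation. The helper names `read_exact` and `read_value` are made up for illustration.

```python
import io
import struct

def read_exact(stream, n):
    """Read exactly n bytes or fail; a short read means truncated data."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data

def read_value(stream):
    """Read one value by switching on its 1-byte marker."""
    marker = read_exact(stream, 1)
    if marker == b"S":
        # The length prefix tells us how many bytes to grab in one chunk.
        (length,) = struct.unpack(">i", read_exact(stream, 4))
        if length < 0:
            raise ValueError("negative string length")
        return read_exact(stream, length).decode("utf-8")
    if marker == b"I":
        return struct.unpack(">i", read_exact(stream, 4))[0]
    if marker == b"T":
        return True
    if marker == b"Z":
        return None
    raise ValueError(f"unhandled marker {marker!r}")

# Build the [S][512][...] example from above and read it back.
text = "x" * 512
payload = b"S" + struct.pack(">i", 512) + text.encode("utf-8")
assert read_value(io.BytesIO(payload)) == text
```

Because the length is known up front, the reader never has to scan the payload for a terminator or for escaped quote characters, as a JSON parser must.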

Similarly, numeric values are written without a length value to be more compact, because each type (byte, int32, int64, double) defines its own length in bytes (1, 4, 8 and 8 respectively). There is also support for arbitrarily long numbers that is extremely portable, even on platforms that don't support them.
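Those fixed widths can be checked with Python's `struct` module. The sketch below is only an illustration: the table is keyed by type name rather than by marker byte, since only the 'I' marker for int32 is given in this answer, and `payload_size` is a hypothetical helper. Big-endian byte order and IEEE 754 doubles are taken from the answer's own description of the spec.

```python
import struct

# struct format codes for the four fixed-width numeric types.
# ">" selects big-endian byte order with standard (platform-independent) sizes.
FIXED_WIDTH_FORMATS = {
    "byte": ">b",    # 1-byte signed integer
    "int32": ">i",   # 4-byte signed integer
    "int64": ">q",   # 8-byte signed integer
    "double": ">d",  # 8-byte IEEE 754 double
}

def payload_size(type_name):
    """Number of payload bytes that follow the marker for this type."""
    return struct.calcsize(FIXED_WIDTH_FORMATS[type_name])

for name in FIXED_WIDTH_FORMATS:
    print(name, payload_size(name))  # byte 1, int32 4, int64 8, double 8

# An int32 therefore costs 1 marker byte + 4 payload bytes, with no length field.
encoded = b"I" + struct.pack(">i", 1234567890)
assert len(encoded) == 5
```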

On average you should see a size reduction of roughly 30% with a well-balanced JSON object (lots of mixed types). If you want to know exactly how certain structures compress or don't compress, you can check the Size Requirements section of the spec to get an idea.
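To get a feel for why the result depends on the data, the following Python sketch compares the JSON text size of a few single values with the size the layout described in this answer would give them (1 marker byte, plus a 4-byte length for strings). It covers only int32 and string values, and `binary_size` and `json_size` are hypothetical helpers, so treat it as a back-of-the-envelope check rather than a benchmark.

```python
import json

def binary_size(value):
    """Bytes needed under the marker/length/data layout described here."""
    if isinstance(value, int) and not isinstance(value, bool):
        return 1 + 4  # 'I' marker + int32 payload
    if isinstance(value, str):
        return 1 + 4 + len(value.encode("utf-8"))  # 'S' + length + data
    raise TypeError("only int32 and string values are covered in this sketch")

def json_size(value):
    """Bytes needed for the same value as compact JSON text."""
    return len(json.dumps(value).encode("utf-8"))

for sample in (1234567890, 7, "hi"):
    print(repr(sample), "json:", json_size(sample), "binary:", binary_size(sample))
```

A ten-digit integer shrinks from 10 bytes to 5, while a single-digit integer grows from 1 byte to 5 and a short string picks up a few bytes of overhead, which is why the mix of types in the object matters.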

On the bright side, regardless of compression, the data will be written in a more optimized format and be faster to work with.

I checked in the core Input/OutputStream implementations for reading and writing the format to GitHub today. I'll check in general reflection-based object mapping later this week.

You can just look at those two classes to see how to read and write the format; I think the core logic is something like 20 lines of code. The classes are longer because of abstractions around the methods and some structure for checking the marker bytes to make sure the data file is in a valid format, things like that.

If you have really specific questions, like the endianness of the spec (big-endian) or the numeric format for doubles (IEEE 754), all of that is covered in the spec doc, or just ask me.

Hope that helps!

Riyad Kalla answered Jan 02 '23 04:01
