Looking for a fast, compact, streamable, multi-language, strongly typed serialization format

I'm currently using JSON (compressed via gzip) in my Java project, in which I need to store a large number of objects (hundreds of millions) on disk. I have one JSON object per line, and disallow linebreaks within the JSON object. This way I can stream the data off disk line-by-line without having to read the entire file at once.
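For reference, the setup described above (one object per line, gzip-compressed, decompressed on the fly) might look roughly like the sketch below. It uses only the JDK; the class and method names are hypothetical, and the actual JSON parsing step is left as a comment because it depends on whichever parser is in use.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: class and method names are illustrative only.
public class GzipLineStream {

    // Writes one record per line. Records containing a line break are
    // rejected, because a line break would split the record on read.
    public static void writeLines(Path file, List<String> records) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (String record : records) {
                if (record.indexOf('\n') >= 0 || record.indexOf('\r') >= 0) {
                    throw new IllegalArgumentException("record contains a line break");
                }
                out.write(record);
                out.write('\n');
            }
        }
    }

    // Decompresses on the fly and passes each line to the handler, so the
    // file is never loaded into memory as a whole. Returns the line count.
    public static long readLines(Path file, Consumer<String> handler) throws IOException {
        long count = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                handler.accept(line); // parse the JSON here with the parser of your choice
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".jsonl.gz");
        writeLines(file, List.of("{\"id\":1}", "{\"id\":2}"));
        long n = readLines(file, System.out::println);
        System.out.println(n + " records read");
        Files.delete(file);
    }
}
```

With this layout the disk read and the decompression are both streaming; the per-line parse inside the handler is the remaining cost.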

It turns out that parsing the JSON (using http://www.json.org/java/) costs more than either reading the raw data off disk or decompressing it (which I do on the fly).

Ideally, what I'd like is a strongly typed serialization format, where I can specify "this object field is a list of strings" (for example), and because the system knows what to expect, it can deserialize it quickly. I could also describe the format to someone else simply by giving them its "type".
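To illustrate the idea, here is a minimal hand-rolled sketch of a fixed-schema binary record stream using only the JDK's DataOutputStream and DataInputStream. It is not any particular library's format: the Item type, its field layout, and the codec class are all assumptions made up for this example. Because reader and writer agree on the schema up front, no field names or delimiters are stored, and the reader never has to guess what comes next.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type and codec; names and layout are illustrative only.
public class TypedRecordCodec {

    // The "type": a long id plus a list of strings.
    public static final class Item {
        public final long id;
        public final List<String> tags;

        public Item(long id, List<String> tags) {
            this.id = id;
            this.tags = tags;
        }
    }

    // Writes the fields in a fixed order: id, tag count, then each tag.
    public static void write(DataOutputStream out, Item item) throws IOException {
        out.writeLong(item.id);
        out.writeInt(item.tags.size());
        for (String tag : item.tags) {
            out.writeUTF(tag); // length-prefixed modified UTF-8, at most 65535 bytes
        }
    }

    // Reads one record, or returns null when the stream ends before the id.
    // In this sketch a partially written id is also treated as end of stream.
    public static Item read(DataInputStream in) throws IOException {
        long id;
        try {
            id = in.readLong();
        } catch (EOFException endOfStream) {
            return null;
        }
        int count = in.readInt();
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            tags.add(in.readUTF());
        }
        return new Item(id, tags);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        write(out, new Item(1L, List.of("a", "b")));
        write(out, new Item(2L, List.of()));
        out.flush();

        // Records are read back one at a time, so a large file never has
        // to be held in memory at once.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        Item item;
        while ((item = read(in)) != null) {
            System.out.println(item.id + " " + item.tags);
        }
    }
}
```

A sketch like this has no schema evolution and no cross-language tooling, which is the part that dedicated serialization formats provide.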

It would also need to be cross-platform. I use Java, but work with people using PHP, Python, and other languages.

So, to recap, it should be:

  • Strongly typed
  • Streamable (i.e., read a file piece by piece without having to load it all into RAM at once)
  • Cross-platform (including Java and PHP)
  • Fast
  • Free (as in speech)

Any pointers?

sanity asked Jul 28 '09 02:07


3 Answers

Have you looked at Google Protocol Buffers?

http://code.google.com/apis/protocolbuffers/

They're cross-platform (C++, Java, Python), with third-party bindings for PHP as well. They're fast, fairly compact, and strongly typed.

There's also a useful comparison between various formats here:

http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

You might also want to consider Thrift or one of the other formats mentioned in that comparison.

Jon answered Oct 08 '22 12:10


I've had very good results parsing JSON with Jackson.

Jackson is a:

  • Streaming (reading, writing)
  • Fast (measured to be faster than any other Java JSON parser and data binder)
  • Powerful (full data binding for common JDK classes as well as any Java bean class, Collection, Map or Enum)
  • Zero-dependency (does not rely on other packages beyond JDK)
  • Open Source (LGPL or Apache License)
  • Fully conformant

JSON processor (JSON parser + JSON generator) written in Java. Beyond basic JSON reading/writing (parsing, generating), it also offers a full node-based Tree Model, as well as full OJM (Object/JSON Mapper) data binding functionality.

Its performance is very good when compared to many other serialisation options.

Robert Munteanu answered Oct 08 '22 14:10


You could take a look at YAML: http://www.yaml.org/

It's a superset of JSON (strictly so as of YAML 1.2), so the data file structure will be familiar to you. It supports some additional data types, as well as references (anchors and aliases) that let you reuse part of one data structure in another.

I have no idea whether it will be "fast enough", but the libyaml parser (written in C) seems pretty snappy.

Sharpie answered Oct 08 '22 12:10