
Efficient binary serialization for Clojure/Java

I'm looking for a way to efficiently serialize Clojure objects into a binary format - i.e. not just doing the classic print and read text serialization.

For example, I want to do something like:

(def orig-data {:name "Data Object" 
                :data (get-big-java-array) 
                :other (get-clojure-data-stuff)})

(def binary (serialize orig-data))

;; here "binary" is a raw binary form, e.g. a Java byte array
;; so it can be persisted in key/value store or sent over network etc.

;; now check it works!

(def new-data (deserialize binary))

(= new-data orig-data)
=> true

The motivation is that I have some large data structures that contain a significant amount of binary data (in Java arrays), and I want to avoid the overhead of converting these all to text and back again. In addition, I'm trying to keep the format compact in order to minimise network bandwidth usage.

Specific features I'd like to have:

  • Lightweight, pure-Java implementation
  • Support all of Clojure's standard data structures as well as all Java primitives, arrays etc.
  • No need for extra build steps / configuration files - I'd rather it just worked "out of the box"
  • Good performance in terms of processing time required
  • Compactness in terms of binary encoded representation

What's the best / standard approach to doing this in Clojure?

asked Oct 09 '11 05:10 by mikera


3 Answers

I may be missing something here, but what's wrong with the standard Java serialization? Too slow, too big, something else?

A Clojure wrapper for plain Java serialization could be something like this:

(defn serializable? [v]
  (instance? java.io.Serializable v))

(defn serialize 
  "Serializes value, returns a byte array"
  [v]
  (let [buff (java.io.ByteArrayOutputStream. 1024)]
    (with-open [dos (java.io.ObjectOutputStream. buff)]
      (.writeObject dos v))
    (.toByteArray buff)))

(defn deserialize 
  "Accepts a byte array, returns deserialized value"
  [bytes]
  (with-open [dis (java.io.ObjectInputStream.
                   (java.io.ByteArrayInputStream. bytes))]
    (.readObject dis)))

 user> (= (range 10) (deserialize (serialize (range 10))))
 true

There are values that cannot be serialized, e.g. Java streams and Clojure atom/agent/future, but it should work for most plain values, including Java primitives and arrays and Clojure functions, collections and records.
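
One caveat worth illustrating: the serializable? check only looks at the top-level value, so a collection that holds an atom still passes it and only fails once writeObject reaches the nested element. The sketch below repeats the wrapper so it runs standalone; serializes-ok? is a hypothetical helper added for illustration, not part of the answer above.

```clojure
;; Repeated from the wrapper above so this snippet runs on its own.
(defn serializable? [v]
  (instance? java.io.Serializable v))

(defn serialize [v]
  (let [buff (java.io.ByteArrayOutputStream. 1024)]
    (with-open [dos (java.io.ObjectOutputStream. buff)]
      (.writeObject dos v))
    (.toByteArray buff)))

;; Hypothetical helper: true if the value survives an actual
;; serialization attempt, false if a nested element is not serializable.
(defn serializes-ok? [v]
  (try
    (serialize v)
    true
    (catch java.io.NotSerializableException _
      false)))

(def plain     {:name "Data Object" :data (byte-array 4)})
(def with-atom {:name "Data Object" :state (atom 0)})

(println (serializable? with-atom))  ; the map itself implements Serializable
(println (serializes-ok? plain))     ; plain values and arrays go through
(println (serializes-ok? with-atom)) ; the nested atom makes writeObject throw
```

Attempting the serialization is more expensive than the instance? check, so this is probably only worth doing in tests or at system boundaries.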

Whether you actually save anything depends on the data. In my limited testing on smallish data sets, serializing to text and to binary takes about the same time and space.

But for the special case where the bulk of the data is arrays of Java primitives, Java serialization can be orders of magnitude faster and save a significant chunk of space. (Quick test on a laptop, 100k random bytes: serialize 0.9 ms, 100kB; text 490 ms, 700kB.)
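
A rough way to reproduce the size side of that comparison is sketched below. It assumes that "text" means printing the bytes as a Clojure vector with pr-str, since the test above does not say how its text figure was produced; text-size, binary-size and sample are illustrative names. Exact numbers will differ by machine and by how the text form is generated.

```clojure
;; Repeated from the wrapper above so this snippet runs on its own.
(defn serialize [v]
  (let [buff (java.io.ByteArrayOutputStream. 1024)]
    (with-open [dos (java.io.ObjectOutputStream. buff)]
      (.writeObject dos v))
    (.toByteArray buff)))

;; Assumption: the text form is the printed vector of byte values.
(defn text-size [arr]
  (count (.getBytes ^String (pr-str (vec arr)) "UTF-8")))

;; Size of the Java-serialized form of the same array.
(defn binary-size [arr]
  (alength ^bytes (serialize arr)))

;; 100k pseudo-random bytes, seeded so repeated runs use the same data.
(def sample
  (let [arr (byte-array 100000)]
    (.nextBytes (java.util.Random. 42) arr)
    arr))

(println "binary bytes:" (binary-size sample))
(println "text bytes:  " (text-size sample))
```

For timing, wrapping each call in (time ...) gives a first impression, though a proper benchmark would need warm-up runs.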

Note that the (= new-data orig-data) test doesn't work for arrays (it delegates to Java's equals, which for arrays just tests whether it's the same object), so you may want/need to write your own equality function to test the serialization.

user> (def a (range 10))
user> (= a (range 10))
true
user> (= (into-array a) (into-array a))
false
user> (.equals (into-array a) (into-array a))
false
user> (java.util.Arrays/equals (into-array a) (into-array a))
true
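
A minimal sketch of such an equality function, for round-trip tests on nested data like the map in the question. array? and deep= are hypothetical names; the sketch only descends into arrays, maps and sequential collections (not sets or records), and falls back to = for everything else.

```clojure
(defn array? [x]
  (and (some? x) (.isArray ^Class (class x))))

;; Structural equality that compares arrays element by element
;; instead of by identity.
(defn deep= [a b]
  (cond
    ;; arrays: same array type, same length, equal elements
    (and (array? a) (array? b))
    (and (= (class a) (class b))
         (= (count a) (count b))
         (every? true? (map deep= a b)))

    ;; maps: same key set, deep-equal values
    (and (map? a) (map? b))
    (and (= (set (keys a)) (set (keys b)))
         (every? (fn [k] (deep= (get a k) (get b k))) (keys a)))

    ;; vectors, lists, seqs: same length, deep-equal elements
    (and (sequential? a) (sequential? b))
    (and (= (count a) (count b))
         (every? true? (map deep= a b)))

    :else (= a b)))

(def x {:name "Data Object" :data (byte-array [1 2 3]) :other [1 (int-array [4 5])]})
(def y {:name "Data Object" :data (byte-array [1 2 3]) :other [1 (int-array [4 5])]})

(println (= x y))      ; plain = sees two different array objects
(println (deep= x y))  ; deep= compares the array contents
```

With this in place, the check from the question could become (deep= new-data orig-data).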

answered Oct 23 '22 09:10 by j-g-faustus


Nippy is one of the best choices imho: https://github.com/ptaoussanis/nippy

answered Oct 23 '22 07:10 by mpenet


Have you considered Google's protobuf? You might want to check the GitHub repository with the interface for Clojure.

answered Oct 23 '22 07:10 by Nano Taboada