 

Improving performance of protocol buffers

I'm writing an application that needs to quickly deserialize millions of messages from a single file.

Essentially, the application gets one message from the file, does some work, and then throws the message away. Each message is composed of ~100 fields (not all of them are always parsed, but I need them all because the user of the application can decide which fields to work on).

At the moment the application consists of a loop that reads one message per iteration with a readDelimitedFrom() call.

Is there a way to optimize for this case (splitting into multiple files, etc.)? In addition, because of the number of messages and the size of each message, I need to gzip the file (and it is fairly effective in reducing the size, since the field values are quite repetitive) - this, though, reduces performance.
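For reference, readDelimitedFrom() frames each message with a varint length prefix. Below is a minimal, stdlib-only sketch of that framing and of the read loop over a gzipped stream; writeDelimited/readDelimited are illustrative stand-ins written for this example, not the actual protobuf API:

```java
import java.io.*;
import java.util.zip.*;

public class VarintFraming {
    // Write a varint length prefix followed by the payload -- the same wire
    // framing that protobuf's writeDelimitedTo() produces.
    static void writeDelimited(OutputStream out, byte[] payload) throws IOException {
        int v = payload.length;
        while ((v & ~0x7F) != 0) {        // emit 7 bits at a time, MSB = "more bytes follow"
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        out.write(payload);
    }

    // Read one varint-prefixed payload; returns null at clean EOF.
    static byte[] readDelimited(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) return null;
        int len = b & 0x7F, shift = 7;
        while ((b & 0x80) != 0) {         // continue while the MSB is set
            b = in.read();
            len |= (b & 0x7F) << shift;
            shift += 7;
        }
        byte[] buf = new byte[len];
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // Write two framed "messages" through gzip, then read them back in a loop.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(raw)) {
            writeDelimited(gz, "message-1".getBytes("UTF-8"));
            writeDelimited(gz, "message-2".getBytes("UTF-8"));
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(raw.toByteArray()))) {
            byte[] m;
            while ((m = readDelimited(in)) != null) {
                System.out.println(new String(m, "UTF-8"));  // prints message-1 then message-2
            }
        }
    }
}
```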

Sebastiano Merlino asked Feb 18 '14



1 Answer

If CPU time is your bottleneck (which is unlikely if you are loading directly from HDD with cold cache, but could be the case in other scenarios), then here are some ways you can improve throughput:

  • If possible, use C++ rather than Java, and reuse the same message object for each iteration of the loop. This reduces the amount of time spent on memory management, as the same memory will be reused each time.

  • Instead of using readDelimitedFrom(), construct a single CodedInputStream and use it to read multiple messages like so:

    // Do this once:
    CodedInputStream cis = CodedInputStream.newInstance(input);
    
    // Then read each message like so:
    int limit = cis.pushLimit(cis.readRawVarint32());  // read the length prefix, bound the next parse
    builder.mergeFrom(cis);
    cis.popLimit(limit);
    cis.resetSizeCounter();  // avoid tripping the stream's cumulative size limit
    

    (A similar approach works in C++.)

  • Use Snappy or LZ4 compression rather than gzip. These algorithms still get reasonable compression ratios but are optimized for speed. (LZ4 is probably better, though Snappy was developed by Google with Protobufs in mind, so you might want to test both on your data set.)

  • Consider using Cap'n Proto rather than Protocol Buffers. Unfortunately, there is no Java version yet. EDIT: There is now capnproto-java, and also implementations in many other languages. In the languages it supports, it has been shown to be quite a bit faster. (Disclosure: I am the author of Cap'n Proto. I am also the author of Protocol Buffers v2, which is the version Google released open source.)

Kenton Varda answered Oct 06 '22