Simple working example of GzipOutputStream and GzipInputStream with Protocol Buffers

After some days of experimenting with Protocol Buffers, I tried to compress the files. With Python this is quite simple to do and does not require any playing with streams.

Since most of our code is written in C++, I would like to compress/decompress files in the same language. I've tried the Boost gzip library, but could not get it to work (the output is not compressed):

int writeEventCollection(HEP::MyProtoBufClass* protobuf, std::string filename,
                         unsigned int compressionLevel) {
    ofstream file(filename.c_str(), ios_base::out | ios_base::binary);
    filtering_streambuf<output> out;
    out.push(gzip_compressor(compressionLevel));
    out.push(file);
    if (!protobuf->SerializeToOstream(&file)) { // serialising to the wrong stream, I assume
        cerr << "Failed to write ProtoBuf." << endl;
        return -1;
    }
    return 0;
}

I've searched for examples utilising GzipOutputStream and GzipInputStream with Protocol Buffers but could not find a working example.

As you have probably noticed by now, I am a beginner at best with streams and would really appreciate a fully working example, as in http://code.google.com/apis/protocolbuffers/docs/cpptutorial.html (I have my address_book; how do I save it in a gzipped file?).

Thank you in advance.

EDIT: Working examples.

Example 1, following the answer here on StackOverflow:

int writeEventCollection(shared_ptr<HEP::EventCollection> eCollection,
                         std::string filename, unsigned int compressionLevel) {
    filtering_ostream out;
    out.push(gzip_compressor(compressionLevel));
    out.push(file_sink(filename, ios_base::out | ios_base::binary));
    if (!eCollection->SerializeToOstream(&out)) {
        cerr << "Failed to write event collection." << endl;
        return -1;
    }
    return 0;
}

Example 2, following an answer on Google's Protobuf discussion group:

int writeEventCollection2(shared_ptr<HEP::EventCollection> eCollection,
                          std::string filename, unsigned int compressionLevel) {
    using namespace google::protobuf::io;
    int filedescriptor = open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC,
                              S_IREAD | S_IWRITE);
    if (filedescriptor == -1) {
        throw "open failed on output file";
    }
    FileOutputStream file_stream(filedescriptor);
    GzipOutputStream::Options options;
    options.format = GzipOutputStream::GZIP;
    options.compression_level = compressionLevel;
    GzipOutputStream gzip_stream(&file_stream, options);
    if (!eCollection->SerializeToZeroCopyStream(&gzip_stream)) {
        cerr << "Failed to write event collection." << endl;
        close(filedescriptor);
        return -1;
    }
    // Flush the gzip trailer and the stream buffers before the fd goes away.
    gzip_stream.Close();
    file_stream.Close(); // this also closes filedescriptor
    return 0;
}

Some comments on performance (reading our current format and writing 11146 ProtoBuf files). Example 1:

real    13m1.185s 
user    11m18.500s 
sys     0m13.430s 
CPU usage: 65-70% 
Size of test sample: 4.2 GB (uncompressed 7.7 GB, our current compressed format: 7.7 GB)

Example 2:

real    12m37.061s 
user    10m55.460s 
sys     0m11.900s 
CPU usage: 90-100% 
Size of test sample: 3.9 GB

It seems that Google's method uses the CPU more efficiently and is slightly faster (although I expect the timing difference to be within measurement accuracy), and it produces a ~7% smaller dataset at the same compression setting.

Asked Oct 04 '11 by DragonTux

1 Answer

Your assumption is right: the code you posted doesn't work because you're writing directly to the ofstream instead of through the filtering_streambuf. To make this work you can use filtering_ostream instead:

ofstream file(filename.c_str(), ios_base::out | ios_base::binary); 
filtering_ostream out; 
out.push(gzip_compressor(compressionLevel)); 
out.push(file);

if (!protobuf->SerializeToOstream(&out)) {
    // ... etc.
}

Or more succinctly, using file_sink:

filtering_ostream out; 
out.push(gzip_compressor(compressionLevel)); 
out.push(file_sink(filename, ios_base::out | ios_base::binary));

if (!protobuf->SerializeToOstream(&out)) {
    // ... etc.
}

I hope this helps!

Answered Nov 15 '22 by ChrisN