
Stream while serializing with Cap'n'Proto

Tags: c++, capnproto

Consider a Cap'n'Proto schema like this:

struct Document {
  header @0 :Header;
  records @1 :List(Record);  # usually a large number of records
  footer @2 :Footer;
}
struct Header {
  numberOfRecords @0 :UInt32;
  # some fields
}
struct Footer {
  # some fields
}
struct Record {
  type @0 :UInt32;
  desc @1 :Text;
  # some more fields, relatively large in total
}

Now I want to serialize (i.e. build) a document instance and stream it to a remote destination.

Since the document is usually very large, I don't want to build it completely in memory before sending it. Instead, I am looking for a builder that sends struct by struct directly over the wire, so that the additional memory buffer needed is constant, i.e. O(max(sizeof(Header), sizeof(Record), sizeof(Footer))).

Looking at the tutorial material, I can't find such a builder. The MallocMessageBuilder seems to create everything in memory first (and then you call writeMessageToFd on it).

Does the Cap'n'Proto API support such a use-case?

Or is Cap'n'Proto more meant to be used for messages that fit into memory before sending?

In this example, the Document struct could be omitted, and one could just send a sequence of one Header message, n Record messages and one Footer message. Since a Cap'n'Proto message is self-delimiting, this should work. But you lose your document root, and sometimes that is not really an option.

maxschlepzig asked Jan 16 '16 07:01



1 Answer

The solution you outlined -- sending the parts of the document as separate messages -- is probably best for your use case. Fundamentally, Cap'n Proto is not designed for streaming chunks of a single message, since that would not fit well with its random-access properties (e.g. what happens when you try to follow a pointer that points to a chunk you haven't received yet?). Instead, when you want streaming, you should split a large message into a series of smaller messages.
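To illustrate the shape of the "series of smaller messages" approach, here is a minimal sketch that deliberately does not use the Cap'n Proto library. It stands in a length-prefixed frame for each message, and the names writeFrame, readFrame, streamDocument and makeRecord are hypothetical. With Cap'n Proto itself, each frame would presumably be a separate message built in its own MallocMessageBuilder and written with writeMessageToFd, which already produces self-delimiting output.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes one self-delimiting frame
// (4-byte little-endian length followed by the payload bytes).
void writeFrame(std::ostream& out, const std::string& payload) {
  const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
  char len[4];
  for (int i = 0; i < 4; ++i) {
    len[i] = static_cast<char>((n >> (8 * i)) & 0xff);
  }
  out.write(len, 4);
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

// Reads the next frame into `payload`. Returns false when the stream
// ends before a complete frame could be read.
bool readFrame(std::istream& in, std::string& payload) {
  unsigned char len[4];
  if (!in.read(reinterpret_cast<char*>(len), 4)) return false;
  std::uint32_t n = 0;
  for (int i = 0; i < 4; ++i) {
    n |= static_cast<std::uint32_t>(len[i]) << (8 * i);
  }
  payload.resize(n);
  return static_cast<bool>(in.read(&payload[0], n));
}

// Streams one header frame, `count` record frames and one footer frame.
// Only a single record is held in memory at any time; `makeRecord` is an
// assumed callback that produces the bytes of record i on demand.
template <typename MakeRecord>
void streamDocument(std::ostream& out, std::uint32_t count,
                    MakeRecord makeRecord) {
  writeFrame(out, "header:" + std::to_string(count));
  for (std::uint32_t i = 0; i < count; ++i) {
    writeFrame(out, makeRecord(i));
  }
  writeFrame(out, "footer");
}
```

The receiver reads frames in a loop and processes each one before reading the next, so neither side needs more than one record's worth of buffer. The cost is the one noted in the question: there is no single document root, so the header has to carry whatever the reader needs to interpret the frames that follow.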

That said, unlike other similar systems (e.g. Protobuf), Cap'n Proto does not strictly require messages to fit into memory. Specifically, you can do some tricks using mmap(2). If your document data is coming from a file on disk, you can mmap() the file into memory and then incorporate it into your message. With mmap(), the operating system does not actually read the data from disk until you attempt to access the memory, and the OS can also purge the pages from memory after they are accessed since it knows it still has a copy on disk. This often lets you write much simpler code, since you no longer need to think about memory management.

In order to incorporate an mmap()ed chunk into a Cap'n Proto message, you'll want to use capnp::Orphanage::referenceExternalData(). For example, given:

struct MyDocument {
  body @0 :Data;
  # (other fields)
}

You might write:

// Needs <sys/mman.h>, <kj/debug.h>, <capnp/message.h> and the generated
// schema header; fd and size are assumed to be set up already.

// Map the file into memory.
void* ptr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
if (ptr == MAP_FAILED) {
  KJ_FAIL_SYSCALL("mmap", errno);
}
auto data = capnp::Data::Reader(
    reinterpret_cast<const kj::byte*>(ptr), size);

// Incorporate it into a message without copying.
capnp::MallocMessageBuilder message;
auto root = message.initRoot<MyDocument>();
root.adoptBody(
    message.getOrphanage().referenceExternalData(data));

Because Cap'n Proto is zero-copy, it will end up writing the mmap()ed memory directly out to the socket without ever accessing it. It's then up to the OS to read the content from disk and out to the socket as appropriate.

Of course, you still have a problem on the receiving end. You'll find it a lot more difficult to design the receiving end to read into mmap()ed memory. One strategy might be to dump the entire stream directly to a file first (without involving the Cap'n Proto library), then mmap() that file and use capnp::FlatArrayMessageReader to read the mmap()ed data in-place.
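The "dump to a file, then mmap() it" strategy could be sketched roughly as follows, using only POSIX calls. The names spoolToFile and MappedFile are hypothetical, and error handling is kept minimal (for example, EINTR is not retried). The Cap'n Proto step is left out on purpose: after mapping, one would presumably view the mapped bytes as an array of capnp::word and hand that to capnp::FlatArrayMessageReader.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Copies everything readable from `inFd` into the file at `path` using a
// fixed-size buffer, so memory use does not grow with the stream length.
// Returns the total number of bytes written.
std::size_t spoolToFile(int inFd, const std::string& path) {
  int outFd = open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (outFd < 0) throw std::runtime_error("open for writing failed");
  char buf[4096];
  std::size_t total = 0;
  for (;;) {
    ssize_t n = read(inFd, buf, sizeof buf);
    if (n < 0) { close(outFd); throw std::runtime_error("read failed"); }
    if (n == 0) break;  // end of stream
    ssize_t off = 0;
    while (off < n) {  // write() may be partial, so loop until done
      ssize_t w = write(outFd, buf + off, static_cast<std::size_t>(n - off));
      if (w < 0) { close(outFd); throw std::runtime_error("write failed"); }
      off += w;
    }
    total += static_cast<std::size_t>(n);
  }
  close(outFd);
  return total;
}

// Read-only mapping of a whole file; unmapped on destruction.
struct MappedFile {
  const unsigned char* data = nullptr;
  std::size_t size = 0;

  explicit MappedFile(const std::string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("open for reading failed");
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); throw std::runtime_error("fstat failed"); }
    size = static_cast<std::size_t>(st.st_size);
    if (size > 0) {
      void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (p == MAP_FAILED) { close(fd); throw std::runtime_error("mmap failed"); }
      data = static_cast<const unsigned char*>(p);
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
  }
  ~MappedFile() {
    if (data != nullptr) munmap(const_cast<unsigned char*>(data), size);
  }
  MappedFile(const MappedFile&) = delete;
  MappedFile& operator=(const MappedFile&) = delete;
};
```

Because mmap() returns page-aligned memory, the mapped region also satisfies the word alignment a message reader would expect, provided the message starts at offset 0 of the file.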

I describe all this because it's a neat thing that is possible with Cap'n Proto but not most other serialization frameworks (e.g. you couldn't do this with Protobuf). Playing tricks with mmap() is sometimes really useful -- I've used this successfully in several places in Sandstorm, Cap'n Proto's parent project. However, I suspect that for your use case, splitting the document into a series of messages probably makes more sense.

Kenton Varda answered Nov 12 '22 14:11
