Shallow-copying into a protocol buffer's bytes field

Suppose I have a proto with a bytes field:

message MyProto {
    optional bytes data = 1;
}

An API that I do not control gives me a pointer to source data and its size. I want to make a MyProto out of this data without deep copying. I thought this would be easy to do, but it appears to be impossible. Deep copying is easy with set_data. Protobuf provides a set_allocated_data function, but it takes a pointer to a std::string, which does not help me, since (unless I'm mistaken) there is no way to make a std::string without deep copying into it.

void populateProto(void* data, size_t size, MyProto* message) {
    // Deep copy is fine, I guess.
    message->set_data(data, size);

    // Shallow copy would be better...
    // message->set_allocated_data( ??? );
}

Is there any way to properly populate this proto (such that it can be serialized later) without deep copying the source data into the bytes field?

I'm aware that I could manually do the serializing right away, but I'd rather not, if possible.
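For reference, here is a rough sketch of what serializing by hand would involve, assuming the standard wire format for a length-delimited field: a tag varint, a length varint, then the raw bytes. The helper names appendVarint and bytesFieldPrefix are hypothetical and not part of the protobuf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: appends `value` to `out` as a base-128 varint
// (7 payload bits per byte, high bit set on every byte except the last).
inline void appendVarint(std::uint64_t value, std::string* out) {
    while (value >= 0x80) {
        out->push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out->push_back(static_cast<char>(value));
}

// Hypothetical helper: builds only the bytes that precede the payload of a
// length-delimited field. The payload itself is never read or copied here.
inline std::string bytesFieldPrefix(std::uint32_t field_number,
                                    std::size_t payload_size) {
    assert(field_number >= 1);
    std::string prefix;
    // Tag = (field_number << 3) | wire type 2 (length-delimited).
    appendVarint((static_cast<std::uint64_t>(field_number) << 3) | 2u, &prefix);
    appendVarint(static_cast<std::uint64_t>(payload_size), &prefix);
    return prefix;
}
```

The caller would write bytesFieldPrefix(1, size) to the output and then write the source buffer directly after it. Since MyProto has only this one field, those two writes together form the complete serialized message, with no intermediate copy of the data.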

Asked by Chris on Apr 28 '17 at 20:04



1 Answer

Great question. The options are:

  1. If you can alter your .proto file, consider implementing the ctype field option for StringPiece, Google's equivalent of C++17 std::string_view. This is how Google would handle such a case internally. The FieldOptions message already has semantics for StringPiece, but Google has not yet open-sourced the implementation. UPDATE: according to an online developer discussion, StringPiece is obsolete, which may render this option moot.

    message MyProto {
        optional bytes data = 1 [ctype = STRING_PIECE];
    }
    
  2. Use a different protocol buffer implementation, perhaps only for this particular message type. protobuf-c and protobluff are C-language implementations that look promising.

  3. Have the third-party API write directly into the proto's own buffer. I see from the comments that you can't, but I'm including it for completeness.

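    ::std::string * buf = myProto->mutable_data();
    buf->resize(size);
    /* Storage is contiguous per the C++11 standard. Use &(*buf)[0] for a
     * writable pointer: non-const data() only exists from C++17 onward. */
    api(&(*buf)[0], size);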
    
  4. NON STANDARD: Break encapsulation by overwriting the data in a string instance. C++ has some gnarly features that give you enough rope to hang yourself. This option is not safe and depends on your std::string implementation and other factors.

    // NEVER USE THIS IN PRODUCTION
    void string_jam(::std::string * target, void * buffer, size_t len) {
      /* On my system, std::string layout
       *   0: size_t capacity
       *   8: size_t size
       *  16: char * data (iff strlen > 22 chars) */
      assert(target->size() > 22);
      size_t * size_ptr = (size_t*)target;
      size_ptr[0] = len; // Overwrite capacity
      size_ptr[1] = len; // Overwrite length
    
      char ** buf_ptr = (char**)(size_ptr + 2); 
      free(*buf_ptr); // Free the existing buffer
      *buf_ptr = (char*)buffer; // Jam in our new buffer
    }
    

Note: Don't do this in production. It is useful only in testing, to measure how much performance you would actually gain from the zero-copy route.

If you go with option #1, it would be great if you could release the source code, as many others would benefit from this capability. Best of luck.

Answered by Alejandro C De Baca on Oct 11 '22 at 18:10