New line character in serialized messages

Question

Some protobuf messages, when serialized to a string, contain a new line character. Usually, when the first field of the message is a string, the new line character appears at the very start of the serialized message. But we also found messages with a new line character somewhere in the middle.

The new line character becomes a problem when you want to save the messages into a single file, one message per line. The new line character breaks the line and makes the message invalid.

example.proto

syntax = "proto3";

package data_sources;

message StringFirst {
  string key = 1;
  bool valid = 2;
}

message StringSecond {
  bool valid = 1;
  string key = 2;
}

example.py

from protocol_buffers.data_sources.example_pb2 import StringFirst, StringSecond

print(StringFirst(key='some key').SerializeToString())
print(StringSecond(key='some key').SerializeToString())

output

b'\n\x08some key'
b'\x12\x08some key'

Is this expected / desired behaviour? How can one prevent the new line character?

Marc Gravell · Accepted Answer

protobuf is a binary protocol (unless you're talking about the optional json thing). So: any time you're treating it as text-like in any way, you're using it wrong and the behaviour will be undefined. This includes worrying about whether there are CR/LF characters, but it also includes things like the nul-character (0x00), which is often interpreted as end-of-string in text-based APIs in many frameworks (in particular, C-strings).

Specifically:

LF (0x0A) is identical to the field header for "field 1, length-prefixed"
CR (0x0D) is identical to the field header for "field 1, fixed 32-bit"
any of 0x00, 0x0A or 0x0D could occur as a length prefix (to signify a length of 0, 10, or 13)
any of 0x00, 0x0A or 0x0D could occur naturally in binary data (bytes)
any of 0x00, 0x0A or 0x0D could occur naturally in any numeric type
0x0A or 0x0D could occur naturally in text data (as could 0x00 if your originating framework allows nul-characters arbitrarily in strings, so... not C-strings)
and probably a range of other things
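The first two items in the list can be checked by hand. The sketch below assumes the standard protobuf wire format, where a field header (tag) is `(field_number << 3) | wire_type`; the helper name `field_tag` and the constants are made up for illustration and are not part of the protobuf library.

```python
# Wire-type numbers as defined by the protobuf encoding.
WIRE_VARINT = 0
WIRE_FIXED64 = 1
WIRE_LENGTH_DELIMITED = 2
WIRE_FIXED32 = 5


def field_tag(field_number: int, wire_type: int) -> int:
    """Return the one-byte field header for a small field number (1..15).

    Hypothetical helper for illustration only; larger field numbers need
    a multi-byte varint, which this sketch does not cover.
    """
    if not 1 <= field_number <= 15:
        raise ValueError("this sketch only covers one-byte tags")
    return (field_number << 3) | wire_type


lf_tag = field_tag(1, WIRE_LENGTH_DELIMITED)  # string/bytes/sub-message in field 1
cr_tag = field_tag(1, WIRE_FIXED32)           # fixed32/sfixed32/float in field 1

print(hex(lf_tag), lf_tag == ord("\n"))
print(hex(cr_tag), cr_tag == ord("\r"))

# Hand-built encoding of StringFirst(key='some key'):
# field header, then the length prefix, then the UTF-8 payload.
payload = "some key".encode("utf-8")
encoded = bytes([lf_tag, len(payload)]) + payload
print(encoded)
```

The same arithmetic explains the question's second output: field 2 with a length-prefixed value gives `(2 << 3) | 2`, which is 0x12.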

So: again - if the inclusion of "special" text characters is problematic: you're using it wrong.

The most common way to handle binary data as text is to use a base-N encode; base-16 (hex) is convenient to display and read, but base-64 is more efficient in terms of the number of characters required to convey the same number of bytes. So if possible: convert to/from base-64 as required. The base-64 alphabet consists only of printable characters, so as long as your encoder doesn't wrap its output into lines (as MIME-style encoders do), you will never encounter CR/LF/nul.
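As a minimal sketch of the "one message per line" use case from the question, the following writes each serialized message as one base-64 line and reads them back. The byte strings stand in for `SerializeToString()` output so that the example runs without the generated module; the function names and the file name are assumptions made for illustration.

```python
import base64
import os
import tempfile

# Stand-ins for SerializeToString() results. The first two are the outputs
# from the question; the third contains CR and LF inside a string value;
# the last is an empty message (all fields at their defaults).
messages = [
    b"\n\x08some key",
    b"\x12\x08some key",
    b"\x08\x01\x12\x02\r\n",
    b"",
]


def write_messages(path, serialized_messages):
    """Write one base-64 encoded message per line (hypothetical helper)."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        for raw in serialized_messages:
            # b64encode never inserts line breaks; base64.encodebytes would.
            f.write(base64.b64encode(raw).decode("ascii"))
            f.write("\n")


def read_messages(path):
    """Read the file back into a list of raw message bytes."""
    with open(path, "r", encoding="ascii") as f:
        # Strip only the line terminator; an empty line is an empty message.
        return [base64.b64decode(line.rstrip("\n")) for line in f]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "messages.b64")
    write_messages(path, messages)
    restored = read_messages(path)

print(restored == messages)
```

Each restored byte string can then be handed to the message's `ParseFromString` method as usual.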

Tags:

protocol-buffers

Jan Kislinger
