BinaryReader ReadString specifying length?

I'm working on a parser to receive UDP information, parse it, and store it. I'm using a BinaryReader, since it will mostly be binary information; some of it will be strings, though. MSDN says this about the ReadString() method:

Reads a string from the current stream. The string is prefixed with the length, encoded as an integer seven bits at a time.

And I completely understand it up until "seven bits at a time", which I tried to simply ignore until I started testing. I'm creating my own byte array, putting it into a MemoryStream, and attempting to read it with a BinaryReader. Here's what I first thought would work:

byte[] data = new byte[] { 3, 0, 0, 0, (byte)'C', (byte)'a', (byte)'t' };
BinaryReader reader = new BinaryReader(new MemoryStream(data));
String str = reader.ReadString();

Knowing an int is 4 bytes (and having toyed around long enough to find out that BinaryReader is little endian), I pass it the length of 3 followed by the corresponding letters. However, str ends up holding \0\0\0: the reader takes only the first byte, 3, as the length, then reads the next three bytes, which are the zeros. If I remove the 3 zeros and just have

byte[] data = new byte[] { 3, (byte)'C', (byte)'a', (byte)'t' };

Then it reads and stores Cat properly. To me this conflicts with the documentation saying that the length is an integer. Now I'm beginning to think they simply mean a number with no decimal places, not the data type int. Does this mean that a BinaryReader can never read a string longer than 127 characters (since 01111111, the largest value that fits in the 7 bits the documentation mentions, is 127)?
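
A quick sanity check, using nothing but BinaryWriter and BitConverter, is to let the writer build the bytes itself and dump them:

using System;
using System.IO;

var ms = new MemoryStream();
using (var writer = new BinaryWriter(ms))
{
    writer.Write("Cat"); // the writer emits the length prefix, then the characters
}
// MemoryStream.ToArray() still works after the stream is closed.
Console.WriteLine(BitConverter.ToString(ms.ToArray())); // prints 03-43-61-74

So for "Cat" the writer produces a single length byte of 3 followed by the three characters, not a 4-byte int.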

I'm writing up a protocol and need to completely understand what I'm getting into before I pass our documentation along to our clients.

asked Oct 31 '13 by Corey Ogburn

2 Answers

I found the source code for BinaryReader. It uses a method called Read7BitEncodedInt(), and after looking up that documentation and the documentation for Write7BitEncodedInt(), I found this:

The integer of the value parameter is written out seven bits at a time, starting with the seven least-significant bits. The high bit of a byte indicates whether there are more bytes to be written after this one. If value will fit in seven bits, it takes only one byte of space. If value will not fit in seven bits, the high bit is set on the first byte and written out. value is then shifted by seven bits and the next byte is written. This process is repeated until the entire integer has been written.
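
In code, the algorithm that documentation describes looks roughly like this (a sketch of the idea, not the actual BCL source; the real Read7BitEncodedInt additionally rejects values encoded in more than 5 bytes):

using System.IO;

static void Write7BitEncodedInt(Stream stream, int value)
{
    uint v = (uint)value; // unsigned, so the right shift doesn't sign-extend
    while (v >= 0x80)
    {
        stream.WriteByte((byte)(v | 0x80)); // low 7 bits, high bit set: more follows
        v >>= 7;
    }
    stream.WriteByte((byte)v); // final byte, high bit clear
}

static int Read7BitEncodedInt(Stream stream)
{
    int result = 0, shift = 0;
    byte b;
    do
    {
        b = (byte)stream.ReadByte();   // (no end-of-stream check in this sketch)
        result |= (b & 0x7F) << shift; // splice the low 7 bits into place
        shift += 7;
    } while ((b & 0x80) != 0);         // high bit set means another byte follows
    return result;
}

ReadString() simply calls Read7BitEncodedInt() to get the byte count of the string data, then reads and decodes that many bytes.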

Also, Ralf found this link that better displays what's going on.

answered by Corey Ogburn


Unless they specifically say 'int' or 'Int32', they just mean an integer as in a whole number.

By "7 bits at a time", they mean that it implements 7-bit length encoding, which seems a bit confusing at first but is actually rather straightforward. Here are some example values and how they are written out using 7-bit length encoding:

/*
decimal value   binary value                ->  enc byte 1   enc byte 2   enc byte 3
85              00000000 00000000 01010101  ->  01010101     n/a          n/a
1,365           00000000 00000101 01010101  ->  11010101     00001010     n/a
349,525         00000101 01010101 01010101  ->  11010101     10101010     00010101
*/

The table above is written big endian for no reason other than that I had to pick one and it's the one I'm most familiar with. The way 7-bit length encoding works, it is little endian by its very nature.

Note that 85 writes out to 1 byte, 1,365 writes out to 2 bytes, and 349,525 writes out to 3 bytes.

Here's the same table using letters to show how each value's bits were used in the written output (dashes are zero-value bits, and the 0s and 1s are what's added by the encoding mechanism to indicate if a subsequent byte is to be written/read)...

/*
decimal value   binary value                ->  enc byte 1   enc byte 2   enc byte 3
85              -------- -------- -AAAAAAA  ->  0AAAAAAA     n/a          n/a
1,365           -------- -----BBB AAAAAAAA  ->  1AAAAAAA     0---BBBA     n/a
349,525         -----CCC BBBBBBBB AAAAAAAA  ->  1AAAAAAA     1BBBBBBA     0--CCCBB
*/

So values in the range of 0 to 2^7-1 (127) will write out as 1 byte, values of 2^7 (128) to 2^14-1 (16,383) will use 2 bytes, 2^14 (16,384) to 2^21-1 (2,097,151) will take 3 bytes, and so on and so forth.
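
You can verify those boundaries directly. Here's a small sketch, assuming ASCII text so that one character encodes to one byte (the prefix actually counts encoded bytes, not characters, so the two differ for non-ASCII text):

using System;
using System.IO;

foreach (int len in new[] { 127, 128, 16383, 16384 })
{
    var ms = new MemoryStream();
    using (var writer = new BinaryWriter(ms))
    {
        writer.Write(new string('x', len)); // length prefix + len bytes of 'x'
    }
    long prefixBytes = ms.Length - len; // whatever isn't payload is the prefix
    Console.WriteLine(len + " chars -> " + prefixBytes + "-byte length prefix");
}
// Output:
// 127 chars -> 1-byte length prefix
// 128 chars -> 2-byte length prefix
// 16383 chars -> 2-byte length prefix
// 16384 chars -> 3-byte length prefix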

answered by dynamichael