
Which encoding to use for reading a string from a file?

I'm parsing a file (which I don't generate) that contains a string. The string is always preceded by 2 bytes which tell me the length of the string that follows.

For example:

05 00 53 70 6F 72 74

would be:

Sport

Using a C# BinaryReader, I read the string using:

string s = new string(binaryReader.ReadChars(size));

Sometimes there's the odd funky character which seems to push the position of the stream on further than it should. For example:

0D 00 63 6F 6F 6B 20 E2 80 94 20 62 6F 6F 6B

Should be:

cook - book

and although it reads fine the stream ends up two bytes further along than it should?! (Which then messes up the rest of the parsing.)

I'm guessing it has something to do with the 0xE2 in the middle, but I'm not really sure why or how to deal with it.

Any suggestions greatly appreciated!

Bridgey asked Jan 19 '23 18:01


1 Answer

My guess is that the string is encoded in UTF-8. The 3-byte sequence E2 80 94 corresponds to the single Unicode character U+2014 (EM DASH).
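If that guess is right, it would also explain the position drift. In the second example the prefix 0D 00 is 13, and exactly 13 bytes follow it, but they decode to only 11 characters because E2 80 94 is a single character. A BinaryReader created without an explicit encoding decodes UTF-8, so ReadChars(13) keeps going until it has 13 characters and consumes two extra bytes. Below is a minimal sketch of one way around this, assuming the 2-byte prefix is a little-endian byte count (which matches both examples in the question) and that the text really is UTF-8. The class and method names are hypothetical, chosen only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original question or answer.
public static class LengthPrefixedString
{
    // Reads a 2-byte little-endian length, then exactly that many bytes,
    // and decodes them as UTF-8 (assumption: the prefix counts bytes,
    // not characters).
    public static string Read(BinaryReader reader)
    {
        // BinaryReader.ReadUInt16 reads in little-endian order.
        ushort byteCount = reader.ReadUInt16();

        // ReadBytes advances the stream by at most byteCount bytes,
        // regardless of how many characters those bytes encode.
        byte[] raw = reader.ReadBytes(byteCount);
        if (raw.Length != byteCount)
        {
            throw new EndOfStreamException(
                "Stream ended before the full string was read.");
        }

        return Encoding.UTF8.GetString(raw);
    }
}
```

For the bytes in the question, LengthPrefixedString.Read(binaryReader) should return an 11-character string and leave the stream 15 bytes past where it started (2 for the prefix plus 13 for the payload), so whatever follows the string is parsed from the right offset.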

Ted Hopp answered Jan 30 '23 10:01