UTF8 Beginning of File characters are breaking serializer & readers

Okay, I'm trying to work with UTF8 text files. I'm constantly fighting the BOM chars that the writer drops in for UTF8, which blows up pretty much anything I need to use to read the file including serializers and other text readers.

I'm getting a leading six bytes of data:

0xEF
0xBB
0xBF
0xEF
0xBB
0xBF

(Now that I'm looking at it, I realize there are two characters there. Is that the UTF-8 BOM? Am I double-encoding it?)

Notice the serializer encodes to UTF8, then the memory stream gets a string as UTF8, then I write the string to the file with UTF8... seems like a lot of redundancy. Thoughts?

//I'm storing this xml result to a database field. (this one includes the BOF chars)
using (MemoryStream ms = new MemoryStream())
{
    Utility.SerializeXml(ms, root);
    xml = Encoding.UTF8.GetString(ms.ToArray());

}


//later on, I would take that xml and then write it out to a file like this: 
File.WriteAllText(path, xml, Encoding.UTF8);



public static void SerializeXml(Stream output, object data)
{
    XmlSerializer xs = new XmlSerializer(data.GetType());
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Indent = true;
    settings.IndentChars = "\t";
    settings.Encoding = Encoding.UTF8;
    XmlWriter writer = XmlWriter.Create(output, settings); // Create is a static factory method declared on XmlWriter
    xs.Serialize(writer, data);
    writer.Flush();
    writer.Close();
}

Nathan asked Nov 20 '09 22:11


3 Answers

Yeah, that's two BOMs. You're encoding to UTF-8 twice and each time adds a pseudo-BOM, due to the extremely unfortunate fact that:

Encoding.UTF8

means “UTF-8 with a pointless, meaningless U+FEFF stuck to the front to screw up your applications”. Try instead using

new UTF8Encoding(false)

which should give you a less sucky version.
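
For illustration, here is a minimal sketch of the question's two steps rewritten around a BOM-free encoder. The class name BomFreeXml and the methods SerializeToString and WriteFile are made up for this example and are not from the original post; the sketch assumes the same XmlSerializer setup as the question and is not tested against the asker's actual types.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper class; the names are illustrative only.
public static class BomFreeXml
{
    // UTF-8 encoding with an empty preamble, so writers emit no BOM.
    private static readonly Encoding Utf8NoBom = new UTF8Encoding(false);

    // Serializes to an XML string that does not start with U+FEFF.
    public static string SerializeToString(object data)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            IndentChars = "\t",
            Encoding = Utf8NoBom
        };

        using (var ms = new MemoryStream())
        {
            // Disposing the writer flushes it; the stream stays readable.
            using (XmlWriter writer = XmlWriter.Create(ms, settings))
            {
                new XmlSerializer(data.GetType()).Serialize(writer, data);
            }
            return Utf8NoBom.GetString(ms.ToArray());
        }
    }

    // Writes the string as UTF-8 without prepending a BOM.
    public static void WriteFile(string path, string xml)
    {
        File.WriteAllText(path, xml, Utf8NoBom);
    }
}
```

The point of sharing one Utf8NoBom instance is that neither the XmlWriter nor File.WriteAllText gets a chance to prepend the preamble, so the file should start directly with the XML declaration.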

bobince answered Nov 01 '22 09:11


Yes, that is a BOM.

Yes, some older JDKs had a bug that blew up on UTF-8 data with a BOM. And two BOMs will confuse even a modern version of Java.

The solution I used was to stick a pushback stream on the front and filter it out.

Or use a more modern version of Java.

bmargulies answered Nov 01 '22 09:11


The byte sequence 0xEF 0xBB 0xBF is the UTF-8 encoding of U+FEFF, which is the Unicode BOM (byte order mark). It is unnecessary in UTF-8, but crucial in UTF-16 or UTF-32.

You've got the same sequence twice.

The only good thing to do with them is ignore and/or delete them.
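
If the data already has BOMs baked in (for example, rows stored in the database before the writer was fixed), deleting them on read could look roughly like this. It is only a sketch: the class BomStripper and both method names are hypothetical, and it assumes the input really is UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical helper class; the names are illustrative only.
public static class BomStripper
{
    // Removes every leading U+FEFF from an already-decoded string.
    public static string StripLeadingBoms(string text)
    {
        return text.TrimStart('\uFEFF');
    }

    // Decodes UTF-8 bytes, skipping any number of leading EF BB BF sequences.
    public static string DecodeWithoutBoms(byte[] bytes)
    {
        int offset = 0;
        while (bytes.Length - offset >= 3
               && bytes[offset] == 0xEF
               && bytes[offset + 1] == 0xBB
               && bytes[offset + 2] == 0xBF)
        {
            offset += 3;
        }
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

Both methods only touch the start of the input, so a U+FEFF that appears later in the text is left alone.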

Jonathan Leffler answered Nov 01 '22 08:11