I have some UTF-8 text in a file utf8.txt. The file contains some characters that are outside the ASCII range. I tried the following code:
var fname = "utf8.txt";
var enc = Encoding.GetEncoding("ISO-8859-1",
    EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
var s = System.IO.File.ReadAllText(fname, enc);
I expected this code to throw an exception, since the file is not valid ISO-8859-1 text. Instead, it correctly decodes the UTF-8 text into the right characters (the string looks correct in the debugger).
Is this a bug in .Net?
EDIT:
The file I tested with originally was UTF-8 with a BOM. If I remove the BOM, the behavior changes: it still does not throw an exception, but it now produces an incorrect Unicode string (the string looks wrong in the debugger).
EDIT:
To produce my test file, run the following code:
var fname = "utf8.txt";
var utf8_bom_e_circumflex_bytes = new byte[] {0xEF, 0xBB, 0xBF, 0xC3, 0xAA};
System.IO.File.WriteAllBytes(fname, utf8_bom_e_circumflex_bytes);
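For reference, decoding those same five bytes by hand (a small sketch, independent of any BOM sniffing that File.ReadAllText might do) shows the two outcomes side by side:

```csharp
using System;
using System.Text;

class DecodeDemo
{
    static void Main()
    {
        byte[] bytes = { 0xEF, 0xBB, 0xBF, 0xC3, 0xAA };

        // As UTF-8: EF BB BF is the BOM (GetString keeps it as U+FEFF),
        // and C3 AA decodes to U+00EA ('ê').
        string asUtf8 = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(asUtf8.Length); // 2

        // As ISO-8859-1: each byte maps 1:1 to U+00EF U+00BB U+00BF U+00C3 U+00AA,
        // i.e. the mojibake "ï»¿Ãª". No exception, just the wrong characters.
        string asLatin1 = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
        Console.WriteLine(asLatin1);
    }
}
```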
EDIT:
I think I have a firm handle on what is going on (although I don't agree with part of .Net's behavior).
If the file starts with a UTF-8 BOM and the data is valid UTF-8, then ReadAllText completely ignores the encoding you passed in and (properly) decodes the file as UTF-8. (I have not tested what happens if the BOM is a lie and the file is not really UTF-8.) I don't agree with this behavior; I think .Net should either throw an exception or use the encoding I gave it.
If the file has no BOM, .Net has no trivial (and 100% reliable) way to determine that the text is not really ISO-8859-1, since most (all?) UTF-8 text is also valid ISO-8859-1, although gibberish. So it just follows your instructions and decodes the file with the encoding you gave it. (I do agree with this behavior)
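If you want ReadAllText-like behavior that honors the encoding you pass even when a BOM is present, one workaround (a sketch using StreamReader's detectEncodingFromByteOrderMarks parameter, which disables the BOM sniffing) is:

```csharp
using System;
using System.IO;
using System.Text;

class ReadWithoutSniffing
{
    static void Main()
    {
        var enc = Encoding.GetEncoding("ISO-8859-1",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        // With detectEncodingFromByteOrderMarks: false, StreamReader uses the
        // encoding you supplied even if the file starts with a UTF-8 BOM.
        using (var reader = new StreamReader("utf8.txt", enc,
                   detectEncodingFromByteOrderMarks: false))
        {
            // For the test file above this yields "ï»¿Ãª": the BOM bytes and
            // the two UTF-8 bytes, each decoded as a separate Latin-1 character.
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
```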
should throw an exception, since it is not valid ISO-8859-1 text
In ISO-8859-1 all possible bytes have mappings to characters, so no exception will ever result from reading a non-ISO-8859-1 file as ISO-8859-1.
(True, all the bytes in the range 0x80–0x9F will become invisible control codes that you never want, but they're still valid, just useless. This is true of quite a few of the ISO-8859 encodings, which put the C1 control codes in the range 0x80–0x9F, but not all. You can certainly get an exception with other encodings that leave bytes unmapped, e.g. Windows-1252.)
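To make that concrete, here is a small sketch showing that even the exception-throwing decoder from the question accepts every possible byte under ISO-8859-1:

```csharp
using System;
using System.Text;

class Latin1AcceptsEverything
{
    static void Main()
    {
        // The same strict decoder as in the question: it would throw
        // a DecoderFallbackException on any unmappable byte.
        var strict = Encoding.GetEncoding("ISO-8859-1",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        // All 256 possible byte values.
        var allBytes = new byte[256];
        for (int i = 0; i < 256; i++) allBytes[i] = (byte)i;

        // Every byte 0x00-0xFF maps directly to U+0000-U+00FF,
        // so no exception can ever occur.
        string s = strict.GetString(allBytes);
        Console.WriteLine(s.Length); // 256
    }
}
```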
If the file starts with UTF-8 BOM, and the data is valid UTF-8, then ReadAllText will completely ignore the encoding you passed in and (properly) decode the file as UTF-8.
Yep. This is hinted at in the doc:
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks.
I agree with you that this behaviour is pretty stupid. I would prefer to read the bytes with File.ReadAllBytes and decode them manually with Encoding.GetString.
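A minimal sketch of that approach (the helper name ReadAllTextStrict is mine, not a framework API):

```csharp
using System.IO;
using System.Text;

static class StrictFile
{
    // Decodes the file with exactly the encoding you pass: no BOM detection,
    // and (if the encoding was created with DecoderFallback.ExceptionFallback)
    // a DecoderFallbackException on any invalid byte sequence.
    public static string ReadAllTextStrict(string path, Encoding encoding)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return encoding.GetString(bytes);
    }
}
```

To actually reject non-UTF-8 input, call it with a strict UTF-8 encoding, e.g. `new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true)`. Note that GetString keeps a leading BOM as U+FEFF rather than stripping it.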