
Invalid characters in File.ReadAllText

I'm calling File.ReadAllText() in a program designed to format some files that I have.

Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains (65533) symbols where the ® (174) should be.

What would cause this and how can I fix it?

mrK asked Mar 18 '13 15:03


2 Answers

Most likely the file uses a different encoding than the default. If you know which one, you can specify it using the File.ReadAllText(String, Encoding) overload.

Code sample:

string readText = File.ReadAllText(path, Encoding.Default);  // <-- change the encoding to whatever the encoding really is

If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
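If you cannot find out the encoding, one common heuristic is to try strict UTF-8 first and fall back to a guessed single-byte encoding only when the bytes are not valid UTF-8. The sketch below is not from the linked question; `TextFileReader` and `ReadTextWithFallback` are hypothetical names, and the fallback encoding is an assumption you have to supply yourself.

```csharp
using System;
using System.IO;
using System.Text;

static class TextFileReader
{
    // Hypothetical helper (not part of .NET): strict UTF-8 first, then a caller-chosen fallback.
    public static string ReadTextWithFallback(string path, Encoding fallback)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD (65533) for bytes it cannot decode.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            string text = strictUtf8.GetString(bytes);

            // GetString keeps a UTF-8 byte-order mark as U+FEFF, so drop it if present.
            return text.StartsWith("\uFEFF", StringComparison.Ordinal)
                ? text.Substring(1)
                : text;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: decode with the fallback the caller guessed.
            return fallback.GetString(bytes);
        }
    }
}
```

A call might look like `TextFileReader.ReadTextWithFallback(path, Encoding.GetEncoding(28591))`. Bear in mind that a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so the fallback never fails even when the guess is wrong, and this sketch does not handle UTF-16 or UTF-32 files.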

David answered Oct 06 '22 22:10


You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it isn't. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. I believe it will also detect other Unicode encodings, such as UTF-16 and UTF-32, when the file starts with an appropriate byte-order mark.)
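To see where the 65533 comes from, here is a minimal sketch that decodes the same two bytes both ways. It assumes the file stores ® as the single byte 0xAE (174), as ISO-8859-1 does; the question does not confirm the actual encoding, and `ReplacementCharDemo` and `CodeUnits` are names made up for the illustration.

```csharp
using System;
using System.Text;

static class ReplacementCharDemo
{
    // Decodes raw bytes with the given encoding and returns each UTF-16 code unit as a number.
    public static int[] CodeUnits(byte[] raw, Encoding encoding)
    {
        string text = encoding.GetString(raw);
        int[] units = new int[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            units[i] = text[i];
        }
        return units;
    }

    static void Main()
    {
        // 'A' followed by the byte 0xAE, which is ® in ISO-8859-1 (assumed file encoding).
        byte[] raw = { 0x41, 0xAE };

        // A lone 0xAE is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD.
        Console.WriteLine(string.Join(",", CodeUnits(raw, Encoding.UTF8)));                // 65,65533

        // Code page 28591 is ISO-8859-1, where every byte maps to the same code point.
        Console.WriteLine(string.Join(",", CodeUnits(raw, Encoding.GetEncoding(28591))));  // 65,174
    }
}
```

Once the text has been decoded to 65533 the original byte is gone, so the fix has to happen when the file is read, not afterwards on the string.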

The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.

For example:

Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);

It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
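The same applies to any stream, not just files. As a rough sketch, with a MemoryStream standing in for a network stream and ISO-8859-1 assumed purely for illustration (`StreamDecodingSketch` and `ReadAll` are hypothetical names):

```csharp
using System;
using System.IO;
using System.Text;

static class StreamDecodingSketch
{
    // Reads the whole stream as text using the encoding the caller says the bytes are in.
    public static string ReadAll(Stream source, Encoding encoding)
    {
        // Passing the encoding explicitly avoids StreamReader's UTF-8 default.
        using (var reader = new StreamReader(source, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Stand-in for bytes arriving from a socket or HTTP response body.
        using (var stream = new MemoryStream(new byte[] { 0x41, 0xAE }))
        {
            string text = ReadAll(stream, Encoding.GetEncoding(28591));
            Console.WriteLine((int)text[1]); // 174, i.e. ®
        }
    }
}
```

For network data the encoding usually comes from the protocol itself, for example a charset declared alongside the content, so that is the first place to look.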

Jon Skeet answered Oct 07 '22 00:10