Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

StreamReader is unable to correctly read extended character set (UTF8)

I am having an issue where I am unable to read a file that contains foreign characters. The file, I have been told, is encoded in UTF-8 format.

Here is the core of my code:

using (FileStream fileStream = fileInfo.OpenRead())
{
    using (StreamReader reader = new StreamReader(fileStream, System.Text.Encoding.UTF8))
    {
        string line;

        while (!string.IsNullOrEmpty(line = reader.ReadLine()))
        {
            hashSet.Add(line);
        }
    }
}

The file contains the word "achôcre" but when examining it during debugging it is adding it as "ach�cre".

(This is a profanity file so I apologize if you speak French. I for one, have no idea what that means)

like image 634
PolandSpring Avatar asked Jul 11 '11 23:07

PolandSpring


1 Answers

The evidence clearly suggests that the file is not in UTF-8 format. Try System.Text.Encoding.Default and see if you get the correct text then — if you do, you know the file is in Windows-1252 (assuming that is your system default codepage). In that case, I recommend that you open the file in Notepad, then re-“Save As” it as UTF-8, and then you can use Encoding.UTF8 normally.

Another way to check what encoding the file is actually in is to open it in your browser. If the accents display correctly, then the browser has detected the correct character set — so look at the “View / Character set” menu to find out which one is selected. If the accents are not displaying correctly, then change the character set via that menu until they do.

like image 158
Timwi Avatar answered Sep 28 '22 03:09

Timwi