
Read txt files (in unicode and utf8) by means of C#

I created two text files in Windows Notepad with the same content, "thank you - спасибо", and saved one as UTF-8 and the other as Unicode. In Notepad they look fine. Then I tried to read them using .NET:

...File.ReadAllText(utf8FileFullName, Encoding.UTF8);

and

...File.ReadAllText(unicodeFileFullName, Encoding.Unicode);

But in both cases I got this "thank you - ???????". What's wrong?

Update: code for UTF-8

static void Main(string[] args)
{
    var encoding = Encoding.UTF8;
    var file = new FileInfo(@"D:\encodes\enc.txt");
    Console.OutputEncoding = encoding;
    var content = File.ReadAllText(file.FullName, encoding);
    Console.WriteLine("encoding: " + encoding);
    Console.WriteLine("content: " + content);
    Console.ReadLine();
}

Result: thanks ÑпаÑибо

mtkachenko asked Sep 18 '13 11:09


3 Answers

Edited: UTF-8 should support these characters. It seems that you're writing to a console, or to some other output, whose encoding hasn't been set. If so, you need to set the encoding. For the console, you can do this:

string allText = File.ReadAllText(utf8FileFullName, Encoding.UTF8);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(allText);

keyboardP answered Oct 21 '22 23:10


Use Encoding.Default:

File.ReadAllText(unicodeFileFullName, Encoding.Default);

It will fix the ???? characters.

alireza amini answered Oct 21 '22 21:10


When writing Unicode or UTF-8 encoded multi-byte characters to the console, you need to set the output encoding and also make sure the console uses a font that contains glyphs for those characters. With your existing code, MessageBox.Show(content), or displaying the text on a Windows Form or Web Form, would show it correctly.

Have a look at http://msdn.microsoft.com/en-us/library/system.console.aspx for an explanation on setting fonts within the console window.

"Support for Unicode characters requires the encoder to recognize a particular Unicode character, and also requires a font that has the glyphs needed to render that character. To successfully display Unicode characters to the console, the console font must be set to a non-raster or TrueType font such as Consolas or Lucida Console."

As a side note, you can use the FileStream class to read the first three bytes of the file and look for the byte order mark indicator to automatically set the encoding when reading the file. For example, if byte[0] == 0xEF && byte[1] == 0xBB && byte[2] == 0xBF then you have a UTF-8 encoded file. Refer to http://en.wikipedia.org/wiki/Byte_order_mark for more information.
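The byte order mark check described above could be sketched roughly like this. The names BomSniffer, DetectFromBom and ReadAllTextByBom are made up for illustration, and the fallback encoding is an assumption you would pick yourself. Only the UTF-8 and UTF-16 marks are checked; UTF-32 is deliberately ignored to keep the example short.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a byte order mark at the start of
    // 'head', or null when no known mark is present.
    // Note: UTF-32 marks are not handled in this sketch.
    public static Encoding DetectFromBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;             // UTF-8
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16, little-endian
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16, big-endian
        return null;
    }

    // Reads the first bytes of the file with a FileStream, picks the encoding
    // from the byte order mark (or uses 'fallback'), then reads the whole file.
    public static string ReadAllTextByBom(string path, Encoding fallback)
    {
        byte[] head = new byte[3];
        int read;
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            read = stream.Read(head, 0, head.Length);
        }
        Array.Resize(ref head, read); // the file may be shorter than 3 bytes

        Encoding encoding = DetectFromBom(head) ?? fallback;
        return File.ReadAllText(path, encoding);
    }
}
```

It would be called as BomSniffer.ReadAllTextByBom(file.FullName, Encoding.UTF8). According to its documentation, File.ReadAllText also tries to detect the encoding from byte order marks by itself, so this mainly shows what the check looks like; a file saved without a mark still needs the right encoding passed explicitly.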

Warren Rox answered Oct 21 '22 22:10