
Read a file with unicode characters

I have an ASP.NET C# page and am trying to read a file that contains the character ’ and convert it to ' (from a slanted apostrophe to a straight apostrophe).

FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);

//strip out bad characters
content = content.Replace("’", "'");

This doesn't work and it changes the slanted apostrophes into ? marks.

chris asked Apr 27 '11 00:04



2 Answers

I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you. However, examining content showed that the .NET Framework believed the character was Unicode character 65533 (U+FFFD, the replacement character, i.e. the "WTF?" character) before the string replacement was even applied. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:

content[0]; // 65533 '�'

The reason why the replace isn't working is simple - content doesn't contain the string you gave it:

content.IndexOf("’"); // -1
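
If the debugger is not handy, the same check can be done in code. The following is only a minimal sketch: the class name CharInspector and the method DescribeNonAscii are hypothetical helpers made up for illustration, not part of the framework. It lists the position and code point of every non-ASCII character, so a stray U+FFFD shows up immediately.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not a framework API): reports non-ASCII characters in a string.
public static class CharInspector
{
    // Returns one entry per non-ASCII UTF-16 code unit, formatted as "index: U+XXXX".
    public static List<string> DescribeNonAscii(string text)
    {
        var found = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] > 127)
            {
                found.Add($"{i}: U+{(int)text[i]:X4}");
            }
        }
        return found;
    }
}
```

Calling CharInspector.DescribeNonAscii(content) on a string that was decoded incorrectly should report U+FFFD at the position where the apostrophe was expected.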

As for why the file reading isn't working properly: you are probably using the wrong encoding when reading the file. (If no encoding is specified, File.ReadAllText looks for a byte order mark and otherwise assumes UTF-8. There is no 100% reliable way to guess an encoding, so a file saved in a single-byte code page is often decoded incorrectly.) The exact encoding you need depends on the file itself. In my case the file used an "Extended ASCII" single-byte encoding, so to read it I just needed to specify that encoding explicitly:

string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));

(See this question).
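
To make the effect of the encoding choice concrete, here is a small sketch that decodes the single byte 0x92 (how Windows-1252 stores ’) in three ways. The class and method names are hypothetical. It also assumes a modern .NET runtime, where code page 1252 has to be registered through CodePagesEncodingProvider; on the .NET Framework, where that code page is built in, the registration line can be removed.

```csharp
using System.Text;

// Hypothetical demo class: shows how one byte decodes under different encodings.
public static class QuoteByteDemo
{
    // Decodes a single byte and returns the resulting UTF-16 code unit as a number.
    public static int DecodeSingleByte(byte value, Encoding encoding)
    {
        return encoding.GetString(new byte[] { value })[0];
    }

    // 0x92 on its own is not valid UTF-8, so the decoder substitutes U+FFFD.
    public static int AsUtf8(byte value)
    {
        return DecodeSingleByte(value, Encoding.UTF8);
    }

    // ISO-8859-1 maps every byte to the code point with the same value.
    public static int AsLatin1(byte value)
    {
        return DecodeSingleByte(value, Encoding.GetEncoding("iso-8859-1"));
    }

    // Windows-1252 maps 0x92 to the right single quotation mark.
    public static int AsWindows1252(byte value)
    {
        // Assumption: needed on .NET Core / .NET 5+, not on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return DecodeSingleByte(value, Encoding.GetEncoding(1252));
    }
}
```

For the byte 0x92, AsUtf8 gives 65533 (U+FFFD), AsLatin1 gives 146 (U+0092) and AsWindows1252 gives 8217 (U+2019), which is why the character to search for depends on how the file was read.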

You also need to make sure that you specify the correct character in your replacement string. When using "odd" characters in code you may find it more reliable to specify the character by its character code rather than as a literal character, since a literal can break if the encoding of the source file changes. Note that iso-8859-1 decodes the byte 0x92 to U+0092 rather than to U+2019 (the ’ character), so U+0092 is the character to search for after reading with that encoding. For example, the following worked for me:

content = content.Replace("\u0092", "'");
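
Because the code point of the curly quote depends on the encoding used to read the file (U+0091/U+0092 after iso-8859-1, U+2018/U+2019 after Windows-1252 or UTF-8), one option is to replace all four. This is a sketch only; QuoteNormalizer and StraightenSingleQuotes are hypothetical names chosen for illustration.

```csharp
// Hypothetical helper: converts curly single quotes to straight ones.
public static class QuoteNormalizer
{
    public static string StraightenSingleQuotes(string text)
    {
        return text
            .Replace('\u2018', '\'')   // left single quotation mark
            .Replace('\u2019', '\'')   // right single quotation mark
            .Replace('\u0091', '\'')   // 0x91 as decoded by iso-8859-1
            .Replace('\u0092', '\'');  // 0x92 as decoded by iso-8859-1
    }
}
```

Usage would then be content = QuoteNormalizer.StraightenSingleQuotes(content);. It cannot repair text that already contains U+FFFD, since the original byte is lost at that point.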

Justin answered Nov 15 '22 18:11


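// This should replace smart single quotes with a straight single quote.
// Regex.Replace returns a new string, so the result has to be assigned back.
// (Requires: using System.Text.RegularExpressions;)

content = Regex.Replace(content, @"[\u2018\u2019]", "'");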

// However, the better approach seems to be to read the file with the proper encoding and leave the quotes alone.
// OpenRead() opens the existing file; Create() would overwrite it with an empty one.
var sreader = new StreamReader(fileinfo.OpenRead(), Encoding.GetEncoding(1252));
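
A fuller version of that idea might look like the sketch below, which reads the whole file as Windows-1252 and disposes the reader afterwards. Cp1252FileReader and ReadAll are hypothetical names, and the RegisterProvider call is an assumption that only applies to .NET Core / .NET 5+; on the .NET Framework it can be dropped.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper: reads a text file that was saved as Windows-1252.
public static class Cp1252FileReader
{
    public static string ReadAll(string path)
    {
        // Assumption: needed on .NET Core / .NET 5+, not on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The using block closes the file even if reading throws.
        using (var reader = new StreamReader(path, Encoding.GetEncoding(1252)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the file decoded this way, the apostrophe arrives as U+2019 and can either be left alone or replaced afterwards.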

Trey Carroll answered Nov 15 '22 18:11