
C# - Detecting encoding in a file, write change to file using the found encoding

I wrote a small program that iterates through a lot of files and applies some changes wherever a certain string match is found. The problem is that different files have different encodings, so I would like to detect each file's encoding and then overwrite the file in its original encoding.

What would be the prettiest way of doing that in C# on .NET 2.0?

My code is very simple as of now:

String f1 = File.ReadAllText(fileList[i]).ToLower();

if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, Encoding.Unicode);
}

I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.

Would greatly appreciate any help here.

cc0 asked Dec 08 '10 09:12



2 Answers

Unfortunately, encoding is one of those subjects where there is not always a definitive answer. In many cases it is much closer to guessing the encoding than detecting it. Raymond Chen wrote an excellent blog post on this subject that is worth the read:

  • http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx

The gist of the article is:

  • If the BOM (byte order mark) exists then you're golden
  • Else it's guesswork and heuristics
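To make the BOM case concrete, here is a minimal sketch of checking the first few bytes by hand. The class and method names (BomSniffer, DetectFromBom) are made up for illustration and are not from the article or the framework; the code sticks to APIs that exist in .NET 2.0. It returns null when there is no BOM, which is exactly the point where the guesswork starts.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class BomSniffer
{
    // Reads up to four bytes from the file and maps a leading BOM to an encoding.
    public static Encoding DetectFromBom(string path)
    {
        byte[] head = new byte[4];
        int count;
        using (FileStream fs = File.OpenRead(path))
        {
            count = fs.Read(head, 0, head.Length);
        }
        return DetectFromBom(head, count);
    }

    // Returns null when no recognised BOM is present.
    public static Encoding DetectFromBom(byte[] head, int count)
    {
        // UTF-32 LE (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE),
        // so it has to be tested first. A UTF-16 LE file whose first
        // character is U+0000 would be misread here; that is assumed rare.
        if (count >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return new UTF32Encoding(false, true);   // UTF-32 little endian
        if (count >= 4 && head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return new UTF32Encoding(true, true);    // UTF-32 big endian
        if (count >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return new UTF8Encoding(true);           // UTF-8 with BOM
        if (count >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return new UnicodeEncoding(false, true); // UTF-16 little endian
        if (count >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return new UnicodeEncoding(true, true);  // UTF-16 big endian
        return null;
    }
}
```

Something like BomSniffer.DetectFromBom(fileList[i]) would then give either an encoding you can trust or null, in which case you have to fall back on a heuristic or a default.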

However, I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you rather than reinventing the wheel. It only requires a very slight modification to your sample.

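String f1;
Encoding encoding;
// StreamReader checks for a BOM by default and falls back to UTF-8.
using (var reader = new StreamReader(fileList[i])) {
  // Note: ToLower() means the lower-cased text is what gets written back.
  f1 = reader.ReadToEnd().ToLower();
  // CurrentEncoding is only reliable after the first read.
  encoding = reader.CurrentEncoding;
}

if (f1.Contains(oPath))
{
  f1 = f1.Replace(oPath, nPath);
  // Write the file back using the encoding the reader settled on.
  File.WriteAllText(fileList[i], f1, encoding);
}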
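One caveat with the snippet above, as far as I know: when a file has no BOM, CurrentEncoding is the reader's UTF-8 default, and File.WriteAllText with that encoding will prepend a UTF-8 BOM the file never had. The following is a rough sketch of one way around that, not a definitive implementation. It remembers whether the original bytes started with the encoding's preamble and only writes the preamble back in that case. SameEncodingWriter and Rewrite are hypothetical names, and the sketch does not lower-case the text.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class SameEncodingWriter
{
    // Replaces oldText with newText and writes the file back with the
    // detected encoding, keeping a BOM only if the original file had one.
    public static void Rewrite(string path, string oldText, string newText)
    {
        byte[] original = File.ReadAllBytes(path);
        string text;
        Encoding encoding;
        using (StreamReader reader = new StreamReader(new MemoryStream(original)))
        {
            text = reader.ReadToEnd();          // a leading BOM is stripped from the text
            encoding = reader.CurrentEncoding;  // valid only after the first read
        }

        if (!text.Contains(oldText)) return;
        text = text.Replace(oldText, newText);

        byte[] preamble = encoding.GetPreamble();
        bool hadBom = StartsWith(original, preamble);
        byte[] body = encoding.GetBytes(text);  // GetBytes never emits a BOM

        using (FileStream fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            if (hadBom) fs.Write(preamble, 0, preamble.Length);
            fs.Write(body, 0, body.Length);
        }
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (prefix.Length == 0 || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

This still inherits the guessing problem: a BOM-less ANSI file is decoded as UTF-8 here, so any bytes that are not valid UTF-8 would be lost on the way back out.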
JaredPar answered Oct 24 '22 15:10



By default, .NET uses UTF-8. Detecting the character encoding is hard because, most of the time, .NET will simply read the file as UTF-8. I always have problems with ANSI files.

My trick is to read the file as a stream, force it to be decoded as UTF-8, and look for the usual characters that should appear in the text. If they are found, the file is UTF-8; otherwise it is ANSI. I then tell the user that only two encodings are supported, ANSI or UTF-8. Auto-detection does not work well for my language :p
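A closely related variant of this trick, sketched below under my own assumptions, skips the search for expected characters and simply asks a strict UTF-8 decoder whether the bytes are valid. If decoding throws, the data is treated as ANSI. Utf8OrAnsi and Guess are hypothetical names, and the fallback encoding is passed in by the caller (on the classic .NET Framework, Encoding.Default is the system ANSI code page).

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class Utf8OrAnsi
{
    // Returns a UTF-8 encoding if the bytes decode cleanly as UTF-8,
    // otherwise the caller-supplied fallback (e.g. Encoding.Default for ANSI).
    public static Encoding Guess(byte[] data, Encoding fallback)
    {
        // throwOnInvalidBytes = true makes the decoder reject malformed sequences
        // instead of silently substituting replacement characters.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            return fallback;
        }

        // Mirror the input: only emit a BOM on write if the data started with one.
        bool hasBom = data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
        return new UTF8Encoding(hasBom);
    }
}
```

Usage would be along the lines of Utf8OrAnsi.Guess(File.ReadAllBytes(fileList[i]), Encoding.Default). Pure ASCII files count as UTF-8 under this check, which is harmless because the bytes are the same either way, and a short ANSI file can still pass as valid UTF-8 by coincidence, so this remains a heuristic.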

Bonshington answered Oct 24 '22 13:10
