 

Non-English characters not preserved when rewriting text

I have an issue on a customer's site where lines containing words like "HabitaþÒo" get mangled on output. I'm processing a text file (pulling out selected lines and writing them to another file).

For diagnosis, I've boiled the problem down to a file containing just that bad word.

The original file contains no BOM, but .NET chooses to read it as UTF-8.

When read and written, the word ends up looking like this: "Habita��o".

A hex dump of the BadWord.txt file looks like this:

[screenshot: hex dump of BadWord.txt]

Copying the file with this code

using (var reader = new StreamReader(@"C:\BadWord.txt"))
using (var writer = new StreamWriter(@"C:\BadWordReadAndWritten.txt"))
    writer.WriteLine(reader.ReadLine());

. . . gives . . .

[screenshot: hex dump of the copied file]

Preserving the reader's encoding doesn't help either:

using (var reader = new StreamReader(@"C:\BadWord.txt"))
using (var writer = new StreamWriter(@"C:\BadWordReadAndWritten_PreseveEncoding.txt", false, reader.CurrentEncoding))
    writer.WriteLine(reader.ReadLine());

. . . gives . . .

[screenshot: hex dump of the output file, same mangled result]
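The symptom can be reproduced from the raw bytes alone, without the customer file. The sketch below (temp-file path is illustrative) writes the bytes shown in the hex dump and reads them back the way `StreamReader` does by default:

```csharp
using System;
using System.IO;

class ReproduceMangling
{
    static void Main()
    {
        // The bytes from the hex dump: "Habita" + 0xE7 + 0xE3 + "o", no BOM
        byte[] bytes = { 0x48, 0x61, 0x62, 0x69, 0x74, 0x61, 0xE7, 0xE3, 0x6F };
        string path = Path.Combine(Path.GetTempPath(), "BadWord.txt");
        File.WriteAllBytes(path, bytes);

        // With no BOM to detect, StreamReader falls back to UTF-8.
        // 0xE7 and 0xE3 are UTF-8 lead bytes with no valid continuation,
        // so each decodes to the replacement character U+FFFD ('�').
        using (var reader = new StreamReader(path))
        {
            string line = reader.ReadLine();
            Console.WriteLine(reader.CurrentEncoding.WebName); // utf-8
            Console.WriteLine(line);                           // Habita��o
        }
    }
}
```

This also shows why the second attempt above changes nothing: `reader.CurrentEncoding` reports the fallback UTF-8, not the file's real encoding, so writing with it just re-encodes the already-corrupted string.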

Any ideas what's going on here? How can I process this file and preserve the original text?

asked Feb 18 '23 by Binary Worrier


1 Answer

The only way to do this is to read the file in the same encoding it was written in. In this case that means Windows-1252:

Encoding enc = Encoding.GetEncoding(1252);
string correctText = File.ReadAllText(@"C:\BadWord.txt", enc);
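A full round-trip then reads and writes with that same encoding, so the offending bytes survive unchanged. A minimal sketch (the output path is illustrative); note that on .NET Core / .NET 5+ the code-page encodings are not available until you register a provider:

```csharp
using System.IO;
using System.Text;

class CopyPreservingEncoding
{
    static void Main()
    {
        // .NET Core / .NET 5+ only: code-page encodings live in the
        // System.Text.Encoding.CodePages package and must be registered first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding enc = Encoding.GetEncoding(1252); // Windows-1252

        // Decode with the real encoding, then encode the output the same way,
        // so 0xE7 ('ç') and 0xE3 ('ã') survive the round trip byte-for-byte.
        string correctText = File.ReadAllText(@"C:\BadWord.txt", enc);
        File.WriteAllText(@"C:\BadWordCopy.txt", correctText, enc);
    }
}
```

On .NET Framework, `Encoding.GetEncoding(1252)` works out of the box, which matches the answer's snippet as written.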
answered Mar 04 '23 by Esailija