 

Non-English characters not preserved when rewriting text

I have an issue on a customer's site where lines containing words like "HabitaþÒo" get mangled on output. I'm processing a text file (pulling out selected lines and writing them to another file).

For diagnosis, I've boiled the problem down to a file containing just that bad word.

The original file contains no BOM, but .NET chooses to read it as UTF-8.

When read and written, the word ends up looking like this: "Habita��o".

A hex dump of the BadWord.txt file looks like this:

[screenshot: hex dump of BadWord.txt]

Copying the file with this code

using (var reader = new StreamReader(@"C:\BadWord.txt"))
using (var writer = new StreamWriter(@"C:\BadWordReadAndWritten.txt"))
    writer.WriteLine(reader.ReadLine());

. . . gives . . .

[screenshot: hex dump of the copied file]

Preserving the reader's encoding doesn't help either:

using (var reader = new StreamReader(@"C:\BadWord.txt"))
using (var writer = new StreamWriter(@"C:\BadWordReadAndWritten_PreseveEncoding.txt", false, reader.CurrentEncoding))
    writer.WriteLine(reader.ReadLine());

. . . gives . . .

[screenshot: hex dump of the output file, same mangled result]
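The symptom can be reproduced from the raw bytes alone, without the customer file. The sketch below (temp-file path is illustrative) writes the bytes shown in the hex dump and reads them back the way `StreamReader` does by default:

```csharp
using System;
using System.IO;

class ReproduceMangling
{
    static void Main()
    {
        // The bytes from the hex dump: "Habita" + 0xE7 + 0xE3 + "o", no BOM
        byte[] bytes = { 0x48, 0x61, 0x62, 0x69, 0x74, 0x61, 0xE7, 0xE3, 0x6F };
        string path = Path.Combine(Path.GetTempPath(), "BadWord.txt");
        File.WriteAllBytes(path, bytes);

        // With no BOM to detect, StreamReader falls back to UTF-8.
        // 0xE7 and 0xE3 are UTF-8 lead bytes with no valid continuation,
        // so each decodes to the replacement character U+FFFD ('�').
        using (var reader = new StreamReader(path))
        {
            string line = reader.ReadLine();
            Console.WriteLine(reader.CurrentEncoding.WebName); // utf-8
            Console.WriteLine(line);                           // Habita��o
        }
    }
}
```

This also shows why the second attempt above changes nothing: `reader.CurrentEncoding` reports the fallback UTF-8, not the file's real encoding, so writing with it just re-encodes the already-corrupted string.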

Any ideas what's going on here? How can I process this file and preserve the original text?

asked Feb 18 '23 by Binary Worrier


1 Answer

The only way to do this is to read the file in the same encoding it was written in. In this case that means Windows-1252:

Encoding enc = Encoding.GetEncoding(1252);
string correctText = File.ReadAllText(@"C:\BadWord.txt", enc);
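A full round-trip then reads and writes with that same encoding, so the offending bytes survive unchanged. A minimal sketch (the output path is illustrative); note that on .NET Core / .NET 5+ the code-page encodings are not available until you register a provider:

```csharp
using System.IO;
using System.Text;

class CopyPreservingEncoding
{
    static void Main()
    {
        // .NET Core / .NET 5+ only: code-page encodings live in the
        // System.Text.Encoding.CodePages package and must be registered first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding enc = Encoding.GetEncoding(1252); // Windows-1252

        // Decode with the real encoding, then encode the output the same way,
        // so 0xE7 ('ç') and 0xE3 ('ã') survive the round trip byte-for-byte.
        string correctText = File.ReadAllText(@"C:\BadWord.txt", enc);
        File.WriteAllText(@"C:\BadWordCopy.txt", correctText, enc);
    }
}
```

On .NET Framework, `Encoding.GetEncoding(1252)` works out of the box, which matches the answer's snippet as written.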
answered Mar 04 '23 by Esailija