Converting unknown characters to Greek characters

Tags: c#, encoding

I have a file which contains the following characters:

ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ

I am trying to convert that to Greek words and the result should be:

ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ

The file in which this value is stored is in Unicode format.

I have tried every available encoding, but none of them converts the text correctly.

private void Convert()
{
    string textFilePhysicalPath = @"C:\Users\Nec\Desktop\a.txt";
    string contents = File.ReadAllText(textFilePhysicalPath);

    List<string> sLines = new List<string>();
    // For every available encoding, round-trip the text and record the result.
    foreach (EncodingInfo ei in Encoding.GetEncodings())
    {
        Encoding iso = ei.GetEncoding();
        Encoding utfx = Encoding.Unicode;
        byte[] utfBytes = utfx.GetBytes(contents);
        byte[] isoBytes = Encoding.Convert(utfx, iso, utfBytes);
        string msg = iso.GetString(isoBytes);

        sLines.Add(ei.Name + " " + msg);
    }

    // Write one line per encoding so the results can be compared.
    using (StreamWriter file = new StreamWriter(@"C:\Users\Nec\Desktop\result.txt"))
    {
        foreach (var line in sLines)
            file.WriteLine(line);
    }
}

A website that converts it correctly is http://www.online-decoder.com/el, but even when I convert from ISO-8859-1 to ISO-8859-7, it still doesn't work in .NET.

alwaysVBNET asked Jul 11 '18 12:07



2 Answers

This is a single-byte (8-bit) text file saved using the Greek codepage (Windows-1253) that was then read using a different codepage.

File.ReadAllText tries to detect whether the file is UTF-16 or UTF-8 by checking for a byte order mark (BOM) and falls back to UTF-8 when none is found. UTF-8 matches 7-bit ASCII only for bytes below 0x80; every byte above that must belong to a valid multi-byte sequence. Reading a file like this, which is neither Unicode nor plain ASCII, therefore results in garbled text.

To load a file using a specific encoding or codepage, pass it as the encoding parameter, e.g.:

var enc = Encoding.GetEncoding(1253);
var text = File.ReadAllText(@"189.dat", enc);
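One caveat for newer runtimes: on .NET Core and .NET 5+, `Encoding.GetEncoding(1253)` throws a `NotSupportedException` until the code-pages provider has been registered, whereas .NET Framework ships the legacy codepages by default. A minimal sketch, assuming a modern .NET target (the `GreekFileReader` class and `ReadGreek` method are hypothetical names used only for illustration):

```csharp
using System.IO;
using System.Text;

public static class GreekFileReader
{
    // Reads a file saved in Windows-1253 (Greek) into a .NET string.
    public static string ReadGreek(string path)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as 1253 are only available after this registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding greek = Encoding.GetEncoding(1253);
        return File.ReadAllText(path, greek);
    }
}
```

Calling `GreekFileReader.ReadGreek(@"189.dat")` should then return the same text as the two-line snippet above.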

Strings in .NET are Unicode, specifically UTF-16, so once the text has been read with the correct encoding it needs no further conversion. Its contents will be:

'CS','C.S.F.  EXAMINATION','ΕΞΕΤΑΣΗ  Ε.Ν.Υ.'
'EH','Hb ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ'
'EP','PROTEIN ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ'
'FB','HAEMATOLOGY - FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC'
'FR','FREE TEXT',
'GT','GLUCOSE TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ'
'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ ΔΕΛΤΙΟ'
'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ'
'SE','SEMEN ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ'
'SP','SPECIAL PATHOLOGY','SPECIAL PATHOLOGY'
'ST','STOOL EXAMINATION                                 ','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ'
'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL'
'UR','URINE ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ'
'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ ΝΕΡΟΥ'
'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'

UTF-16 uses at least two bytes for every character. If a UTF-16 file containing mostly Latin text were opened in a hex editor, every other byte would be a NUL (0x00). It's not UTF-8 either: outside the 7-bit ASCII range each character uses two or more bytes, all of which have the high bit set, so there would be at least two garbled characters in place of each Greek letter instead of one.
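The checks described above can be sketched in code. This is only a rough heuristic, not how File.ReadAllText is implemented: it looks for a UTF-16 or UTF-8 BOM, then tests whether the bytes form valid UTF-8, and otherwise assumes a single-byte codepage. The `EncodingSniffer` class and its return values are hypothetical, and the sketch does not handle UTF-32 or BOM-less UTF-16.

```csharp
using System.Text;

public static class EncodingSniffer
{
    // Returns a rough guess: "utf-16", "utf-8", or "single-byte".
    public static string Guess(byte[] data)
    {
        // UTF-16 BOMs: FF FE (little endian) or FE FF (big endian).
        if (data.Length >= 2 &&
            ((data[0] == 0xFF && data[1] == 0xFE) ||
             (data[0] == 0xFE && data[1] == 0xFF)))
            return "utf-16";

        // UTF-8 BOM: EF BB BF.
        if (data.Length >= 3 &&
            data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return "utf-8";

        try
        {
            // Strict decoder: throws on byte sequences that are not valid UTF-8.
            // Note that plain ASCII is valid UTF-8 and is reported as such.
            new UTF8Encoding(false, true).GetString(data);
            return "utf-8";
        }
        catch (DecoderFallbackException)
        {
            // High-bit bytes that do not form UTF-8 sequences: most likely a
            // legacy codepage, but the sketch cannot tell which one.
            return "single-byte";
        }
    }
}
```

For the file in question, `EncodingSniffer.Guess(File.ReadAllBytes("189.dat"))` would be expected to report a single-byte encoding; picking 1253 over other codepages still requires knowing that the text is Greek.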

File and stream methods that could be affected by encoding or culture in .NET always have an overload that accepts an Encoding or CultureInfo parameter.
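As an illustration of those overloads, the following sketch re-saves a text file from one encoding to another using the `StreamReader(string, Encoding)` and `StreamWriter(string, bool, Encoding)` constructors. The `FileTranscoder` class and its parameter names are hypothetical; line endings are rewritten to the platform default as a side effect.

```csharp
using System.IO;
using System.Text;

public static class FileTranscoder
{
    // Re-saves a text file from one encoding to another, line by line.
    public static void Transcode(string sourcePath, Encoding sourceEncoding,
                                 string targetPath, Encoding targetEncoding)
    {
        using (var reader = new StreamReader(sourcePath, sourceEncoding))
        using (var writer = new StreamWriter(targetPath, false, targetEncoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                writer.WriteLine(line); // uses Environment.NewLine
        }
    }
}
```

For example, `FileTranscoder.Transcode("189.dat", Encoding.GetEncoding(1253), "189.utf8.txt", Encoding.UTF8)` would produce a UTF-8 copy (the output file name here is made up) that File.ReadAllText can read without an explicit encoding.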

Console

Writing the output to the console may display garbled text. The text isn't actually converted; it is just displayed incorrectly.

While the console can display Unicode text, by default it assumes the system's codepage. In the past it couldn't even use UTF-8 as the system codepage, because there was no such option in the settings. After all, the label for the system locale setting is "Language used for non-Unicode programs".

The latest Windows 10 Insider releases offer UTF-8 as the system codepage as a beta option.

To ensure Unicode text appears properly in the console, set its output encoding to UTF-8, e.g.:

// enc is the Windows-1253 encoding obtained earlier with Encoding.GetEncoding(1253)
var text = File.ReadAllText(@"189.dat", enc);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(text);
Panagiotis Kanavos answered Oct 22 '22 05:10


The code below converts the C# string, which is UTF-16, to an 8-bit representation using the common ISO-8859-1 codepage. It then decodes those bytes back to UTF-16 using the Greek codepage Windows-1253. The result is ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ, as you want.

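string erroneousString = "ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ";
// Recover the original bytes: ISO-8859-1 maps each of these characters to a single byte.
byte[] asIso88591Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(erroneousString);
// Decode the same bytes with the codepage the text was actually written in.
string asGreekString = Encoding.GetEncoding("windows-1253").GetString(asIso88591Bytes);
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(asGreekString);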
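The same round trip can be wrapped in a small helper. It only works because ISO-8859-1 assigns a character to every byte value from 0x00 to 0xFF, so the wrong decoding lost no information; if the bytes had been decoded as UTF-8, the invalid ones would already have been replaced and could not be recovered this way. A minimal sketch (the `MojibakeRepair` class and `Redecode` method are hypothetical names):

```csharp
using System.Text;

public static class MojibakeRepair
{
    // Recovers text that was decoded with the wrong single-byte encoding.
    // wronglyUsed: the encoding that produced the garbled string.
    // actual:      the encoding the bytes were originally written in.
    public static string Redecode(string garbled, Encoding wronglyUsed, Encoding actual)
    {
        // Step 1: turn the garbled characters back into the original bytes.
        byte[] originalBytes = wronglyUsed.GetBytes(garbled);

        // Step 2: decode those bytes with the correct encoding.
        return actual.GetString(originalBytes);
    }
}
```

With the encodings from this answer, `MojibakeRepair.Redecode("ÇËÅ", Encoding.GetEncoding("ISO-8859-1"), Encoding.GetEncoding("windows-1253"))` yields "ΗΛΕ", since the bytes 0xC7, 0xCB and 0xC5 are Η, Λ and Ε in Windows-1253. On .NET Core and .NET 5+ this assumes `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` has been called first.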

Edit: Since your file is encoded in an 8-bit format, you need to specify the codepage when reading it. Use this:

string fileContents = File.ReadAllText("189.dat", Encoding.GetEncoding("windows-1253"));
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(fileContents);

That reads the content as:

'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.'
'EH','Hb ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ'
'EP','PROTEIN ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ'
'FB','HAEMATOLOGY - FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC'
'FR','FREE TEXT',
'GT','GLUCOSE TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ'
'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ ΔΕΛΤΙΟ'
'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ'
'SE','SEMEN ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ'
'SP','SPECIAL PATHOLOGY','SPECIAL PATHOLOGY'
'ST','STOOL EXAMINATION ','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ'
'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL'
'UR','URINE ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ'
'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ ΝΕΡΟΥ'
'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'

Hans Kilian answered Oct 22 '22 05:10