Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# Encoding.Converting Latin to Hebrew

I'm trying to fetch and parse an online excel document which is written in hebrew but unfortunately in a non-hebrew encoding.

As an example I'm trying to convert the following string: "âìéåï_1", which serves as the 1st sheet name to hebrew using C# code, but I'm unable to do so.

I know the above is convertible, since when I open it up in NotePad++ and select Encoding/Character Sets/Hebrew/Windows 1255, I can see: "גליון_1" which is the correct hebrew representation of the above string.

I'm using the below code

            string str = "âìéåï_1";

            Encoding windows = Encoding.GetEncoding("Windows-1255");
            Encoding ascii = Encoding.GetEncoding("Windows-1252");
            byte[] asciiBytes = ascii.GetBytes(str);
            byte[] windowsBytes = Encoding.Convert(ascii, windows, asciiBytes);

            char[] windowsChars = new char[windows.GetCharCount(windowsBytes, 0, windowsBytes.Length)];
            windows.GetChars(windowsBytes, 0, windowsBytes.Length, windowsChars, 0);
            string windowsString = new string(windowsChars);

I assumed that the encoding of the origin string is Windows-1252 since when I paste it in NotePad++ and change the encoding to Windows-1252 the string remains the same...

I'm probably doing something wrong here, anyone know how to convert the above correctly?

Thanks,

Mikey

like image 505
Mikey S. Avatar asked Aug 29 '11 22:08

Mikey S.


1 Answers

const string Str = "âìéåï_1";

Encoding latinEncoding = Encoding.GetEncoding("Windows-1252");
Encoding hebrewEncoding = Encoding.GetEncoding("Windows-1255");

byte[] latinBytes = latinEncoding.GetBytes(Str);

string hebrewString = hebrewEncoding.GetString(latinBytes);

hebrewString:

גליון_1

In your supplied example "Window-1252" is not actualy ASCII, it is extended ASCII, and for some reason Encoding.Convert with these two encodings cannot convert extended range ASCII, so all +127 characters are converted as 63 (i.e. ?). When "converting" from one extended ASCII character byte[] to another, I would expect the bytes to be the same, it is only when you convert them to a .Net unicode string I would expect them to be different. Not sure why Convert is converting +127 chars to '?'.

like image 186
Tim Lloyd Avatar answered Oct 02 '22 05:10

Tim Lloyd