C# UNICODE to ANSI conversion

I need your help with something that puzzles me when working with UNICODE encoding in the .NET Framework...

I have to interface with some customer data systems which are non-UNICODE applications, and those customers are companies from all over the world (Chinese, Korean, Russian, ...). So they have to provide me with an 8-bit ANSI file, which will be encoded with their Windows code page.

So, if a Greek customer sends me a text file containing 'Σ' (the letter sigma, '\u03A3') in a product name, I will get the letter corresponding to ANSI code 211, rendered in my own code page. My computer runs a French Windows, which means the code page is Windows-1252, so I will see 'Ó' in its place in this text file... OK.
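To make the byte-level picture concrete, here is a minimal sketch that decodes the same byte under two code pages. The `CodePageDemo` class and `DecodeByte` helper are illustrative names, not part of the import code, and the sketch assumes the .NET Framework, where the Windows code pages are built in (on .NET Core and later they would first have to be registered through `CodePagesEncodingProvider`).

```csharp
using System.Text;

public static class CodePageDemo
{
    // Decodes a single ANSI byte using the given Windows code page.
    // Assumption: the code page is available (built in on .NET Framework).
    public static string DecodeByte(byte value, int codePage)
    {
        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }
}
```

With this helper, `CodePageDemo.DecodeByte(211, 1253)` gives "Σ" while `CodePageDemo.DecodeByte(211, 1252)` gives "Ó": the byte in the file is the same, only its interpretation changes.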

I know this customer is Greek, so I can read his file by forcing the windows-1253 code page in my import parameters.

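/// <summary>
/// Reinterprets a string that was decoded with the default Windows code page
/// as text encoded in the given code page
/// </summary>
/// <param name="value">Text as decoded with the current Windows (ANSI) code page</param>
/// <param name="codePage">Code page the source file was actually encoded in, e.g. 1253</param>
/// <returns>The text decoded with the given code page, or the input unchanged if it is null or empty</returns>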
public static string ToUnicode(string value, int codePage)
{
    Encoding windows = Encoding.Default;
    Encoding unicode = Encoding.Unicode;
    Encoding sp = Encoding.GetEncoding(codePage);
    if (sp != null && !String.IsNullOrEmpty(value))
    {
        // First get bytes in windows encoding
        byte[] wbytes = windows.GetBytes(value);

        // Check if CodePage to use is different from current Windows one
        if (windows.CodePage != sp.CodePage)
        {
            // Convert to Unicode using SP code page
            byte[] ubytes = Encoding.Convert(sp, unicode, wbytes);
            return unicode.GetString(ubytes);
        }
        else
        {
            // Directly convert to Unicode using windows code page
            byte[] ubytes = Encoding.Convert(windows, unicode, wbytes);
            return unicode.GetString(ubytes);
        }
    }
    else
    {
        return value;
    }
}

Well, in the end I get 'Σ' in my application and I am able to save it into my SQL Server database. Now my application has to perform some complex computations, and then I have to give this file back to the customer with an automatic export...

So my problem is that I have to perform a UNICODE => ANSI conversion. But this is not as simple as I thought at the beginning...

I don't want to save the code page used during import, so my first idea was to convert UNICODE to windows-1252 and then automatically send the file to the customers. They will read the exported text file with their own code page, so this idea looked promising to me.

But the problem is that the conversion in this direction behaves strangely... Here are two different examples:

1st example (я)

char ya = '\u044F';
string strYa = Char.ConvertFromUtf32(ya);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1251 = System.Text.Encoding.GetEncoding(1251);

string strYa1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strYa)));
string strYa1251 = ansi1251.GetString(System.Text.Encoding.Convert(unicode, ansi1251, unicode.GetBytes(strYa)));

So strYa1252 contains '?', whereas strYa1251 contains the valid char 'я'. So it seems it is impossible to convert to ANSI if the right code page is not passed to the Convert() function... So nothing in the Encoding class helps the user get equivalences between ANSI and UNICODE code points? :\

2nd example (Σ)

char sigma = '\u03A3';
string strSigma = Char.ConvertFromUtf32(sigma);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1253 = System.Text.Encoding.GetEncoding(1253);

string strSigma1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strSigma)));
string strSigma1253 = ansi1253.GetString(System.Text.Encoding.Convert(unicode, ansi1253, unicode.GetBytes(strSigma)));

At this point, I have the correct 'Σ' in the strSigma1253 string, but I also have 'S' in strSigma1252. As indicated at the beginning, I should have 'Ó' if the ANSI code has been found, or '?' if the character has not been found, but not 'S'. Why? Yes, of course, a linguist could say that 'S' is equivalent to the Greek sigma character because they sound the same in both alphabets, but they don't have the same ANSI code!

So how can the Convert() function in the .NET framework manage this kind of equivalence?

And does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?

asked Jun 10 '13 11:06

alex

1 Answer

I should have ...'?' if the character has not been found, but not 'S'. Why?

This is known as 'best-fit' encoding, and in most cases it's a bad thing. When Windows can't encode a character to the target code page (because Σ does not exist in code page 1252), it makes best efforts to map the character to something a bit like it. This can mean losing the diacritical marks (ë→e), or mapping to a cognate (Σ→S), a character that's related (≤→=), a character that's unrelated but looks a bit similar (∞→8), or whatever other madcap replacement seemed like a good idea at the time but turns out to be culturally or mathematically offensive in practice.

You can see the tables for cp1252, including that Sigma mapping, here.

Apart from being a silent mangling of dubious usefulness, it also has some quite bad security implications. You should be able to stop it happening by setting EncoderFallback to ReplacementFallback or ExceptionFallback.
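As a rough sketch of what that could look like, the helpers below request a code page encoding with an explicit fallback, so that best-fit mapping is not used. The `StrictAnsi` class and its method names are made up for illustration. The replacement variant constructs an `EncoderReplacementFallback("?")` explicitly, an assumption chosen here so that the substitute character is certain to exist in the target code page. The sketch also assumes the .NET Framework, where code pages such as 1252 are available without extra registration.

```csharp
using System.Text;

public static class StrictAnsi
{
    // Encoding that writes "?" for characters missing from the code page,
    // instead of silently best-fitting them to a look-alike.
    public static Encoding Replacing(int codePage)
    {
        return Encoding.GetEncoding(
            codePage,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
    }

    // Encoding that throws EncoderFallbackException (or DecoderFallbackException
    // when decoding) as soon as something cannot be converted.
    public static Encoding Throwing(int codePage)
    {
        return Encoding.GetEncoding(
            codePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }
}
```

With these, `StrictAnsi.Replacing(1252).GetBytes("Σ")` yields the single byte 0x3F ('?'), and `StrictAnsi.Throwing(1252).GetBytes("Σ")` throws rather than producing 'S'. For an export pipeline the throwing variant is probably the safer default, since a failed export is easier to notice than a quietly altered product name.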

does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?

You'll have to keep a table of encodings for each customer. Read their input files using that encoding to decode; write their output files using the same encoding.
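A minimal sketch of such a table, assuming the customers can be identified by some string key: the `CustomerFiles` class, the customer IDs and the `Import`/`Export` method names are all hypothetical, and in a real system the mapping would more likely live in the database than in code. It again assumes the .NET Framework for code page availability.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CustomerFiles
{
    // Hypothetical customer IDs mapped to the Windows code page each one uses.
    private static readonly Dictionary<string, int> CodePages =
        new Dictionary<string, int>
        {
            { "greek-customer", 1253 },
            { "russian-customer", 1251 },
        };

    // Looks up the customer's encoding; throws on unmappable characters
    // rather than substituting anything silently.
    public static Encoding EncodingFor(string customerId)
    {
        return Encoding.GetEncoding(
            CodePages[customerId],
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // Reads a customer's file, decoding it with that customer's code page.
    public static string Import(string customerId, string path)
    {
        return File.ReadAllText(path, EncodingFor(customerId));
    }

    // Writes a customer's file, encoding it with the same code page.
    public static void Export(string customerId, string path, string text)
    {
        File.WriteAllText(path, text, EncodingFor(customerId));
    }
}
```

Because import and export go through the same lookup, a 'Σ' read from the Greek customer's file is written back as byte 211, which is what their non-UNICODE application expects.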

(For sanity, set new customers to UTF-8 and document that this is the preferred encoding.)

answered Sep 18 '22 00:09

bobince

Tags: c#, unicode, ansi