How to encode and decode Broken Chinese/Unicode characters?

I've tried googling but couldn't work out which charset the text below belongs to:

å…·æœ‰éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®

But when I put <meta http-equiv="Content-Type" Content="text/html; charset=utf-8"> and that string into an HTML file, the Chinese characters displayed properly:

具有靜電產生裝置之影像輸入裝置

So my question is:

What tools can I use to detect the character set of this text?
And how do I convert/encode/decode them properly in C#?

Update: For completeness' sake, I've added this test.

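using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

// The class name is a placeholder; the original snippet omitted the class declaration.
[TestClass]
public class EncodingTests
{
    [TestMethod]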
    public void TestMethod1()
    {
        string encodedText = "å…·æœ‰éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®";
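        Encoding utf8 = new UTF8Encoding();
        // On .NET Core / .NET 5+, this throws unless CodePagesEncodingProvider is registered first.
        Encoding windows1252 = Encoding.GetEncoding("Windows-1252");

        // Re-encode the mis-decoded text to get the original bytes back.
        byte[] postBytes = windows1252.GetBytes(encodedText);

        // Decode those bytes as what they really are: UTF-8.
        string decodedText = utf8.GetString(postBytes);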
        string actualText = "具有靜電產生裝置之影像輸入裝置";
        Assert.AreEqual(actualText, decodedText);
    }
}

asked Jun 10 '12 09:06

melaos

1 Answer

When you save the "bad" string in an HTML file with a meta tag declaring UTF-8, your text editor saves the file with Windows-1252 encoding, but the browser reads the file and interprets it as UTF-8. The "bad" string is what you get when UTF-8 bytes are incorrectly decoded as Windows-1252, so encoding it as Windows-1252 and then decoding it as UTF-8 reverses the mistake.

Here's an example:

using System.Text;
using System.Windows.Forms;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "具有靜電產生裝置之影像輸入裝置"; // The intended Unicode text
            Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
            Encoding Utf8 = Encoding.UTF8;
            byte[] utf8Bytes = Utf8.GetBytes(s); // Unicode -> UTF-8 bytes
            string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode the UTF-8 bytes as Windows-1252
            MessageBox.Show(badDecode, "Mis-decoded");  // Shows your garbage string.
            string goodDecode = Utf8.GetString(utf8Bytes); // Correctly decode as UTF-8
            MessageBox.Show(goodDecode, "Correctly decoded");

            // Recovering from the bad decode: re-encode as Windows-1252 to get
            // the original bytes back, then decode them as UTF-8.
            byte[] originalBytes = Windows1252.GetBytes(badDecode);
            goodDecode = Utf8.GetString(originalBytes);
            MessageBox.Show(goodDecode, "Re-decoded");
        }
    }
}
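
It may help to look at the example above at the byte level for a single character. The following is only a sketch: the class and method names (ByteView, Utf8Hex, MisDecode) are made up for illustration, and it assumes a modern .NET runtime where Windows-1252 has to be registered through CodePagesEncodingProvider before it can be used.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class ByteView
{
    static ByteView()
    {
        // Assumption: running on .NET Core / .NET 5+, where Windows-1252
        // is only available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Returns the UTF-8 bytes of s as space-separated hex.
    public static string Utf8Hex(string s) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(s)).Replace("-", " ");

    // Returns what s turns into when its UTF-8 bytes are read as Windows-1252.
    public static string MisDecode(string s) =>
        Encoding.GetEncoding("Windows-1252").GetString(Encoding.UTF8.GetBytes(s));
}
```

For the first character of the question's text, ByteView.Utf8Hex("具") gives "E5 85 B7", and ByteView.MisDecode("具") gives "å…·", which is exactly how the garbled string begins: one Chinese character becomes three Windows-1252 characters, one per byte.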

Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.

The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with and you have to resort to these games to fix the problem.
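
If you do have to repair stored data this way, it may be safer to apply the round trip only when it is lossless. The sketch below is one possible approach, not a definitive one: MojibakeRepair and TryRepair are hypothetical names, it assumes the damage was specifically UTF-8 read as Windows-1252, and it assumes a modern .NET runtime where the code pages provider must be registered. Both encodings are created with exception fallbacks so that a string which does not fit the pattern is left alone instead of being silently filled with replacement characters.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class MojibakeRepair
{
    // Strict UTF-8: throws on invalid byte sequences instead of emitting U+FFFD.
    static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);
    static readonly Encoding StrictWindows1252;

    static MojibakeRepair()
    {
        // Assumption: .NET Core / .NET 5+, where Windows-1252 needs this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        StrictWindows1252 = Encoding.GetEncoding(
            "Windows-1252",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // Tries to undo "UTF-8 bytes decoded as Windows-1252".
    // Returns false and hands back the input unchanged if the text cannot be
    // re-encoded as Windows-1252 or the resulting bytes are not valid UTF-8.
    public static bool TryRepair(string text, out string repaired)
    {
        try
        {
            byte[] bytes = StrictWindows1252.GetBytes(text);
            repaired = StrictUtf8.GetString(bytes);
            return true;
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException and DecoderFallbackException both
            // derive from ArgumentException.
            repaired = text;
            return false;
        }
    }
}
```

With this sketch, TryRepair("Ã©", out var fixedText) succeeds and yields "é", while text that is already correct Chinese, or Windows-1252 text such as "café", is rejected and returned unchanged. Plain ASCII passes through as-is, and a string that merely looks like mojibake would also be "repaired", so treat a success as a strong hint rather than proof.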

answered Sep 18 '22 16:09

Mark Tolonen


Tags: c#, model-view-controller, unicode, cjk
