I've tried googling around but wasn't able to find what charset that this text below belongs to:
具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®
But putting <meta http-equiv="Content-Type" Content="text/html; charset=utf-8">
and keeping that string into an HTML file, I was able to view the Chinese characters properly:
具有靜電產生裝置之影像輸入裝置
So my question is:
What tools can I use to detect the character set of this text?
And how do I convert/encode/decode them properly in C#?
Updates: For completion sake, I've updated this test.
[TestMethod]
public void TestMethod1()
{
string encodedText = "具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®";
Encoding utf8 = new UTF8Encoding();
Encoding window1252 = Encoding.GetEncoding("Windows-1252");
byte[] postBytes = window1252.GetBytes(encodedText);
string decodedText = utf8.GetString(postBytes);
string actualText = "具有靜電產生裝置之影像輸入裝置";
Assert.AreEqual(actualText, decodedText);
}
}
UTF-8 is a character encoding system. It lets you represent characters as ASCII text, while still allowing for international characters, such as Chinese characters. As of the mid 2020s, UTF-8 is one of the most popular encoding systems.
English and the other Latin languages use ASCII encoding; Simplified Chinese uses GB2312 encoding, Traditional Chinese uses Big 5 encoding, and so forth. In other words, a computer using Big 5 encoding cannot read computer code in GB2312 or ASCII encoding.
Unicode/UTF-8 characters include: Chinese characters. any non-Latin scripts (Hebrew, Cyrillic, Japanese, etc.) symbols.
Unicode uses two encoding forms: 8-bit and 16-bit, based on the data type of the data that is being that is being encoded. The default encoding form is 16-bit, where each character is 16 bits (2 bytes) wide. Sixteen-bit encoding form is usually shown as U+hhhh, where hhhh is the hexadecimal code point of the character.
What is happening when you save the "bad" string in a text file with a meta tag declaring the correct encoding is that your text editor is saving the file with Windows-1252 encoding, but the browser is reading the file and interpreting it as UTF-8. Since the "bad" string is incorrectly decoded UTF-8 bytes with the Windows-1252 encoding, you are reversing the process by encoding the file as Windows-1252 and decoding as UTF-8.
Here's an example:
using System.Text;
using System.Windows.Forms;
namespace Demo
{
class Program
{
static void Main(string[] args)
{
string s = "具有靜電產生裝置之影像輸入裝置"; // Unicode
Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
Encoding Utf8 = Encoding.UTF8;
byte[] utf8Bytes = Utf8.GetBytes(s); // Unicode -> UTF-8
string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode as Latin1
MessageBox.Show(badDecode,"Mis-decoded"); // Shows your garbage string.
string goodDecode = Utf8.GetString(utf8Bytes); // Correctly decode as UTF-8
MessageBox.Show(goodDecode, "Correctly decoded");
// Recovering from bad decode...
byte[] originalBytes = Windows1252.GetBytes(badDecode);
goodDecode = Utf8.GetString(originalBytes);
MessageBox.Show(goodDecode, "Re-decoded");
}
}
}
Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.
The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with and you have to resort to these games to fix the problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With