I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.
For now, my implementation is the following:
    public static string DecodeFromUtf8(this string utf8String)
    {
        // read the string as UTF-8 bytes.
        byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);
        // convert them into unicode bytes.
        byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);
        // builds the converted string.
        return Encoding.Unicode.GetString(encodedBytes);
    }
I am playing with the word "déjà". I have converted it into UTF-8 through this online tool, and so I started to test my method with the string "déjÃ".
Unfortunately, with this implementation the string just remains the same.
Where am I wrong?
The difference between Unicode and UTF-8: Unicode is a character set. UTF-8 is an encoding. Unicode is a list of characters, each assigned a unique number (its code point).
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
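To make the code point versus encoding distinction concrete, here is a minimal sketch that prints the UTF-8 bytes of a string as hex. The class and method names (CodePointDemo, Utf8Hex) are made up for illustration and are not part of the question.

```csharp
using System;
using System.Text;

public static class CodePointDemo
{
    // Hypothetical helper: returns the UTF-8 bytes of `text`
    // as space-separated hex pairs, e.g. "C3 A9" for U+00E9.
    public static string Utf8Hex(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        // BitConverter.ToString yields "C3-A9"; swap the dashes for spaces.
        return BitConverter.ToString(utf8Bytes).Replace("-", " ");
    }
}
```

The single code point U+00E9 (é) becomes the two bytes C3 A9, so CodePointDemo.Utf8Hex("d\u00E9j\u00E0") gives "64 C3 A9 6A C3 A0": four characters, six bytes.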
So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.
    public static string DecodeFromUtf8(this string utf8String)
    {
        // copy the string as UTF-8 bytes.
        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i) {
            //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
            utf8Bytes[i] = (byte)utf8String[i];
        }
        return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
    }
    DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
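The range check is only a commented-out assertion in the method above, so an out-of-range char would be silently truncated by the cast. A possible variant that enforces the check at run time might look like this. The names Utf8Repair and DecodeFromUtf8Checked are hypothetical, and throwing ArgumentException is just one reasonable choice.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    // Sketch: same idea as DecodeFromUtf8, but refuses input whose
    // code units cannot be UTF-8 bytes instead of truncating them.
    public static string DecodeFromUtf8Checked(string utf8String)
    {
        if (utf8String == null) throw new ArgumentNullException(nameof(utf8String));

        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i)
        {
            char c = utf8String[i];
            if (c > 0xFF)
            {
                throw new ArgumentException(
                    $"Code unit U+{(int)c:X4} at index {i} does not fit in a byte.",
                    nameof(utf8String));
            }
            utf8Bytes[i] = (byte)c;
        }
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Utf8Repair.DecodeFromUtf8Checked("d\u00C3\u00A9j\u00C3\u00A0") returns "déjà", while a string containing "\u20AC" throws. A rejection like that usually suggests the text was mangled through something other than a byte-to-char copy, for example Windows-1252, where byte 0x80 becomes U+20AC.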
This is easy; however, it would be best to find the root cause: the location where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
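To see how such a mistake produces the mangled string, the following sketch encodes text as UTF-8 and then decodes the bytes with a single-byte encoding. ISO-8859-1 is used here only as an assumed stand-in for whatever wrong encoding was actually involved, and the names MojibakeDemo and DecodeWithWrongEncoding are hypothetical.

```csharp
using System.Text;

public static class MojibakeDemo
{
    // Reproduces the suspected bug: UTF-8 bytes decoded as if they
    // were ISO-8859-1, so every byte turns into one char.
    public static string DecodeWithWrongEncoding(string original)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        // Assumption: the real culprit may be a different code page.
        Encoding wrong = Encoding.GetEncoding("ISO-8859-1");
        return wrong.GetString(utf8Bytes, 0, utf8Bytes.Length);
    }
}
```

MojibakeDemo.DecodeWithWrongEncoding("d\u00E9j\u00E0") returns "d\u00C3\u00A9j\u00C3\u00A0", which is exactly the test string used above. Note that its last char is U+00A0, a no-break space, which is easy to lose when copying the text around.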
Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:
    public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
    {
        // the inverse of `mistake.GetString(originalBytes);`
        byte[] originalBytes = mistake.GetBytes(mangledString);
        return correction.GetString(originalBytes);
    }
    // On .NET Core / .NET 5+, code page 1252 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8); // déjà
I have string that displays UTF-8 encoded characters
There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode codepoint. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.
Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.
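As a rough illustration of that advice, the sketch below keeps UTF-8 data in a byte[] and only converts at the boundary. The class and method names (Utf8RoundTrip, ToUtf8, FromUtf8) are invented for this example.

```csharp
using System.Text;

public static class Utf8RoundTrip
{
    // Encode when the text leaves the program (file, socket, ...).
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Decode with the same encoding when the bytes come back in.
    public static string FromUtf8(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Utf8RoundTrip.FromUtf8(new byte[] { 0x64, 0xC3, 0xA9, 0x6A, 0xC3, 0xA0 }) returns "déjà", and FromUtf8(ToUtf8(s)) gives back s for any well-formed string.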