How to convert a UTF-8 string into Unicode?

Tags: string, c#, unicode, utf-8

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is the following:

public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}

I am playing with the word "déjà". I converted it to UTF-8 with an online tool, and then tested my method with the resulting string "dÃ©jÃ" (the string ends with a non-breaking space, U+00A0, which is not visible here).

Unfortunately, with this implementation the string just remains the same.

Where am I wrong?

994

asked Jul 02 '12 12:07

remio

2 Answers

The issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You need to verify that each code unit is within the range of a byte, copy those values into a byte array, and then decode that UTF-8 byte sequence into UTF-16.

public static string DecodeFromUtf8(this string utf8String)
{
    // Copy each 16-bit code unit of the string into a byte.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i = 0; i < utf8String.Length; ++i)
    {
        // Debug.Assert(0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    // Decode the recovered bytes as UTF-8.
    return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

This is easy; however, it would be best to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is code that converts bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
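To see how such a string comes about, here is a minimal sketch that reproduces the mistake on purpose. It assumes the wrong encoding was ISO-8859-1 (Latin-1), which maps every byte to the code point with the same value; MojibakeDemo and Mangle are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes the text as UTF-8, then decodes those bytes with the wrong
    // single-byte encoding (assumed here to be ISO-8859-1).
    public static string Mangle(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.Mangle("d\u00E9j\u00E0") returns "d\u00C3\u00A9j\u00C3\u00A0", the same six-character string the question starts from.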

Alternatively, suppose you know which incorrect encoding was used to produce the string, and that the incorrect transformation was lossless (usually the case if the incorrect encoding is a single-byte encoding). Then you can apply the inverse encoding step to get the original UTF-8 data back, and do the correct conversion from UTF-8 bytes:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8); // déjà
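One practical caveat: on .NET Core and .NET 5+, Encoding.GetEncoding(1252) throws unless the code pages provider has been registered first. The sketch below assumes such a runtime (where CodePagesEncodingProvider is available) and assumes the wrong encoding really was Windows-1252; EncodingRepair and FromWindows1252Mojibake are hypothetical names.

```csharp
using System;
using System.Text;

public static class EncodingRepair
{
    // Reverses a "UTF-8 bytes decoded as Windows-1252" mistake.
    public static string FromWindows1252Mojibake(string mangled)
    {
        // Makes legacy code pages such as 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Inverse of the mistaken decode: recover the original UTF-8 bytes.
        byte[] originalBytes = windows1252.GetBytes(mangled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

This variant also handles strings the byte-copy loop cannot. For example, the euro sign mangled through Windows-1252 becomes "\u00E2\u201A\u00AC", which contains a character above 255; FromWindows1252Mojibake maps it back to "\u20AC".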

116

answered Nov 03 '22 01:11

bames53

I have string that displays UTF-8 encoded characters

There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode codepoint. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

Only handle UTF-8 encoded text as byte[], and use UTF8Encoding.GetString() to convert it to a string.
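As a rough illustration of that advice, the following sketch keeps UTF-8 data in a byte[] and converts only at the boundary. It uses a UTF8Encoding configured to throw on invalid input, which is one possible choice rather than a requirement; Utf8Boundary, ToUtf8 and FromUtf8 are hypothetical names.

```csharp
using System;
using System.Text;

public static class Utf8Boundary
{
    // No byte order mark on output; throw on invalid bytes instead of
    // silently substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // string (UTF-16 in memory) -> UTF-8 bytes, e.g. before writing to a file or socket.
    public static byte[] ToUtf8(string text) => StrictUtf8.GetBytes(text);

    // UTF-8 bytes -> string, e.g. right after reading from a file or socket.
    public static string FromUtf8(byte[] utf8Bytes) => StrictUtf8.GetString(utf8Bytes);
}
```

With this setup, Utf8Boundary.FromUtf8 raises a DecoderFallbackException for malformed input such as a lone 0xC3 byte, so a wrong-encoding problem surfaces at the point of conversion instead of later as garbled text.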

answered Nov 02 '22 23:11

Hans Passant
