Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert String (UTF-16) to UTF-8 in C#

I need to convert a string to UTF-8 in C#. I've already try many ways but none works as I wanted. I converted my string into a byte array and then to try to write it to an XML file (which encoding is UTF-8....) but either I got the same string (not encoded at all) either I got a list of byte which is useless.... Does someone face the same issue ?

Edit : This is some of the code I used :

str= "testé";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(str);
return Encoding.UTF8.GetString(utf8Bytes);

The result is "testé" or I expected something like "testé"...

like image 848
Celero Avatar asked Jun 01 '11 09:06

Celero


People also ask

How do you convert UTF 16 to UTF-8 in C++?

To convert UTF-8 to UTF-16 just call Utf32To16(Utf8To32(str)) and to convert UTF-16 to UTF-8 call Utf32To8(Utf16To32(str)) .

How do I convert string to UTF?

In order to convert a String into UTF-8, we use the getBytes() method in Java. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. where charsetName is the specific charset by which the String is encoded into an array of bytes.


3 Answers

If you want a UTF8 string, where every byte is correct ('Ö' -> [195, 0] , [150, 0]), you can use the followed:

public static string Utf16ToUtf8(string utf16String)
{
   /**************************************************************
    * Every .NET string will store text with the UTF16 encoding, *
    * known as Encoding.Unicode. Other encodings may exist as    *
    * Byte-Array or incorrectly stored with the UTF16 encoding.  *
    *                                                            *
    * UTF8 = 1 bytes per char                                    *
    *    ["100" for the ansi 'd']                                *
    *    ["206" and "186" for the russian 'κ']                   *
    *                                                            *
    * UTF16 = 2 bytes per char                                   *
    *    ["100, 0" for the ansi 'd']                             *
    *    ["186, 3" for the russian 'κ']                          *
    *                                                            *
    * UTF8 inside UTF16                                          *
    *    ["100, 0" for the ansi 'd']                             *
    *    ["206, 0" and "186, 0" for the russian 'κ']             *
    *                                                            *
    * We can use the convert encoding function to convert an     *
    * UTF16 Byte-Array to an UTF8 Byte-Array. When we use UTF8   *
    * encoding to string method now, we will get a UTF16 string. *
    *                                                            *
    * So we imitate UTF16 by filling the second byte of a char   *
    * with a 0 byte (binary 0) while creating the string.        *
    **************************************************************/

    // Storage for the UTF8 string
    string utf8String = String.Empty;

    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Fill UTF8 bytes inside UTF8 string
    for (int i = 0; i < utf8Bytes.Length; i++)
    {
        // Because char always saves 2 bytes, fill char with 0
        byte[] utf8Container = new byte[2] { utf8Bytes[i], 0 };
        utf8String += BitConverter.ToChar(utf8Container, 0);
    }

    // Return UTF8
    return utf8String;
}

In my case the DLL request is a UTF8 string too, but unfortunately the UTF8 string must be interpreted with UTF16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–' which is 150 has to be converted to the UTF16 '–' which is 8211. If you have this case too, you can use the following instead:

public static string Utf16ToUtf8(string utf16String)
{
    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Return UTF8 bytes as ANSI string
    return Encoding.Default.GetString(utf8Bytes);
}

Or the Native-Method:

[DllImport("kernel32.dll")]
private static extern Int32 WideCharToMultiByte(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPWStr)] String lpWideCharStr, Int32 cchWideChar, [Out, MarshalAs(UnmanagedType.LPStr)] StringBuilder lpMultiByteStr, Int32 cbMultiByte, IntPtr lpDefaultChar, IntPtr lpUsedDefaultChar);

public static string Utf16ToUtf8(string utf16String)
{
    Int32 iNewDataLen = WideCharToMultiByte(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf16String, utf16String.Length, null, 0, IntPtr.Zero, IntPtr.Zero);
    if (iNewDataLen > 1)
    {
        StringBuilder utf8String = new StringBuilder(iNewDataLen);
        WideCharToMultiByte(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf16String, -1, utf8String, utf8String.Capacity, IntPtr.Zero, IntPtr.Zero);

        return utf8String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf8ToUtf16. Hope I could be of help.

like image 97
MEN Avatar answered Sep 28 '22 11:09

MEN


A string in C# is always UTF-16, there is no way to "convert" it. The encoding is irrelevant as long as you manipulate the string in memory, it only matters if you write the string to a stream (file, memory stream, network stream...).

If you want to write the string to a XML file, just specify the encoding when you create the XmlWriter

like image 28
Thomas Levesque Avatar answered Sep 28 '22 10:09

Thomas Levesque


    private static string Utf16ToUtf8(string utf16String)
    {
        /**************************************************************
         * Every .NET string will store text with the UTF16 encoding, *
         * known as Encoding.Unicode. Other encodings may exist as    *
         * Byte-Array or incorrectly stored with the UTF16 encoding.  *
         *                                                            *
         * UTF8 = 1 bytes per char                                    *
         *    ["100" for the ansi 'd']                                *
         *    ["206" and "186" for the russian '?']                   *
         *                                                            *
         * UTF16 = 2 bytes per char                                   *
         *    ["100, 0" for the ansi 'd']                             *
         *    ["186, 3" for the russian '?']                          *
         *                                                            *
         * UTF8 inside UTF16                                          *
         *    ["100, 0" for the ansi 'd']                             *
         *    ["206, 0" and "186, 0" for the russian '?']             *
         *                                                            *
         * We can use the convert encoding function to convert an     *
         * UTF16 Byte-Array to an UTF8 Byte-Array. When we use UTF8   *
         * encoding to string method now, we will get a UTF16 string. *
         *                                                            *
         * So we imitate UTF16 by filling the second byte of a char   *
         * with a 0 byte (binary 0) while creating the string.        *
         **************************************************************/

        // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
        byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
        char[] chars = (char[])Array.CreateInstance(typeof(char), utf8Bytes.Length);

        for (int i = 0; i < utf8Bytes.Length; i++)
        {
            chars[i] = BitConverter.ToChar(new byte[2] { utf8Bytes[i], 0 }, 0);
        }

        // Return UTF8
        return new String(chars);
    }

In the original post author concatenated strings. Every sting operation will result in string recreation in .Net. String is effectively a reference type. As a result, the function provided will be visibly slow. Don't do that. Use array of chars instead, write there directly and then convert result to string. In my case of processing 500 kb of text difference is almost 5 minutes.

like image 32
Pavel Kasiankou Avatar answered Sep 28 '22 10:09

Pavel Kasiankou