 

How can I transform string to UTF-8 in C#?

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.

Due to incorrect encoding, a piece of my string looks like this in Spanish:

AcciÃ³n

whereas it should look like this:

Acción

According to the answer to the question "How to know string encoding in C#", the string I receive should already be UTF-8, but it is being read with Encoding.Default (probably ANSI?).

I am trying to convert this string into real UTF-8, but one problem is that I can only see a subset of the Encoding class (the UTF8 and Unicode properties only), probably because I'm limited to the Windows Surface API.

I have tried some snippets I found on the internet, but none of them has worked so far for East Asian languages (e.g. Korean). One example is as follows:

var utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(myString);
myString = utf8.GetString(utfBytes, 0, utfBytes.Length);

I also tried extracting the string into a byte array and then using UTF8.GetString:

byte[] myByteArray = new byte[myString.Length];
for (int ix = 0; ix < myString.Length; ++ix)
{
    char ch = myString[ix];
    myByteArray[ix] = (byte) ch;
}
myString = Encoding.UTF8.GetString(myByteArray, 0, myString.Length);

Do you guys have any other ideas that I could try?

Gaara asked Dec 27 '12 15:12



1 Answer

Since you know the string was decoded with Encoding.Default, you can simply re-encode it with that encoding to recover the original bytes, then decode those bytes as UTF-8:

byte[] bytes = Encoding.Default.GetBytes(myString);
myString = Encoding.UTF8.GetString(bytes);
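
The same idea can be sketched as a self-contained helper. This is only an illustration under stated assumptions: the class name MojibakeRepair and the method Repair are hypothetical, the text is assumed to have been mis-decoded as ISO-8859-1 rather than whatever Encoding.Default happens to be, and Encoding.GetEncoding is assumed to be available on the target platform. Naming the source encoding explicitly matters because Encoding.Default is not the same everywhere; on .NET Core and later it returns UTF-8, which would turn the snippet above into a no-op.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper: undoes "UTF-8 bytes that were decoded as ISO-8859-1".
    public static string Repair(string garbled)
    {
        // ISO-8859-1 maps each byte 0x00-0xFF to the code point with the same
        // value, so GetBytes recovers the original byte sequence.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] originalBytes = latin1.GetBytes(garbled);

        // Decode the recovered bytes with the encoding they were really in.
        return Encoding.UTF8.GetString(originalBytes, 0, originalBytes.Length);
    }
}
```

For example, MojibakeRepair.Repair("AcciÃ³n") returns "Acción". If the text was instead mis-decoded with Windows-1252, some bytes become characters above U+00FF (such as € or •), and those are lost when encoding back through ISO-8859-1; in that case the Windows-1252 encoding itself would have to be used for the first step.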

Another thing to remember: if you use Console.WriteLine to output the strings, you should also set Console.OutputEncoding = System.Text.Encoding.UTF8; otherwise the UTF-8 strings are written in the console's default encoding (GBK, for example) and come out garbled.
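
A minimal sketch of that console setup, assuming a console application; the class name ConsoleSetup and the method UseUtf8Output are hypothetical names chosen for illustration:

```csharp
using System;
using System.Text;

public static class ConsoleSetup
{
    // Hypothetical helper: switch console output to UTF-8 before writing text.
    public static void UseUtf8Output()
    {
        Console.OutputEncoding = Encoding.UTF8;
    }
}
```

Call ConsoleSetup.UseUtf8Output() once at startup, before the first Console.WriteLine. Whether the characters then render correctly also depends on the console font supporting them.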

anothershrubery answered Oct 19 '22 09:10