
Java Decode double encoded utf-8 char

I am parsing a websocket message and, due to a bug in a specific socket.io version (unfortunately I don't have control over the server side), some of the payload is double encoded as UTF-8:

The correct value would be Wrocławskiej (note the letter ł, which is LATIN SMALL LETTER L WITH STROKE), but I actually get back WrocÅawskiej.

I already tried to decode and re-encode it in Java:

String str = new String(wrongEncoded.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

Unfortunately the string stays the same. Any idea how to do a double decoding in Java? I saw a Python version that converts it to raw_unicode first and then parses it again, but I don't know whether that works or whether there is a similar solution for Java. I already read through a couple of posts on the topic, but none helped.

Edit: To clarify, in Fiddler I receive the following byte sequence for the above-mentioned word:

WrocÃÂawskiej

byte[] arrOutput = { 0x57, 0x72, 0x6F, 0x63, (byte) 0xC3, (byte) 0x85, (byte) 0xC2, (byte) 0x82, 0x61, 0x77, 0x73, 0x6B, 0x69, 0x65, 0x6A };
Christoph S, asked Jun 29 '17 16:06


1 Answer

Your text was encoded as UTF-8, those bytes were then misinterpreted as ISO-8859-1, and the result was encoded as UTF-8 a second time.

Wrocławskiej in Unicode code points: 0057 0072 006f 0063 0142 0061 0077 0073 006b 0069 0065 006a
Encoded as UTF-8, that is: 57 72 6f 63 c5 82 61 77 73 6b 69 65 6a

In ISO-8859-1, c5 is Å and 82 has no printable character (Java decodes it to the invisible C1 control character U+0082).
As ISO-8859-1, those bytes are: WrocÅawskiej
Re-encoded as UTF-8, that becomes: 57 72 6f 63 c3 85 c2 82 61 77 73 6b 69 65 6a
Those are likely the bytes you are receiving.
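
The chain described above can be reproduced in a few lines. This is only a sketch of what the server presumably does. The class and method names (DoubleEncodingDemo, doubleEncode, toHex) are made up for illustration and do not come from socket.io.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Hypothetical helper simulating the faulty server: encode as UTF-8,
    // misread those bytes as ISO-8859-1, then encode the misread text as UTF-8 again.
    static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u0142 is LATIN SMALL LETTER L WITH STROKE
        System.out.println(toHex(doubleEncode("Wroc\u0142awskiej")));
        // prints: 57 72 6f 63 c3 85 c2 82 61 77 73 6b 69 65 6a
    }
}
```

The printed bytes match the sequence captured in Fiddler, which supports the assumption that ISO-8859-1 was the intermediate charset.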

So, to undo that, you need:

String s = new String(bytes, StandardCharsets.UTF_8);

// fix "double encoding"
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
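
As a self-contained check, the two steps can be run against the exact byte sequence from the question. The class name DoubleDecodingDemo and the helper fixDoubleEncoding are illustrative names only.

```java
import java.nio.charset.StandardCharsets;

public class DoubleDecodingDemo {

    // Reverses one round of "UTF-8 bytes misread as ISO-8859-1 and re-encoded as UTF-8".
    static String fixDoubleEncoding(byte[] received) {
        // First decode: yields the mojibake string (chars U+00C5, U+0082, ...)
        String once = new String(received, StandardCharsets.UTF_8);
        // ISO-8859-1 maps U+0000..U+00FF one-to-one to bytes, recovering the original UTF-8 bytes
        byte[] original = once.getBytes(StandardCharsets.ISO_8859_1);
        // Second decode: the real text
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Byte sequence from the question; values above 0x7F need a cast to byte
        byte[] received = {
            0x57, 0x72, 0x6F, 0x63, (byte) 0xC3, (byte) 0x85, (byte) 0xC2, (byte) 0x82,
            0x61, 0x77, 0x73, 0x6B, 0x69, 0x65, 0x6A
        };
        String fixed = fixDoubleEncoding(received);
        System.out.println(fixed.equals("Wroc\u0142awskiej")); // prints: true
    }
}
```

Only apply this to payloads known to be double encoded: getBytes(StandardCharsets.ISO_8859_1) replaces any character above U+00FF with '?', so running it on correctly encoded text such as Wrocławskiej would corrupt it.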
Andreas, answered Oct 22 '22 07:10