
GSON / JSON : Weird special char (umlaut) issue

While trying to process a JSON response with GSON (the output is from the flickr API in case you're asking) I encountered what I'd describe as a pretty weird encoding of certain special chars:

[Image: original JSON response]

Here's a hex view of it:

[Image: hex view of the original JSON response]

The 'u' followed by the 'double-dots' is what's supposed to be a German 'ü', and this is where my confusion starts. It's as if someone took the char and ripped it in half, encoding each of the 2 pieces. The following image shows the hex encoding of what I'd expect it to be in case the 'ü' was correctly encoded:

[Image: expected hex view]

Even more weird, in cases where I would expect problems to occur (namely, the Asian character set) everything seems to work fine, e.g. "title": "ナガレテユク・・・"

Questions:

  1. Is that some Flickr API oddity, or is it correct JSON encoding for the response? Or is the JSON correctly encoded and it's GSON that's failing to 're-assemble' the response into the original 'ü'? Or did the author of the title simply get it wrong on his end?
  2. How do I solve the problem (assuming it's either JSON or GSON that's messing around; I obviously can't do anything if it was the author)? How do I know which 'other' chars are affected (ö and ä come to mind, but there are probably more 'special cases')?
asked Oct 24 '11 10:10 by MrCC


1 Answer

What you're seeing there is a case of Unicode decomposition:

Characters like German umlauts can be expressed in two ways:

  • the more traditional precomposed form as a single character ü or
  • in decomposed form as the base character u followed by a combining diaeresis ̈_ (U+0308; I had to use an underscore here to make it show up because it's not supposed to stand alone, it's really just the two "hovering dots")
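
To make the two forms concrete, here is a minimal sketch that dumps both spellings of 'ü' as UTF-16 code units and as UTF-8 bytes. The class and method names (UmlautForms, codeUnits, utf8Hex) are made up for this illustration. The decomposed byte pattern is presumably what shows up in the question's hex view, though that can't be confirmed without the original images.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
final class UmlautForms {

    /** Renders each UTF-16 code unit of s as U+XXXX, separated by spaces. */
    static String codeUnits(String s) {
        return s.chars()
                .mapToObj(c -> String.format("U+%04X", c))
                .collect(Collectors.joining(" "));
    }

    /** Renders the UTF-8 encoding of s as space-separated hex bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00FC"; // single character: u with diaeresis
        String decomposed = "u\u0308"; // base letter u + combining diaeresis

        System.out.println(codeUnits(precomposed)); // U+00FC
        System.out.println(codeUnits(decomposed));  // U+0075 U+0308
        System.out.println(utf8Hex(precomposed));   // C3 BC
        System.out.println(utf8Hex(decomposed));    // 75 CC 88

        // Both render identically, but they are different strings:
        System.out.println(precomposed.equals(decomposed)); // false
    }
}
```

Both spellings are valid Unicode and valid JSON, so neither the JSON encoding nor the parser is at fault here; the two strings simply don't compare equal until one of them is normalized.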

If you receive something like this, it's easily converted into precomposed form by using java.text.Normalizer (available since Java 1.6):

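import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Arrays;

// Decomposed form: base letter 'u' followed by U+0308 (combining diaeresis)
String decomposed = "Mitgefu\u0308hl";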
printChars(decomposed); // Mitgefühl -- [M, i, t, g, e, f, u, ̈, h, l]
String precomposed = Normalizer.normalize(decomposed, Form.NFC);
printChars(precomposed); // Mitgefühl -- [M, i, t, g, e, f, ü, h, l]

// Normalizing with NFC again doesn't hurt:
String precomposedAgain = Normalizer.normalize(precomposed, Form.NFC);
printChars(precomposedAgain); // Mitgefühl -- [M, i, t, g, e, f, ü, h, l]
...

static void printChars(String s) {
  System.out.println(s + " -- " + Arrays.toString(s.toCharArray()));
}

As you can see, applying NFC to an already precomposed string doesn't hurt.
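
Since NFC composes every base-plus-combining-mark sequence that has a precomposed equivalent, the same call also covers ö, ä, é and so on, without listing them by hand. Below is a small sketch of a helper that could be applied to each string field after parsing. The names NfcHelper and toNfc are assumptions for this example and not part of GSON or the Flickr API.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class NfcHelper {

    /**
     * Returns s in NFC. null stays null, and input that is already
     * normalized is returned unchanged (same instance).
     */
    static String toNfc(String s) {
        if (s == null || Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed sample strings: u, o, A with diaeresis and e with acute accent
        String[] samples = { "Mitgefu\u0308hl", "scho\u0308n", "A\u0308rger", "cafe\u0301" };
        for (String s : samples) {
            String nfc = toNfc(s);
            // Each sample gets one char shorter: base + mark becomes one character
            System.out.println(s.length() + " -> " + nfc.length());
        }
    }
}
```

The isNormalized check is only an optimization to skip strings that need no work; calling Normalizer.normalize unconditionally gives the same result.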

Note that printing the String looks correct on any Unicode-capable terminal; only when you print the character array do you see the difference between the decomposed and precomposed forms.

A possible source is Mac OS, which tends to encode things in decomposed form. It's curious that Flickr doesn't normalize this, though.

answered Sep 23 '22 05:09 by Philipp Reichart