While trying to process a JSON response with GSON (the output is from the Flickr API, in case you're wondering) I encountered what I'd describe as a pretty weird encoding of certain special chars:
Here's a hex view of it:
The 'u' followed by the 'double-dots' is what's supposed to be a German 'ü', and this is where my confusion starts. It's as if someone took the char and ripped it in half, encoding each of the 2 pieces. The following image shows the hex encoding of what I'd expect it to be in case the 'ü' was correctly encoded:
Even weirder, in cases where I would expect problems to occur (namely, the Asian character set) everything seems to work fine, e.g. "title": "ナガレテユク・・・"
Questions:
What you're seeing there is a case of Unicode decomposition:
Characters like German umlauts can be expressed in two ways:
as the single precomposed character ü
or as u followed by a combining diaeresis ̈_
(I had to use an underscore here to make it show up because it's not supposed to stand alone; it's really just the two "hovering dots")
If you receive something like this, it's easily converted into precomposed form by using java.text.Normalizer (available since Java 1.6):
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Arrays;

public class NormalizerDemo {
    public static void main(String[] args) {
        // 'u' followed by U+0308, the combining diaeresis
        String decomposed = "Mitgefu\u0308hl";
        printChars(decomposed); // Mitgefühl -- [M, i, t, g, e, f, u, ̈, h, l]
        String precomposed = Normalizer.normalize(decomposed, Form.NFC);
        printChars(precomposed); // Mitgefühl -- [M, i, t, g, e, f, ü, h, l]
        // Normalizing with NFC again doesn't hurt:
        String precomposedAgain = Normalizer.normalize(precomposed, Form.NFC);
        printChars(precomposedAgain); // Mitgefühl -- [M, i, t, g, e, f, ü, h, l]
    }

    static void printChars(String s) {
        System.out.println(s + " -- " + Arrays.toString(s.toCharArray()));
    }
}
As you can see, applying NFC to an already precomposed string doesn't hurt.
Note that printing the String will look correct on any Unicode-capable terminal; only if you print the character array do you see the difference between decomposed and precomposed form.
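The difference also shows up at the byte level, which is probably what the hex view in the question was displaying. Here is a minimal sketch, assuming the response is UTF-8 encoded; the class name UmlautForms and the helper utf8Hex are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class UmlautForms {

    // Renders the UTF-8 bytes of a string as space-separated hex pairs.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308";  // 'u' + COMBINING DIAERESIS (U+0308)
        String precomposed = "\u00fc";  // LATIN SMALL LETTER U WITH DIAERESIS

        System.out.println(utf8Hex(decomposed));   // 75 cc 88
        System.out.println(utf8Hex(precomposed));  // c3 bc

        // The two forms are different char sequences until normalized.
        System.out.println(decomposed.equals(precomposed)); // false
        String normalized = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(normalized.equals(precomposed)); // true
    }
}
```

So a plain 'u' byte (75) followed by cc 88 in a hex dump is the decomposed form, while c3 bc is the precomposed ü.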
A possible source might be macOS, which tends to encode things in decomposed form. It's curious that Flickr doesn't normalize this stuff, though.
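Since you can't rely on the sender to normalize, one option is to normalize every string on your side after parsing. A rough sketch, not tied to GSON in any way; TitleNormalizer and toNfc are hypothetical names, and the input string just stands in for a title taken from a response:

```java
import java.text.Normalizer;

public class TitleNormalizer {

    // Hypothetical helper: returns the NFC form of the input, leaving null
    // and already-normalized strings untouched.
    static String toNfc(String s) {
        if (s == null || Normalizer.isNormalized(s, Normalizer.Form.NFC)) {
            return s;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Stand-in for a decomposed title as it might arrive in a response.
        String fromApi = "Mitgefu\u0308hl";
        String title = toNfc(fromApi);

        System.out.println(fromApi.length());                // 10
        System.out.println(title.length());                  // 9
        System.out.println(title.equals("Mitgef\u00fchl"));  // true
    }
}
```

The isNormalized check is only there to skip the work for strings that are already in NFC; calling normalize unconditionally gives the same result.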