
Different UTF-8 signature for same diacritics (umlauts) - 2 binary ways to write umlauts

I have quite a big problem for which I can't find any help on the web:

I moved a page of a website from OSX to Linux (both systems running de_DE.UTF-8) and ran into a rather obscure problem: some of the files were no longer found, even though they obviously existed on the hard drive with (visibly) the same name. All of those files contained German umlauts.

I took one sample image, copied the original request URI from the webpage and called it directly - same error. After rewriting the filename it worked. And yes, I did not mistype it!

This surprised me, so I took a look into the Apache log, where I found these entries:

192.168.56.10 - - [27/Aug/2012:20:03:21 +0200] "GET /images/Sch%C3%B6ne-Lau-150x150.jpg HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
192.168.56.10 - - [27/Aug/2012:20:03:57 +0200] "GET /images/Scho%CC%88ne-Lau-150x150.jpg HTTP/1.1" 404 4205 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
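Decoding those two percent-encoded names makes the difference visible at the codepoint level. A quick sketch in Python (my own illustration, not part of the original log analysis):

```python
from urllib.parse import unquote

# The two percent-encoded names from the log entries above.
working = unquote("Sch%C3%B6ne-Lau-150x150.jpg")
broken  = unquote("Scho%CC%88ne-Lau-150x150.jpg")

print(working)            # Schöne-Lau-150x150.jpg
print(broken)             # Schöne-Lau-150x150.jpg (looks identical)
print(working == broken)  # False: different codepoint sequences

# The fourth position is where they diverge:
print(hex(ord(working[3])))              # 0xf6  (ö, one codepoint)
print([hex(ord(c)) for c in broken[3:5]])  # ['0x6f', '0x308']  (o + combining mark)
```

Both strings render identically in a terminal or log viewer, which is exactly why the mismatch was so hard to spot.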

That was something for me to investigate ... Here's what I found in the UTF-8 chartable http://www.utf8-chartable.de/:

ö   c3 b6   LATIN SMALL LETTER O WITH DIAERESIS
¨   cc 88   COMBINING DIAERESIS
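The byte sequences from the chartable can be reproduced directly; here is a small Python check (my own, for illustration):

```python
# 'ö' as a single precomposed codepoint vs. base letter + combining mark.
precomposed = "\u00f6"    # LATIN SMALL LETTER O WITH DIAERESIS
decomposed  = "o\u0308"   # 'o' + COMBINING DIAERESIS

print(precomposed.encode("utf-8").hex(" "))  # c3 b6
print(decomposed.encode("utf-8").hex(" "))   # 6f cc 88
```

These are exactly the two byte sequences seen in the Apache log above.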

I think you've already heard of dead keys: http://en.wikipedia.org/wiki/Dead_key If not, read the article - it's quite interesting ;)

Does that mean that OSX saves all diacritics separately from the letter? Does it really mean that OSX saves the character ö as o plus ¨ instead of the single character that results from the combination?

If yes, do you know of a good script that I could use to rename these files? This won't be the first page I move from OSX to Linux ...

SimonSimCity asked Aug 27 '12

3 Answers

It's not quite the same thing as dead keys, but it's related. As you've worked out, U+00F6 and U+006F followed by U+0308 have the same visual result.

There are in fact Unicode rules for treating them as the same, based on decompositions. There's a decomposition table in the Unicode character database that tells us that U+00F6 canonically decomposes to U+006F followed by U+0308.

As well as canonical decompositions, there are compatibility decompositions. These lose some information; for example, ² ends up being decomposed to 2. This is clearly a destructive change, but it is useful for searching when you want to be a bit fuzzy (how Google knows a search for fiſh should return results about fish).
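The lossiness of compatibility decomposition is easy to demonstrate; a small Python sketch (my own example):

```python
import unicodedata

# Compatibility decomposition is lossy: formatting distinctions disappear.
print(unicodedata.normalize("NFKD", "x\u00b2"))    # x2   ('²' becomes plain '2')
print(unicodedata.normalize("NFKD", "fi\u017fh"))  # fish ('ſ', the long s, becomes 's')

# Canonical decomposition leaves these characters alone:
print(unicodedata.normalize("NFD", "x\u00b2"))     # x²  (unchanged)
```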

If there is more than one combining character after a non-combining character, then we can re-order them, as long as we don't re-order those of the same combining class. This becomes clear when we consider that it doesn't matter whether we put a cedilla on something and then an acute accent, or the acute first and then the cedilla, but if we put both an acute and an umlaut on a letter it clearly matters which order they go in.
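That re-ordering rule can be seen in action (a Python sketch of my own; the cedilla has combining class 202, the acute and diaeresis both have class 230):

```python
import unicodedata

# Different classes: NFD sorts them into one canonical order ...
a = unicodedata.normalize("NFD", "c\u0301\u0327")  # acute, then cedilla
b = unicodedata.normalize("NFD", "c\u0327\u0301")  # cedilla, then acute
print(a == b)  # True: same canonical order either way

# ... but two marks of the same class must keep their relative order:
c = unicodedata.normalize("NFD", "a\u0301\u0308")  # acute, then diaeresis
d = unicodedata.normalize("NFD", "a\u0308\u0301")  # diaeresis, then acute
print(c == d)  # False: swapping them would change the visual result
```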

From this, we have 4 normalisation forms. Put strings into an appropriate normalisation form before doing comparisons, and you don't get tripped up.
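Applied to the filenames from the question, "normalize before comparing" looks like this (my own Python sketch):

```python
import unicodedata

linux_name = "Sch\u00f6ne-Lau.jpg"   # ö precomposed, as stored on Linux
osx_name   = "Scho\u0308ne-Lau.jpg"  # o + combining diaeresis, as sent by OSX

print(linux_name == osx_name)  # False: the raw strings differ

# Bring both to the same normalisation form and the comparison works:
print(linux_name == unicodedata.normalize("NFC", osx_name))  # True
```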

NFD: Break everything apart by canonically decomposing it as much as possible. Reorder combining characters in order of their combining class, but keep any with the same class in the same order relative to each other.

NFC: First put everything into NFD. Then walk through the combining characters in order: where a base character plus combining character has an equivalent single character, and no earlier combining character of the same class blocks it, replace them with that single character, then re-scan looking to compose further.

NFKD: Like NFD, but using compatibility decomposition (damaging change, but useful for comparisons as explained above).

NFKC: Do NFKD, then re-combine canonically as per NFC.
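The four forms side by side, on a string containing both a composable character and a compatibility character (a Python sketch of my own):

```python
import unicodedata

s = "\u00f6\u00b2"  # 'ö²'

for form in ("NFD", "NFC", "NFKD", "NFKC"):
    result = unicodedata.normalize(form, s)
    print(form, [hex(ord(ch)) for ch in result])

# NFD:  ['0x6f', '0x308', '0xb2']   ö decomposed, ² untouched
# NFC:  ['0xf6', '0xb2']            ö recomposed, ² untouched
# NFKD: ['0x6f', '0x308', '0x32']   ö decomposed, ² flattened to 2
# NFKC: ['0xf6', '0x32']            ö recomposed, ² flattened to 2
```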

There are also some re-combinations banned from use in NFC so that text that was valid NFC in one version of Unicode doesn't cease to be NFC if Unicode has more characters added to it.

Of NFD and NFC, NFC is clearly the more concise. It's not the most concise form possible, but it is very concise and can be tested for and/or created in a very efficient streaming manner.

Mac OSX uses NFD for file names. Because they're weirdos. (Okay, there are better arguments than that, they just didn't convince me!)

The Web Character Model uses NFC.* As such, you should use NFC on web stuff as much as possible. There can though be security considerations in blindly converting stuff to NFC. But if it starts from you, it should start in NFC.

Any programming language that deals with text should have a nice way of normalising text into any of these forms. If yours doesn't, complain (or, if yours is open source, contribute!).

See http://unicode.org/faq/normalization.html for more, or http://unicode.org/reports/tr15/ for the full gory details.

*For extra fun, if you inserted something beginning with a combining long solidus overlay (U+0338) at the start of an XML or HTML element's content, normalisation would turn the > of the preceding tag into ≯ (U+226F), turning well-formed XML into gibberish. For this reason the web character model insists that each entity must itself be NFC and must not start with a combining character.

Jon Hanna answered Sep 21 '22

Thanks, Jon Hanna, for all the background information! It was important for getting to the full answer: a way to convert from one normalisation form to the other.

Since the affected files live in the filesystem (they were uploaded) and are referenced in the database, I now also have to update my database dump. The files were already renamed during the move (maybe by the FTP client ...).

Command line tools to convert charsets on Linux are:

  • iconv - converting the content of a stream (maybe a file)
  • convmv - converting the filenames in a directory

The charset utf-8-mac (described in http://loopkid.net/articles/2011/03/19/groking-hfs-character-encoding), which I could use with iconv, seems to exist only on OSX systems, so I would have to move my SQL dump to my Mac, convert it there and move it back. Another option would be to use convmv to rename the files back to NFD, but I think that would hinder more than help in the future.

The tool convmv has a built-in (OS-independent) option for enforcing NFC- or NFD-compatible filenames: http://www.j3e.de/linux/convmv/man/
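For completeness, the rename the question asked about can also be sketched in a few lines of Python; convmv is the battle-tested tool for this, so treat the following (including the `rename_to_nfc` name) as my own illustrative sketch, not a drop-in replacement:

```python
import os
import unicodedata

def rename_to_nfc(root):
    """Rename every file and directory under root to the NFC form of its name.

    Walks bottom-up so that directories are renamed only after their
    contents have been handled.
    """
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            nfc = unicodedata.normalize("NFC", name)
            if nfc != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, nfc))
```

Note that on HFS+ itself this is a no-op: the filesystem converts names back to its decomposed form, which is exactly why the conversion has to happen on the Linux side.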

PHP itself (the language my system, WordPress, is based on) supports a compatibility layer here; see the question "In PHP, how do I deal with the difference in encoded filenames on HFS+ vs. elsewhere?". Once I've fixed this issue for myself, I'll write some tests and may also file bug reports with WordPress and the other systems I work with ;)

SimonSimCity answered Sep 20 '22

Linux distros treat filenames as binary strings, meaning no encoding is assumed - though the graphical shell (GNOME, KDE, etc.) might make some assumptions based on environment variables, locale, etc.

OSX, on the other hand, requires or enforces (I forget which) its own variant of UTF-8 with Unicode normalization applied, expanding all diacritics into combining characters.

On Linux, when people do use Unicode in filenames, they tend to prefer UTF-8 with precomposed characters for diacritics.

hippietrail answered Sep 20 '22