
How to avoid browsers' Unicode normalization when submitting a form with Unicode text

When rendering Unicode text in an HTML form, it turns out that the browser (Google Chrome) applies some form of Unicode normalization when posting the data back to the server (probably Normalization Form C, NFC).

With Biblical Hebrew text (בְּרִיךְ הוּא), this can easily break the text, as outlined here (page 9).

Is there any way to avoid the browser's automatic text normalization?

I wrote a blog post that describes the issue I'm facing in more detail: http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text

Fitzchak Yitzchaki asked Jun 24 '12 10:06


2 Answers

This seems to be a feature/bug in WebKit browsers (Chrome, Safari); they normalize form data to NFC, which means, among other things, reordering consecutive combining marks to a “canonical” order. This was new to me, and bad news in cases like this. The worst thing is that different browsers behave differently.

Using a simplified version of your test case http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text (using a server-side script that just echoes the raw data), I noticed that Chrome and Safari reorder the diacritic marks in U+05E9 U+05C1 U+05B5 (SHIN, SHIN DOT, TSERE), whereas IE, Firefox, and Opera do not.

I also ran a simple test with the Latin letter e followed by a combining diaeresis, U+0308. WebKit browsers convert it to the single character ë, as per NFC rules, whereas other browsers keep the character pair intact.
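Both effects can be reproduced outside a form with the standard `String.prototype.normalize` method. This is only a sketch of what NFC does to the two test strings, not of the browser's form-submission code; the `toHex` helper is a name made up for this example.

```javascript
// Hypothetical helper: prints a string as a list of code points.
const toHex = (s) =>
  Array.from(s, (c) =>
    "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");

// Written as escapes so that no editor can silently normalize the literals.
const hebrew = "\u05E9\u05C1\u05B5"; // SHIN, SHIN DOT, TSERE
const latin = "e\u0308";             // e + COMBINING DIAERESIS

// NFC puts TSERE (combining class 15) before SHIN DOT (combining class 24).
console.log(toHex(hebrew.normalize("NFC"))); // U+05E9 U+05B5 U+05C1

// NFC composes the pair into the precomposed letter.
console.log(toHex(latin.normalize("NFC")));  // U+00EB
```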

This seems to have been an intentional feature ever since 2006; https://bugs.webkit.org/show_bug.cgi?id=8769 proudly announces this as part of a bug fix! This might explain the status of the W3C policy document; its current version is WebKit-minded on this issue, but other browser vendors either aren’t interested or knowingly oppose the idea of “early normalization.”

I don’t think there is a way to prevent this. But you could warn users against using Chrome and Safari. You could even use a hidden field containing a simple problem case, then check server-side whether it was transmitted as-is, and tell the user to change browsers if it wasn’t.
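A minimal sketch of the server-side half of that check, written in JavaScript (e.g. for Node). The constant `CANARY`, the function `canaryIntact`, and the idea of a hidden field named `canary` are all assumptions for illustration; how the field value reaches this function depends on your server framework.

```javascript
// Hypothetical canary: SHIN, SHIN DOT, TSERE, a sequence that NFC reorders.
// The same string would be rendered into a hidden form field.
const CANARY = "\u05E9\u05C1\u05B5";

// Returns true when the hidden field came back exactly as it was sent.
// JavaScript's === compares code units, so no normalization is applied here.
function canaryIntact(receivedValue) {
  return receivedValue === CANARY;
}

console.log(canaryIntact(CANARY));                  // true
console.log(canaryIntact(CANARY.normalize("NFC"))); // false
```

If `canaryIntact` returns false, the rest of the submitted text has probably been normalized as well, and that is the point at which to show the warning.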

Fixing the order server-side isn’t simple, because common normalization routines apparently do not support the order needed. You could normalize to fully decomposed form (NFD), then reorder combining marks using your own code for the purpose. Perhaps simpler and safer, you could just run an ad hoc replacement routine that replaces sequences of combining marks with other sequences. This would be safer because it would not affect characters other than those you want to affect, whereas NFD decomposes Latin letters with diacritics, among other things.
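The ad hoc replacement idea might look roughly like this. The table `REORDERINGS` and the function `restoreMarkOrder` are hypothetical, and the single entry assumes that the order from the test case (SHIN DOT before TSERE) is the one you want; a real table would need an entry for every mark pair that matters in your texts.

```javascript
// Hypothetical table: [order produced by NFC, order the text should have].
const REORDERINGS = [
  ["\u05B5\u05C1", "\u05C1\u05B5"], // TSERE + SHIN DOT -> SHIN DOT + TSERE
];

// Replaces only the listed mark sequences and leaves everything else alone,
// so Latin letters with diacritics are not decomposed.
function restoreMarkOrder(text) {
  let result = text;
  for (const [normalized, wanted] of REORDERINGS) {
    result = result.split(normalized).join(wanted);
  }
  return result;
}

// A string as Chrome would submit it: SHIN, TSERE, SHIN DOT.
const submitted = "\u05E9\u05B5\u05C1";
console.log(restoreMarkOrder(submitted) === "\u05E9\u05C1\u05B5"); // true
```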

According to Unicode principles, canonically equivalent strings (e.g., differing only in the order of consecutive diacritic marks) are different representations of the same data but distinct as sequences of Unicode characters (code points); they are not expected to differ in presentation, but they may, and often do. Generally, you should not expect programs to treat canonically equivalent strings as different, though some programs do distinguish them. See the Unicode Normalization FAQ.

The FAQ entry claims that the problems of biblical Hebrew have been solved by the introduction of COMBINING GRAPHEME JOINER. Although it prevents the reordering in Chrome, it’s a clumsy method, and it may mess up rendering (it does in web browsers; diacritic marks may get badly misplaced).
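For what it's worth, the effect of the joiner on normalization itself can be checked with `String.prototype.normalize`. This sketch only shows that NFC leaves the sequence alone; it says nothing about rendering, which is where the problems mentioned above show up.

```javascript
const CGJ = "\u034F"; // COMBINING GRAPHEME JOINER, combining class 0

// SHIN, SHIN DOT, CGJ, TSERE: the joiner sits between the two marks.
const withCgj = "\u05E9\u05C1" + CGJ + "\u05B5";

// Canonical reordering cannot move a mark across a class-0 character,
// so NFC returns the string unchanged.
console.log(withCgj.normalize("NFC") === withCgj); // true
```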

Jukka K. Korpela answered Nov 12 '22 17:11


It is possible to avoid the string normalization by sending a Uint8Array rather than a string. First, get the UTF-8 data of your string as a Uint8Array as described here by @Moshev:

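// Returns the UTF-8 encoding of str as a Uint8Array.
function utf8AbFromStr(str) {
    // encodeURIComponent percent-encodes the UTF-8 bytes; unescape turns each
    // %XX back into one character, giving a string with one char per byte.
    var strUtf8 = unescape(encodeURIComponent(str));
    var ab = new Uint8Array(strUtf8.length);
    for (var i = 0; i < strUtf8.length; i++) {
        ab[i] = strUtf8.charCodeAt(i);
    }
    return ab;
}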

Then you can POST that Uint8Array with plain XHR or your favorite Ajax library. If you're using jQuery, keep in mind that you need to specify processData: false to prevent jQuery from trying to stringify it and undoing all of your hard work.
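As a rough sketch of the sending side, here is a variant that uses `fetch` and the standard `TextEncoder` instead of XHR and the helper above (a substitution on my part, not what the answer originally used). The function name `buildRawPost`, the URL `/save-text`, and the choice of `application/octet-stream` are assumptions; the server has to read the raw request body and decode it as UTF-8 itself.

```javascript
// Builds fetch() options whose body is the raw UTF-8 bytes of the text.
// TextEncoder encodes the string as-is and does not normalize it.
function buildRawPost(text) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/octet-stream" },
    body: new TextEncoder().encode(text), // Uint8Array
  };
}

// Usage in a browser (URL and element are placeholders):
// fetch("/save-text", buildRawPost(textarea.value));

// SHIN, SHIN DOT, TSERE stays in that order in the encoded bytes.
const options = buildRawPost("\u05E9\u05C1\u05B5");
console.log(Array.from(options.body).join(",")); // 215,169,215,129,214,181
```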

sethobrien answered Nov 12 '22 18:11