PHP: Unicode accentuated char and diacritics

Question

On our website, some Mac users have trouble when they copy-paste text from PDF files into a TextArea (handled by TinyMCE). All accented characters are corrupted and become, for example, e? for é, i? for î, etc. I cannot reproduce this problem on a Windows computer.

When I wrote the content of the TextArea to a file (before inserting it into the database), I discovered that the initial é is visually different from a traditional é (in Vim, see below).

Visual example of the problem

Indeed:

// the corrupted é (first line of the screenshot);
// $char holds the character as received from the TextArea
echo bin2hex($char); // displays 65cc81

// traditional é
echo bin2hex('é');   // displays c3a9

After a lot of searching, here is where I am: it seems that Mac OS copies accented Unicode characters as a combination of two characters, in our example e +  ́. So far, I haven't found any way to replace the corrupted é with the real one, to avoid e? in the database.

And I'm a little desperate.

deceze · Accepted Answer

The process of normalizing the representation to one form or the other is called, well, normalization. In PHP there's the Normalizer class for that. Sending all input through it is a good idea:

$input = Normalizer::normalize($input);

You likely want to normalize to form C, Canonical Decomposition followed by Canonical Composition.
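As a minimal sketch of what that looks like on the character from the question (assuming the intl extension, which provides Normalizer, is loaded; the variable names are only illustrative):

```php
<?php
// Decomposed input: "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
$input = "e\u{0301}";

// Form C is the default form of normalize(); naming it just makes the intent explicit.
$normalized = Normalizer::normalize($input, Normalizer::FORM_C);

echo bin2hex($input), "\n";      // 65cc81
echo bin2hex($normalized), "\n"; // c3a9
```

The `"\u{...}"` escape needs PHP 7.0 or later.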

Should that class not be available on your system, there's the Patchwork UTF-8 library.
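One way to keep the code working whether or not the class is there is a small guard. This is only a sketch: normalizeInput() is a hypothetical helper name, and the pass-through branch is a placeholder where a fallback library would be wired in.

```php
<?php
/**
 * Hypothetical helper: normalize to form C when the intl extension is available.
 */
function normalizeInput(string $input): string
{
    if (!class_exists('Normalizer')) {
        // Assumption: without intl the input is returned unchanged.
        // A userland fallback would be called here instead.
        return $input;
    }

    $result = Normalizer::normalize($input, Normalizer::FORM_C);

    // normalize() returns false on failure, for example on invalid UTF-8.
    return $result === false ? $input : $result;
}

echo bin2hex(normalizeInput("e\u{0301}")), "\n";
```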

hakre · Answer

This is just an addition to what @deceze already answered. There are multiple ways in Unicode to specify the same character (in the sense of equivalence).

You have a common example here:

65cc81

Those are two Unicode codepoints in UTF-8 encoding. 65 is e LATIN SMALL LETTER E (U+0065) and cc81 is  ́ COMBINING ACUTE ACCENT (U+0301) (it cannot be displayed alone by your browser, so I took the HTML entity).

In Unicode this is called a combining sequence. For some reason, however, your database does not support it, probably because the encoding of the column is not UTF-8 or the database connection has trouble with it.
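To check this kind of thing yourself, the bytes can be split into codepoints. A rough sketch, assuming the mbstring extension and PHP 7.4 or later; codePoints() is a made-up helper name:

```php
<?php
/**
 * Hypothetical helper: list the codepoints of a UTF-8 string as "U+XXXX".
 */
function codePoints(string $utf8): array
{
    $points = [];
    // mb_str_split() splits per codepoint, so a combining mark gets its own entry.
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        $points[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $points;
}

echo implode(' ', codePoints(hex2bin('65cc81'))), "\n"; // U+0065 U+0301
echo implode(' ', codePoints(hex2bin('c3a9'))), "\n";   // U+00E9
```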

It is canonically equivalent to

c3a9

That is a single Unicode codepoint in UTF-8 encoding. c3a9 is é LATIN SMALL LETTER E WITH ACUTE (U+00E9). It looks like your database has no problem dealing with it, probably because the database connection successfully re-encodes it to Latin-1 / ISO-8859-1.

So two ways of handling the data come to mind. It is either a problem in the re-encoding of the data or a problem storing the data.

As long as you're interested in composing the decomposed Unicode codepoint sequences, you should use the normalizer outlined in deceze's answer.

You can also allow UTF-8 to be stored in the database, and then you should not have a problem either.
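What that means in practice depends on the database, which the question doesn't name. Purely as an illustration, assuming MySQL accessed through PDO (host, database name and credentials are placeholders, and buildUtf8Dsn() is a hypothetical helper), the connection charset can be set in the DSN:

```php
<?php
/**
 * Hypothetical helper: build a PDO DSN for MySQL with a UTF-8 connection charset.
 */
function buildUtf8Dsn(string $host, string $dbName): string
{
    // utf8mb4 is MySQL's full UTF-8 charset; in the DSN it applies to the connection.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

$dsn = buildUtf8Dsn('localhost', 'example_db');
echo $dsn, "\n";

// Placeholder credentials, shown for context only:
// $pdo = new PDO($dsn, 'example_user', 'example_password');
```

The column (or table) itself would also need a UTF-8 charset for the combining sequence to survive the round trip.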

Additionally, you should probably normalize anyway so that sorting and comparing data in the database or in your program works better. As you can see, the binary sequences differ, which can cause problems for anything that compares on the binary level.
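A short sketch of that difference, again assuming the intl extension is loaded (variable names are illustrative):

```php
<?php
$decomposed  = "e\u{0301}"; // bytes 65 cc 81
$precomposed = "\u{00E9}";  // bytes c3 a9

// A plain string comparison works on bytes, so the two forms do not match.
var_dump($decomposed === $precomposed); // bool(false)

// isNormalized() reports whether a string is already in the given form.
var_dump(Normalizer::isNormalized($decomposed, Normalizer::FORM_C)); // bool(false)

// After normalization to form C the comparison succeeds.
$normalized = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($normalized === $precomposed); // bool(true)

// The composed form is also one byte shorter.
var_dump(strlen($decomposed), strlen($normalized)); // int(3) int(2)
```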

And sure, you save some traffic :)

Tags: php, encoding, unicode, tinymce

4wk_
