On our website, some Mac users have trouble when they copy-paste text from PDF files into a textarea (handled by TinyMCE). All accented characters are corrupted and become, for example, e? for é, i? for î, etc. I cannot reproduce this problem on a Windows computer.
When I wrote the content of the textarea to a file (before inserting it into the database), I discovered that the initial é is visually different from a regular é (in Vim, see below).
Indeed:
// the corrupted é - first line of the screenshot
echo bin2hex($char); // displays 65cc81
// regular é
echo bin2hex('é'); // displays c3a9
After a lot of searching, here is where I am:
It seems that Mac OS copies accented Unicode characters as a combination of two characters: in our example, e + ́. So far, I haven't found any solution to replace the corrupted é with the real one, to avoid e? in the database.
And I'm a little desperate.
The process of normalizing the representation to one form or the other is called, well, normalization. In PHP there's the Normalizer class for that; sending all input through it is a good idea:
$input = Normalizer::normalize($input);
You likely want to normalize to form C, Canonical Decomposition followed by Canonical Composition.
Should that class not be available on your system, there's the Patchwork UTF-8 library.
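As a minimal sketch of that advice, assuming the intl extension is loaded (it provides the Normalizer class), the decomposed bytes from the question can be recomposed as follows. The helper name normalizeInput is hypothetical, and returning the input unchanged when normalization fails is just one possible choice.

```php
<?php
// Hypothetical helper: normalize user input to form C (NFC) before storing it.
// Assumes the intl extension is available.
function normalizeInput(string $input): string
{
    $normalized = Normalizer::normalize($input, Normalizer::FORM_C);

    // normalize() returns false on failure (for example, invalid UTF-8);
    // in that case this sketch falls back to the original input.
    return $normalized === false ? $input : $normalized;
}

$decomposed = "e\u{0301}";            // e + COMBINING ACUTE ACCENT
$composed   = normalizeInput($decomposed);

echo bin2hex($decomposed), "\n";      // 65cc81
echo bin2hex($composed), "\n";        // c3a9
```

Calling such a helper on the TinyMCE content before the database insert would be the natural place for it.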
This is just an addition to what @deceze already answered. There are multiple ways in Unicode to specify the same (in the sense of equivalence) character.
You have a common example here:
65cc81
Those are two Unicode code points in UTF-8 encoding. 65 is e LATIN SMALL LETTER E (U+0065) and cc81 is ́ COMBINING ACUTE ACCENT (U+0301) (it cannot be displayed alone by your browser, so I took the HTML entity).
In Unicode this is called a combining sequence. For some reason, however, your database does not support it, probably because the encoding of the column is not UTF-8 or the database connection has trouble with it.
It is canonically equivalent to
c3a9
That is a single Unicode code point in UTF-8 encoding. c3a9 is é LATIN SMALL LETTER E WITH ACUTE (U+00E9). It looks like your database has no problem dealing with it, probably because the database connection successfully re-encodes it to Latin-1 / ISO-8859-1.
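To make the two byte sequences concrete, here is a small sketch that builds both representations from their code points. It assumes PHP 7 or later for the "\u{...}" escape syntax; the variable names are only illustrative.

```php
<?php
// Build both representations directly from their Unicode code points.
$decomposed = "\u{0065}\u{0301}"; // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
$composed   = "\u{00E9}";         // LATIN SMALL LETTER E WITH ACUTE

// bin2hex() shows the UTF-8 bytes, strlen() counts bytes (not characters).
echo bin2hex($decomposed), ' (', strlen($decomposed), " bytes)\n"; // 65cc81 (3 bytes)
echo bin2hex($composed), ' (', strlen($composed), " bytes)\n";     // c3a9 (2 bytes)
```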
So two ways of handling the data come to mind. It is either a problem in the re-encoding of the data or a problem storing the data.
If you want to compose the decomposed Unicode code point sequences into single code points, use the Normalizer outlined in deceze's answer.
You can also allow UTF-8 to be stored in the database, and then you should not have a problem either.
Additionally, you should probably normalize anyway so that sorting and comparing data in the database or in your program works better. As you can see, the binary sequences differ, which can cause problems for everything that compares on the binary level.
And sure, you save some traffic :)
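The comparison point can be illustrated with a short sketch, again assuming the intl extension is loaded. The sample word and variable names are made up for the example.

```php
<?php
// The same word, once as pasted from a Mac (decomposed) and once as typed (composed).
$pasted = "caf" . "e\u{0301}";
$typed  = "caf\u{00E9}";

// A byte-wise comparison treats them as different strings.
var_dump($pasted === $typed); // bool(false)

// After normalizing both sides to form C they compare equal.
$pastedNfc = Normalizer::normalize($pasted, Normalizer::FORM_C);
$typedNfc  = Normalizer::normalize($typed, Normalizer::FORM_C);
var_dump($pastedNfc === $typedNfc); // bool(true)

// The normalized form is also one byte shorter per accented character.
echo strlen($pasted), ' vs ', strlen($pastedNfc), "\n"; // 6 vs 5
```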