I'm currently trying to remove all special characters and accents from a UTF-8 string by turning them into their equivalent ASCII characters where possible.
So I'm simply using this code:
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
The problem is that, for example, the word "début" turns into "dbut" instead of "debut". To make it work, I need to add a call to setlocale, like this:
setlocale(LC_ALL, 'en_US.UTF8');
$result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
And I don't understand why. I thought UTF-8 and ASCII were always the same, whatever locale you use.
EDIT: I didn't mean UTF-8 equals ASCII, I meant UTF-8 always equals UTF-8 and ASCII always equals ASCII
The subset of UTF-8 that overlaps with ASCII (code points 0-127) is indeed identical to ASCII. However, accented Latin characters are not part of the ASCII character set, and if you don't call setlocale yourself, the system's default locale (which evidently does not contain these accented characters) is used to get a character set to work with. With no ASCII substitute available for "é", the //IGNORE flag simply drops it, which is how "début" ends up as "dbut".
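As a rough sketch of the workaround from the question, the locale can be switched only for the duration of the conversion and restored afterwards. This assumes that the underlying iconv reads its transliteration rules from the LC_CTYPE category (true of glibc, as far as I know, but not guaranteed elsewhere). The helper name toAscii and the list of candidate locale names are made up for illustration; which locales are actually installed depends on the system.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to ASCII under a temporary locale.
function toAscii($input, array $locales = ['en_US.UTF-8', 'en_US.UTF8', 'C.UTF-8'])
{
    // Passing '0' only queries the current LC_CTYPE setting; it changes nothing.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() tries each name in order and returns false if none is installed.
    $applied = setlocale(LC_CTYPE, $locales);

    // Same conversion as in the question; iconv() returns false on failure.
    $result = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);

    // Restore the caller's locale only if we actually changed it.
    if ($applied !== false && $previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $result;
}

var_dump(toAscii('début'));
```

Whether the accented characters come out as plain letters still depends on one of the listed locales being installed and on the iconv implementation in use, so the output is deliberately not shown here.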
In general, iconv can be a little iffy; this is mentioned in the introduction of the extension:
This module contains an interface to iconv character set conversion facility. With this module, you can turn a string represented by a local character set into the one represented by another character set, which may be the Unicode character set. Supported character sets depend on the iconv implementation of your system. Note that the iconv function on some systems may not work as you expect. In such case, it'd be a good idea to install the GNU libiconv library. It will most likely end up with more consistent results.
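Since the results depend on the iconv implementation, it can help to check which one PHP was built against. A minimal sketch, using the ICONV_IMPL and ICONV_VERSION constants that the iconv extension defines:

```php
<?php
// ICONV_IMPL and ICONV_VERSION are constants defined by PHP's iconv extension.
if (extension_loaded('iconv')) {
    echo 'iconv implementation: ', ICONV_IMPL, PHP_EOL;
    echo 'iconv version: ', ICONV_VERSION, PHP_EOL;
} else {
    echo 'The iconv extension is not loaded.', PHP_EOL;
}
```

If two machines give different transliteration results, comparing these values (together with the active locale) is a reasonable first thing to look at.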