What is the most efficient way to remove accents from a string e.g. ÈâuÑ
becomes Eaun
?
Is there a simple, built in way that I'm missing or a regular expression?
replace(/[^a-z0-9]/gi,'') . However a more intuitive solution (at least for the user) would be to replace accented characters with their "plain" equivalent, e.g. turn á , á into a , and ç into c , etc.
We can remove accents from the string by using a Python module called Unidecode. This module consists of a method that takes a Unicode object or string and returns a string without ascents.
If you have iconv installed, try this (the example assumes your input string is in UTF-8):
echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);
(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
I found a solution, that worked in all my test-cases (copied from http://php.net/manual/en/transliterator.transliterate.php):
var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove', "A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦")); // string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "
see: http://www.php.net/normalizer
EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is, that even non-latin characters are not ignored.
EDIT2: I discovered, that there are some characters, that are not covered by the transliteration I posted originally. Any-Latin
translates the cyrillic character ь
to a character, that doesn't fit into a latin character-set: ʹ
(http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove
to remove all these non-latin characters. I also added a test to the text ;)
I suggest, that they mean the latin alphabet and not one of the latin character-sets by Latin
here. But anyways - in my opinion, they should transliterate it to something ASCII then in Latin-ASCII
...
EDIT3: Sorry for another change here. I had to take the characters down to u0080 instead of u0100, to get only ASCII characters as output. The test above is updated.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With