How to remove accents and turn letters into "plain" ASCII characters? [duplicate]

What is the most efficient way to remove accents from a string, e.g. so that ÈâuÑ becomes EauN?

Is there a simple, built-in way that I'm missing, or a regular expression?

Mark Lalor asked Aug 22 '10 18:08



2 Answers

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

echo iconv('UTF-8', 'ASCII//TRANSLIT', $string); 

(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
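A minimal sketch of wrapping the call above in a reusable function, assuming the input is valid UTF-8. The name to_ascii_iconv is hypothetical, not something PHP provides. What //TRANSLIT produces for a given character depends on the iconv implementation and, on some systems, on the current locale, so no expected output is shown.

```php
<?php
// Hypothetical helper (the name is illustrative, not part of PHP).
// Assumes $string is valid UTF-8.
function to_ascii_iconv(string $string): ?string
{
    // //TRANSLIT asks iconv to approximate characters that have no
    // ASCII equivalent; @ silences the notice raised on invalid input.
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT', $string);

    // iconv() returns false on failure; map that to null for the caller.
    return $result === false ? null : $result;
}

$plain = to_ascii_iconv('ÈâuÑ');
echo $plain ?? '(conversion failed)', PHP_EOL;
```

Returning null instead of false lets callers use the ?? operator to supply a fallback.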

Piskvor left the building answered Oct 19 '22 21:10


I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php):

var_dump(transliterator_transliterate(
    'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
    "A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"
));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "

see: http://www.php.net/normalizer

EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is that even non-Latin characters are not ignored.

EDIT2: I discovered that some characters are not covered by the transliteration I originally posted. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to strip all these non-Latin characters. I also added a test case to the text ;)

I suppose that by "Latin" they mean the Latin alphabet here and not one of the Latin character sets. But anyway, in my opinion Latin-ASCII should then transliterate it to something in ASCII.

EDIT3: Sorry for another change here. I had to lower the start of the range from u0100 to u0080 to get only ASCII characters as output. The test above is updated.
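As a sketch, the final rule chain can be wrapped in a small function. The name to_ascii_translit is hypothetical, and the function_exists() guard is an addition of this sketch for installations that lack the intl extension, which transliterator_transliterate() requires.

```php
<?php
// Hypothetical helper (the name is illustrative, not part of PHP).
function to_ascii_translit(string $string): ?string
{
    // transliterator_transliterate() comes from the intl extension.
    if (!function_exists('transliterator_transliterate')) {
        return null;
    }

    // Same rule chain as in the answer: convert any script to Latin,
    // fold Latin letters to ASCII, then remove whatever is still in
    // the range U+0080..U+7FFF.
    $rules = 'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove';
    $result = transliterator_transliterate($rules, $string);

    // false signals a transliteration error; map it to null.
    return $result === false ? null : $result;
}

echo to_ascii_translit('ÈâuÑ') ?? '(transliteration unavailable)', PHP_EOL;
```

Note that the remove rule only covers code points up to U+7FFF, so anything above that range that survives the first two steps would still be present in the result.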

SimonSimCity answered Oct 19 '22 20:10