I am trying to replace accented characters with the normal replacements. Below is what I am currently doing.
$string = "Éric Cantona";
$strict = strtolower($string);
echo "After Lower: ".$strict;
$patterns[0] = '/[á|â|à|å|ä]/';
$patterns[1] = '/[ð|é|ê|è|ë]/';
$patterns[2] = '/[í|î|ì|ï]/';
$patterns[3] = '/[ó|ô|ò|ø|õ|ö]/';
$patterns[4] = '/[ú|û|ù|ü]/';
$patterns[5] = '/æ/';
$patterns[6] = '/ç/';
$patterns[7] = '/ß/';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';
$strict = preg_replace($patterns, $replacements, $strict);
echo "Final: ".$strict;
This gives me:
After Lower: éric cantona
Final: ric cantona
The above gives me ric cantona
I want the output to be eric cantona
.
can anyone help me with where I am going wrong?
$val = iconv('UTF-8','ASCII//TRANSLIT',$val); note that php has some weird bug in that it (sometimes?) needs to have a locale set to make these conversions work, using setlocale(). Note that ð is correctly transliterated to d instead of o , as in the accepted answer.
replace(/[^a-z0-9]/gi,'') . However a more intuitive solution (at least for the user) would be to replace accented characters with their "plain" equivalent, e.g. turn á , á into a , and ç into c , etc.
Using str_ireplace() Method: The str_ireplace() method is used to remove all the special characters from the given string str by replacing these characters with the white space (” “).
I have tried all sorts based on the variations listed in the answers, but the following worked:
$unwanted_array = array( 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );
To remove the diacritics, use iconv:
$val = iconv('ISO-8859-1','ASCII//TRANSLIT',$val);
or
$val = iconv('UTF-8','ASCII//TRANSLIT',$val);
note that php has some weird bug in that it (sometimes?) needs to have a locale set to make these conversions work, using setlocale().
edit tested, it gets all of your diacritics out of the box:
$val = "á|â|à|å|ä ð|é|ê|è|ë í|î|ì|ï ó|ô|ò|ø|õ|ö ú|û|ù|ü æ ç ß abc ABC 123";
echo iconv('UTF-8','ASCII//TRANSLIT',$val);
output (updated 2019-12-30)
a|a|a|a|a d|e|e|e|e i|i|i|i o|o|o|o|o|o u|u|u|u ae c ss abc ABC 123
Note that ð
is correctly transliterated to d
instead of o
, as in the accepted answer.
I just came accross the answer from Lizard which is extremely helpful - especially when you do some sorting. Isn't is beautiful how many chars we need to say mostly the same ;)
If anyone else if looking for a all-in solution (as far as the comments above tell), here's the copy&paste:
/**
* Replace language-specific characters by ASCII-equivalents.
* @param string $s
* @return string
*/
public static function normalizeChars($s) {
$replace = array(
'ъ'=>'-', 'Ь'=>'-', 'Ъ'=>'-', 'ь'=>'-',
'Ă'=>'A', 'Ą'=>'A', 'À'=>'A', 'Ã'=>'A', 'Á'=>'A', 'Æ'=>'A', 'Â'=>'A', 'Å'=>'A', 'Ä'=>'Ae',
'Þ'=>'B',
'Ć'=>'C', 'ץ'=>'C', 'Ç'=>'C',
'È'=>'E', 'Ę'=>'E', 'É'=>'E', 'Ë'=>'E', 'Ê'=>'E',
'Ğ'=>'G',
'İ'=>'I', 'Ï'=>'I', 'Î'=>'I', 'Í'=>'I', 'Ì'=>'I',
'Ł'=>'L',
'Ñ'=>'N', 'Ń'=>'N',
'Ø'=>'O', 'Ó'=>'O', 'Ò'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'Oe',
'Ş'=>'S', 'Ś'=>'S', 'Ș'=>'S', 'Š'=>'S',
'Ț'=>'T',
'Ù'=>'U', 'Û'=>'U', 'Ú'=>'U', 'Ü'=>'Ue',
'Ý'=>'Y',
'Ź'=>'Z', 'Ž'=>'Z', 'Ż'=>'Z',
'â'=>'a', 'ǎ'=>'a', 'ą'=>'a', 'á'=>'a', 'ă'=>'a', 'ã'=>'a', 'Ǎ'=>'a', 'а'=>'a', 'А'=>'a', 'å'=>'a', 'à'=>'a', 'א'=>'a', 'Ǻ'=>'a', 'Ā'=>'a', 'ǻ'=>'a', 'ā'=>'a', 'ä'=>'ae', 'æ'=>'ae', 'Ǽ'=>'ae', 'ǽ'=>'ae',
'б'=>'b', 'ב'=>'b', 'Б'=>'b', 'þ'=>'b',
'ĉ'=>'c', 'Ĉ'=>'c', 'Ċ'=>'c', 'ć'=>'c', 'ç'=>'c', 'ц'=>'c', 'צ'=>'c', 'ċ'=>'c', 'Ц'=>'c', 'Č'=>'c', 'č'=>'c', 'Ч'=>'ch', 'ч'=>'ch',
'ד'=>'d', 'ď'=>'d', 'Đ'=>'d', 'Ď'=>'d', 'đ'=>'d', 'д'=>'d', 'Д'=>'D', 'ð'=>'d',
'є'=>'e', 'ע'=>'e', 'е'=>'e', 'Е'=>'e', 'Ə'=>'e', 'ę'=>'e', 'ĕ'=>'e', 'ē'=>'e', 'Ē'=>'e', 'Ė'=>'e', 'ė'=>'e', 'ě'=>'e', 'Ě'=>'e', 'Є'=>'e', 'Ĕ'=>'e', 'ê'=>'e', 'ə'=>'e', 'è'=>'e', 'ë'=>'e', 'é'=>'e',
'ф'=>'f', 'ƒ'=>'f', 'Ф'=>'f',
'ġ'=>'g', 'Ģ'=>'g', 'Ġ'=>'g', 'Ĝ'=>'g', 'Г'=>'g', 'г'=>'g', 'ĝ'=>'g', 'ğ'=>'g', 'ג'=>'g', 'Ґ'=>'g', 'ґ'=>'g', 'ģ'=>'g',
'ח'=>'h', 'ħ'=>'h', 'Х'=>'h', 'Ħ'=>'h', 'Ĥ'=>'h', 'ĥ'=>'h', 'х'=>'h', 'ה'=>'h',
'î'=>'i', 'ï'=>'i', 'í'=>'i', 'ì'=>'i', 'į'=>'i', 'ĭ'=>'i', 'ı'=>'i', 'Ĭ'=>'i', 'И'=>'i', 'ĩ'=>'i', 'ǐ'=>'i', 'Ĩ'=>'i', 'Ǐ'=>'i', 'и'=>'i', 'Į'=>'i', 'י'=>'i', 'Ї'=>'i', 'Ī'=>'i', 'І'=>'i', 'ї'=>'i', 'і'=>'i', 'ī'=>'i', 'ij'=>'ij', 'IJ'=>'ij',
'й'=>'j', 'Й'=>'j', 'Ĵ'=>'j', 'ĵ'=>'j', 'я'=>'ja', 'Я'=>'ja', 'Э'=>'je', 'э'=>'je', 'ё'=>'jo', 'Ё'=>'jo', 'ю'=>'ju', 'Ю'=>'ju',
'ĸ'=>'k', 'כ'=>'k', 'Ķ'=>'k', 'К'=>'k', 'к'=>'k', 'ķ'=>'k', 'ך'=>'k',
'Ŀ'=>'l', 'ŀ'=>'l', 'Л'=>'l', 'ł'=>'l', 'ļ'=>'l', 'ĺ'=>'l', 'Ĺ'=>'l', 'Ļ'=>'l', 'л'=>'l', 'Ľ'=>'l', 'ľ'=>'l', 'ל'=>'l',
'מ'=>'m', 'М'=>'m', 'ם'=>'m', 'м'=>'m',
'ñ'=>'n', 'н'=>'n', 'Ņ'=>'n', 'ן'=>'n', 'ŋ'=>'n', 'נ'=>'n', 'Н'=>'n', 'ń'=>'n', 'Ŋ'=>'n', 'ņ'=>'n', 'ʼn'=>'n', 'Ň'=>'n', 'ň'=>'n',
'о'=>'o', 'О'=>'o', 'ő'=>'o', 'õ'=>'o', 'ô'=>'o', 'Ő'=>'o', 'ŏ'=>'o', 'Ŏ'=>'o', 'Ō'=>'o', 'ō'=>'o', 'ø'=>'o', 'ǿ'=>'o', 'ǒ'=>'o', 'ò'=>'o', 'Ǿ'=>'o', 'Ǒ'=>'o', 'ơ'=>'o', 'ó'=>'o', 'Ơ'=>'o', 'œ'=>'oe', 'Œ'=>'oe', 'ö'=>'oe',
'פ'=>'p', 'ף'=>'p', 'п'=>'p', 'П'=>'p',
'ק'=>'q',
'ŕ'=>'r', 'ř'=>'r', 'Ř'=>'r', 'ŗ'=>'r', 'Ŗ'=>'r', 'ר'=>'r', 'Ŕ'=>'r', 'Р'=>'r', 'р'=>'r',
'ș'=>'s', 'с'=>'s', 'Ŝ'=>'s', 'š'=>'s', 'ś'=>'s', 'ס'=>'s', 'ş'=>'s', 'С'=>'s', 'ŝ'=>'s', 'Щ'=>'sch', 'щ'=>'sch', 'ш'=>'sh', 'Ш'=>'sh', 'ß'=>'ss',
'т'=>'t', 'ט'=>'t', 'ŧ'=>'t', 'ת'=>'t', 'ť'=>'t', 'ţ'=>'t', 'Ţ'=>'t', 'Т'=>'t', 'ț'=>'t', 'Ŧ'=>'t', 'Ť'=>'t', '™'=>'tm',
'ū'=>'u', 'у'=>'u', 'Ũ'=>'u', 'ũ'=>'u', 'Ư'=>'u', 'ư'=>'u', 'Ū'=>'u', 'Ǔ'=>'u', 'ų'=>'u', 'Ų'=>'u', 'ŭ'=>'u', 'Ŭ'=>'u', 'Ů'=>'u', 'ů'=>'u', 'ű'=>'u', 'Ű'=>'u', 'Ǖ'=>'u', 'ǔ'=>'u', 'Ǜ'=>'u', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'У'=>'u', 'ǚ'=>'u', 'ǜ'=>'u', 'Ǚ'=>'u', 'Ǘ'=>'u', 'ǖ'=>'u', 'ǘ'=>'u', 'ü'=>'ue',
'в'=>'v', 'ו'=>'v', 'В'=>'v',
'ש'=>'w', 'ŵ'=>'w', 'Ŵ'=>'w',
'ы'=>'y', 'ŷ'=>'y', 'ý'=>'y', 'ÿ'=>'y', 'Ÿ'=>'y', 'Ŷ'=>'y',
'Ы'=>'y', 'ž'=>'z', 'З'=>'z', 'з'=>'z', 'ź'=>'z', 'ז'=>'z', 'ż'=>'z', 'ſ'=>'z', 'Ж'=>'zh', 'ж'=>'zh'
);
return strtr($s, $replace);
}
Note some slight changes regarding the German umlauts (ä => ae)
Edit: Included more characters based on the posting from user3682119 (except for the copyright symbol) and the comment from daker.
In PHP 5.4 the intl
extension provides a new class named Transliterator.
I believe that's the best way to remove diacritics for two reasons:
Transliterator is based on ICU, so you're using the tables of the ICU library. ICU is a great project, developed over the year to provide comprehensive tables and functionalities. Whatever table you want to write yourself, it will never be as complete as the one from ICU.
In UTF-8, characters could be represented differently. For example, the character ñ could be saved as a single (multi-byte) character, or as the combination of characters ˜
(multibyte) and n
. In addition to this, some characters in Unicode are homograph: they look the same while having different codepoints. For this reason it's also important to normalize the string.
Here's a sample code, taken from an old answer of mine:
<?php
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;', Transliterator::FORWARD);
$test = ['abcd', 'èe', '€', 'àòùìéëü', 'àòùìéëü', 'tiësto'];
foreach($test as $e) {
$normalized = $transliterator->transliterate($e);
echo $e. ' --> '.$normalized."\n";
}
?>
Result:
abcd --> abcd
èe --> ee
€ --> €
àòùìéëü --> aouieeu
àòùìéëü --> aouieeu
tiësto --> tiesto
The first argument for the Transliterator class performs the removal of diacritics as well as the normalization of the string.
An updated answer based on @BurninLeo's answer
function replace_spec_char($subject) {
$char_map = array(
"ъ" => "-", "ь" => "-", "Ъ" => "-", "Ь" => "-",
"А" => "A", "Ă" => "A", "Ǎ" => "A", "Ą" => "A", "À" => "A", "Ã" => "A", "Á" => "A", "Æ" => "A", "Â" => "A", "Å" => "A", "Ǻ" => "A", "Ā" => "A", "א" => "A",
"Б" => "B", "ב" => "B", "Þ" => "B",
"Ĉ" => "C", "Ć" => "C", "Ç" => "C", "Ц" => "C", "צ" => "C", "Ċ" => "C", "Č" => "C", "©" => "C", "ץ" => "C",
"Д" => "D", "Ď" => "D", "Đ" => "D", "ד" => "D", "Ð" => "D",
"È" => "E", "Ę" => "E", "É" => "E", "Ë" => "E", "Ê" => "E", "Е" => "E", "Ē" => "E", "Ė" => "E", "Ě" => "E", "Ĕ" => "E", "Є" => "E", "Ə" => "E", "ע" => "E",
"Ф" => "F", "Ƒ" => "F",
"Ğ" => "G", "Ġ" => "G", "Ģ" => "G", "Ĝ" => "G", "Г" => "G", "ג" => "G", "Ґ" => "G",
"ח" => "H", "Ħ" => "H", "Х" => "H", "Ĥ" => "H", "ה" => "H",
"I" => "I", "Ï" => "I", "Î" => "I", "Í" => "I", "Ì" => "I", "Į" => "I", "Ĭ" => "I", "I" => "I", "И" => "I", "Ĩ" => "I", "Ǐ" => "I", "י" => "I", "Ї" => "I", "Ī" => "I", "І" => "I",
"Й" => "J", "Ĵ" => "J",
"ĸ" => "K", "כ" => "K", "Ķ" => "K", "К" => "K", "ך" => "K",
"Ł" => "L", "Ŀ" => "L", "Л" => "L", "Ļ" => "L", "Ĺ" => "L", "Ľ" => "L", "ל" => "L",
"מ" => "M", "М" => "M", "ם" => "M",
"Ñ" => "N", "Ń" => "N", "Н" => "N", "Ņ" => "N", "ן" => "N", "Ŋ" => "N", "נ" => "N", "ʼn" => "N", "Ň" => "N",
"Ø" => "O", "Ó" => "O", "Ò" => "O", "Ô" => "O", "Õ" => "O", "О" => "O", "Ő" => "O", "Ŏ" => "O", "Ō" => "O", "Ǿ" => "O", "Ǒ" => "O", "Ơ" => "O",
"פ" => "P", "ף" => "P", "П" => "P",
"ק" => "Q",
"Ŕ" => "R", "Ř" => "R", "Ŗ" => "R", "ר" => "R", "Р" => "R", "®" => "R",
"Ş" => "S", "Ś" => "S", "Ș" => "S", "Š" => "S", "С" => "S", "Ŝ" => "S", "ס" => "S",
"Т" => "T", "Ț" => "T", "ט" => "T", "Ŧ" => "T", "ת" => "T", "Ť" => "T", "Ţ" => "T",
"Ù" => "U", "Û" => "U", "Ú" => "U", "Ū" => "U", "У" => "U", "Ũ" => "U", "Ư" => "U", "Ǔ" => "U", "Ų" => "U", "Ŭ" => "U", "Ů" => "U", "Ű" => "U", "Ǖ" => "U", "Ǜ" => "U", "Ǚ" => "U", "Ǘ" => "U",
"В" => "V", "ו" => "V",
"Ý" => "Y", "Ы" => "Y", "Ŷ" => "Y", "Ÿ" => "Y",
"Ź" => "Z", "Ž" => "Z", "Ż" => "Z", "З" => "Z", "ז" => "Z",
"а" => "a", "ă" => "a", "ǎ" => "a", "ą" => "a", "à" => "a", "ã" => "a", "á" => "a", "æ" => "a", "â" => "a", "å" => "a", "ǻ" => "a", "ā" => "a", "א" => "a",
"б" => "b", "ב" => "b", "þ" => "b",
"ĉ" => "c", "ć" => "c", "ç" => "c", "ц" => "c", "צ" => "c", "ċ" => "c", "č" => "c", "©" => "c", "ץ" => "c",
"Ч" => "ch", "ч" => "ch",
"д" => "d", "ď" => "d", "đ" => "d", "ד" => "d", "ð" => "d",
"è" => "e", "ę" => "e", "é" => "e", "ë" => "e", "ê" => "e", "е" => "e", "ē" => "e", "ė" => "e", "ě" => "e", "ĕ" => "e", "є" => "e", "ə" => "e", "ע" => "e",
"ф" => "f", "ƒ" => "f",
"ğ" => "g", "ġ" => "g", "ģ" => "g", "ĝ" => "g", "г" => "g", "ג" => "g", "ґ" => "g",
"ח" => "h", "ħ" => "h", "х" => "h", "ĥ" => "h", "ה" => "h",
"i" => "i", "ï" => "i", "î" => "i", "í" => "i", "ì" => "i", "į" => "i", "ĭ" => "i", "ı" => "i", "и" => "i", "ĩ" => "i", "ǐ" => "i", "י" => "i", "ї" => "i", "ī" => "i", "і" => "i",
"й" => "j", "Й" => "j", "Ĵ" => "j", "ĵ" => "j",
"ĸ" => "k", "כ" => "k", "ķ" => "k", "к" => "k", "ך" => "k",
"ł" => "l", "ŀ" => "l", "л" => "l", "ļ" => "l", "ĺ" => "l", "ľ" => "l", "ל" => "l",
"מ" => "m", "м" => "m", "ם" => "m",
"ñ" => "n", "ń" => "n", "н" => "n", "ņ" => "n", "ן" => "n", "ŋ" => "n", "נ" => "n", "ʼn" => "n", "ň" => "n",
"ø" => "o", "ó" => "o", "ò" => "o", "ô" => "o", "õ" => "o", "о" => "o", "ő" => "o", "ŏ" => "o", "ō" => "o", "ǿ" => "o", "ǒ" => "o", "ơ" => "o",
"פ" => "p", "ף" => "p", "п" => "p",
"ק" => "q",
"ŕ" => "r", "ř" => "r", "ŗ" => "r", "ר" => "r", "р" => "r", "®" => "r",
"ş" => "s", "ś" => "s", "ș" => "s", "š" => "s", "с" => "s", "ŝ" => "s", "ס" => "s",
"т" => "t", "ț" => "t", "ט" => "t", "ŧ" => "t", "ת" => "t", "ť" => "t", "ţ" => "t",
"ù" => "u", "û" => "u", "ú" => "u", "ū" => "u", "у" => "u", "ũ" => "u", "ư" => "u", "ǔ" => "u", "ų" => "u", "ŭ" => "u", "ů" => "u", "ű" => "u", "ǖ" => "u", "ǜ" => "u", "ǚ" => "u", "ǘ" => "u",
"в" => "v", "ו" => "v",
"ý" => "y", "ы" => "y", "ŷ" => "y", "ÿ" => "y",
"ź" => "z", "ž" => "z", "ż" => "z", "з" => "z", "ז" => "z", "ſ" => "z",
"™" => "tm",
"@" => "at",
"Ä" => "ae", "Ǽ" => "ae", "ä" => "ae", "æ" => "ae", "ǽ" => "ae",
"ij" => "ij", "IJ" => "ij",
"я" => "ja", "Я" => "ja",
"Э" => "je", "э" => "je",
"ё" => "jo", "Ё" => "jo",
"ю" => "ju", "Ю" => "ju",
"œ" => "oe", "Œ" => "oe", "ö" => "oe", "Ö" => "oe",
"щ" => "sch", "Щ" => "sch",
"ш" => "sh", "Ш" => "sh",
"ß" => "ss",
"Ü" => "ue",
"Ж" => "zh", "ж" => "zh",
);
return strtr($subject, $char_map);
}
$string = "Ħí ŧħə®ë, юßť å test!";
echo replace_spec_char($string);
Ħí ŧħə®ë, юßť å test!
=>
Hi there, jusst a test!
This does not mix up upper and lower case chars except for longer chars (eg: ss,ch, sch) , added @ ® ©
Also if you want to build regex matching regardless to special chars :
rss => '[rŕřŘŗŖרŔРр](?:[sșсŜšśסşСŝ][sșсŜšśסşСŝ]|[ß])'
A vala implementation of this : https://code.launchpad.net/~jeremy-munsch/synapse-project/ascii-smart/+merge/277477
Here is the base list you could work with, with regex replacing (in sublime text) or small script you can build anything from this array to fill your needs.
"-" => "ъьЪЬ",
"A" => "АĂǍĄÀÃÁÆÂÅǺĀא",
"B" => "БבÞ",
"C" => "ĈĆÇЦצĊČ©ץ",
"D" => "ДĎĐדÐ",
"E" => "ÈĘÉËÊЕĒĖĚĔЄƏע",
"F" => "ФƑ",
"G" => "ĞĠĢĜГגҐ",
"H" => "חĦХĤה",
"I" => "IÏÎÍÌĮĬIИĨǏיЇĪІ",
"J" => "ЙĴ",
"K" => "ĸכĶКך",
"L" => "ŁĿЛĻĹĽל",
"M" => "מМם",
"N" => "ÑŃНŅןŊנʼnŇ",
"O" => "ØÓÒÔÕОŐŎŌǾǑƠ",
"P" => "פףП",
"Q" => "ק",
"R" => "ŔŘŖרР®",
"S" => "ŞŚȘŠСŜס",
"T" => "ТȚטŦתŤŢ",
"U" => "ÙÛÚŪУŨƯǓŲŬŮŰǕǛǙǗ",
"V" => "Вו",
"Y" => "ÝЫŶŸ",
"Z" => "ŹŽŻЗז",
"a" => "аăǎąàãáæâåǻāא",
"b" => "бבþ",
"c" => "ĉćçцצċč©ץ",
"ch" => "ч",
"d" => "дďđדð",
"e" => "èęéëêеēėěĕєəע",
"f" => "фƒ",
"g" => "ğġģĝгגґ",
"h" => "חħхĥה",
"i" => "iïîíìįĭıиĩǐיїīі",
"j" => "йĵ",
"k" => "ĸכķкך",
"l" => "łŀлļĺľל",
"m" => "מмם",
"n" => "ñńнņןŋנʼnň",
"o" => "øóòôõоőŏōǿǒơ",
"p" => "פףп",
"q" => "ק",
"r" => "ŕřŗרр®",
"s" => "şśșšсŝס",
"t" => "тțטŧתťţ",
"u" => "ùûúūуũưǔųŭůűǖǜǚǘ",
"v" => "вו",
"y" => "ýыŷÿ",
"z" => "źžżзזſ",
"tm" => "™",
"at" => "@",
"ae" => "ÄǼäæǽ",
"ch" => "Чч",
"ij" => "ijIJ",
"j" => "йЙĴĵ",
"ja" => "яЯ",
"je" => "Ээ",
"jo" => "ёЁ",
"ju" => "юЮ",
"oe" => "œŒöÖ",
"sch" => "щЩ",
"sh" => "шШ",
"ss" => "ß",
"tm" => "™",
"ue" => "Ü",
"zh" => "Жж"
I found this way to be a good one, without having to worry too much about charsets and arrays, or iconv:
function replace_accents($str) {
$str = htmlentities($str, ENT_COMPAT, "UTF-8");
$str = preg_replace('/&([a-zA-Z])(uml|acute|grave|circ|tilde|ring);/','$1',$str);
return html_entity_decode($str);
}
So I found this on php.net page for preg_replace function
// replace accented chars
$string = "Zacarías Ferreíra"; // my definition for string variable
$accents = '/&([A-Za-z]{1,2})(grave|acute|circ|cedil|uml|lig);/';
$string_encoded = htmlentities($string,ENT_NOQUOTES,'UTF-8');
$string = preg_replace($accents,'$1',$string_encoded);
If you have encoding issues you may get someting like this "ZacarÃÂas FerreÃÂra", just decode the string and use said code above
$string = utf8_decode("ZacarÃÂas FerreÃÂra");
This worked for me:
<?php
setlocale(LC_ALL, "en_US.utf8");
$val = iconv('UTF-8','ASCII//TRANSLIT',$val);
?>
if you have http://php.net/manual/en/book.intl.php available, this will solve your problem:
$string = "Éric Cantona";
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);
To install the php extension in ubuntu:
apt-get install php-intl
Don't forget, in composer, to require the extension ext-intl
to ensure it properly fits into deployed systems.
protected $_convertTable = array(
'&' => 'and', '@' => 'at', '©' => 'c', '®' => 'r', 'À' => 'a',
'Á' => 'a', 'Â' => 'a', 'Ä' => 'a', 'Å' => 'a', 'Æ' => 'ae','Ç' => 'c',
'È' => 'e', 'É' => 'e', 'Ë' => 'e', 'Ì' => 'i', 'Í' => 'i', 'Î' => 'i',
'Ï' => 'i', 'Ò' => 'o', 'Ó' => 'o', 'Ô' => 'o', 'Õ' => 'o', 'Ö' => 'o',
'Ø' => 'o', 'Ù' => 'u', 'Ú' => 'u', 'Û' => 'u', 'Ü' => 'u', 'Ý' => 'y',
'ß' => 'ss','à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'å' => 'a',
'æ' => 'ae','ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ò' => 'o', 'ó' => 'o',
'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'ù' => 'u', 'ú' => 'u',
'û' => 'u', 'ü' => 'u', 'ý' => 'y', 'þ' => 'p', 'ÿ' => 'y', 'Ā' => 'a',
'ā' => 'a', 'Ă' => 'a', 'ă' => 'a', 'Ą' => 'a', 'ą' => 'a', 'Ć' => 'c',
'ć' => 'c', 'Ĉ' => 'c', 'ĉ' => 'c', 'Ċ' => 'c', 'ċ' => 'c', 'Č' => 'c',
'č' => 'c', 'Ď' => 'd', 'ď' => 'd', 'Đ' => 'd', 'đ' => 'd', 'Ē' => 'e',
'ē' => 'e', 'Ĕ' => 'e', 'ĕ' => 'e', 'Ė' => 'e', 'ė' => 'e', 'Ę' => 'e',
'ę' => 'e', 'Ě' => 'e', 'ě' => 'e', 'Ĝ' => 'g', 'ĝ' => 'g', 'Ğ' => 'g',
'ğ' => 'g', 'Ġ' => 'g', 'ġ' => 'g', 'Ģ' => 'g', 'ģ' => 'g', 'Ĥ' => 'h',
'ĥ' => 'h', 'Ħ' => 'h', 'ħ' => 'h', 'Ĩ' => 'i', 'ĩ' => 'i', 'Ī' => 'i',
'ī' => 'i', 'Ĭ' => 'i', 'ĭ' => 'i', 'Į' => 'i', 'į' => 'i', 'İ' => 'i',
'ı' => 'i', 'IJ' => 'ij','ij' => 'ij','Ĵ' => 'j', 'ĵ' => 'j', 'Ķ' => 'k',
'ķ' => 'k', 'ĸ' => 'k', 'Ĺ' => 'l', 'ĺ' => 'l', 'Ļ' => 'l', 'ļ' => 'l',
'Ľ' => 'l', 'ľ' => 'l', 'Ŀ' => 'l', 'ŀ' => 'l', 'Ł' => 'l', 'ł' => 'l',
'Ń' => 'n', 'ń' => 'n', 'Ņ' => 'n', 'ņ' => 'n', 'Ň' => 'n', 'ň' => 'n',
'ʼn' => 'n', 'Ŋ' => 'n', 'ŋ' => 'n', 'Ō' => 'o', 'ō' => 'o', 'Ŏ' => 'o',
'ŏ' => 'o', 'Ő' => 'o', 'ő' => 'o', 'Œ' => 'oe','œ' => 'oe','Ŕ' => 'r',
'ŕ' => 'r', 'Ŗ' => 'r', 'ŗ' => 'r', 'Ř' => 'r', 'ř' => 'r', 'Ś' => 's',
'ś' => 's', 'Ŝ' => 's', 'ŝ' => 's', 'Ş' => 's', 'ş' => 's', 'Š' => 's',
'š' => 's', 'Ţ' => 't', 'ţ' => 't', 'Ť' => 't', 'ť' => 't', 'Ŧ' => 't',
'ŧ' => 't', 'Ũ' => 'u', 'ũ' => 'u', 'Ū' => 'u', 'ū' => 'u', 'Ŭ' => 'u',
'ŭ' => 'u', 'Ů' => 'u', 'ů' => 'u', 'Ű' => 'u', 'ű' => 'u', 'Ų' => 'u',
'ų' => 'u', 'Ŵ' => 'w', 'ŵ' => 'w', 'Ŷ' => 'y', 'ŷ' => 'y', 'Ÿ' => 'y',
'Ź' => 'z', 'ź' => 'z', 'Ż' => 'z', 'ż' => 'z', 'Ž' => 'z', 'ž' => 'z',
'ſ' => 'z', 'Ə' => 'e', 'ƒ' => 'f', 'Ơ' => 'o', 'ơ' => 'o', 'Ư' => 'u',
'ư' => 'u', 'Ǎ' => 'a', 'ǎ' => 'a', 'Ǐ' => 'i', 'ǐ' => 'i', 'Ǒ' => 'o',
'ǒ' => 'o', 'Ǔ' => 'u', 'ǔ' => 'u', 'Ǖ' => 'u', 'ǖ' => 'u', 'Ǘ' => 'u',
'ǘ' => 'u', 'Ǚ' => 'u', 'ǚ' => 'u', 'Ǜ' => 'u', 'ǜ' => 'u', 'Ǻ' => 'a',
'ǻ' => 'a', 'Ǽ' => 'ae','ǽ' => 'ae','Ǿ' => 'o', 'ǿ' => 'o', 'ə' => 'e',
'Ё' => 'jo','Є' => 'e', 'І' => 'i', 'Ї' => 'i', 'А' => 'a', 'Б' => 'b',
'В' => 'v', 'Г' => 'g', 'Д' => 'd', 'Е' => 'e', 'Ж' => 'zh','З' => 'z',
'И' => 'i', 'Й' => 'j', 'К' => 'k', 'Л' => 'l', 'М' => 'm', 'Н' => 'n',
'О' => 'o', 'П' => 'p', 'Р' => 'r', 'С' => 's', 'Т' => 't', 'У' => 'u',
'Ф' => 'f', 'Х' => 'h', 'Ц' => 'c', 'Ч' => 'ch','Ш' => 'sh','Щ' => 'sch',
'Ъ' => '-', 'Ы' => 'y', 'Ь' => '-', 'Э' => 'je','Ю' => 'ju','Я' => 'ja',
'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e',
'ж' => 'zh','з' => 'z', 'и' => 'i', 'й' => 'j', 'к' => 'k', 'л' => 'l',
'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r', 'с' => 's',
'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'ch',
'ш' => 'sh','щ' => 'sch','ъ' => '-','ы' => 'y', 'ь' => '-', 'э' => 'je',
'ю' => 'ju','я' => 'ja','ё' => 'jo','є' => 'e', 'і' => 'i', 'ї' => 'i',
'Ґ' => 'g', 'ґ' => 'g', 'א' => 'a', 'ב' => 'b', 'ג' => 'g', 'ד' => 'd',
'ה' => 'h', 'ו' => 'v', 'ז' => 'z', 'ח' => 'h', 'ט' => 't', 'י' => 'i',
'ך' => 'k', 'כ' => 'k', 'ל' => 'l', 'ם' => 'm', 'מ' => 'm', 'ן' => 'n',
'נ' => 'n', 'ס' => 's', 'ע' => 'e', 'ף' => 'p', 'פ' => 'p', 'ץ' => 'C',
'צ' => 'c', 'ק' => 'q', 'ר' => 'r', 'ש' => 'w', 'ת' => 't', '™' => 'tm',
);
From magento, im using it for basically everything
I've searched and your idea for accent striping is quite awesome and cost-effective but your regex is wrongly done and misses 2 extra params. Long story short the regex must be:
$patterns[0] = '/[áâàåä]/ui';
$patterns[1] = '/[ðéêèë]/ui';
$patterns[2] = '/[íîìï]/ui';
$patterns[3] = '/[óôòøõö]/ui';
$patterns[4] = '/[úûùü]/ui';
$patterns[5] = '/æ/ui';
$patterns[6] = '/ç/ui';
$patterns[7] = '/ß/ui';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';
As you can see is quite similar but the most important thing is the paramas after the second slash of the regular expression. When a regualr expression is like this /[someCoolRegex]/ui
the u
specifies that it must use unicode and the i
specifies that is case insensitive, I've tested my own and with the ansewer in this forum I must say is more cost efective than using strtr.
Hope someone reads this answer.
Disclaimer: I'm not supporting this answer anymore (I was blind at that time). But thanks for the up-votes =P
You can take this as basis. From WordPress, used to generate pretty urls (the entry point is the slugify() function):
/**
* Converts all accent characters to ASCII characters.
*
* If there are no accent characters, then the string given is just returned.
*
* @param string $string Text that might have accent characters
* @return string Filtered string with replaced "nice" characters.
*/
function remove_accents($string) {
if (!preg_match('/[\x80-\xff]/', $string))
return $string;
if (seems_utf8($string)) {
$chars = array(
// Decompositions for Latin-1 Supplement
chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
chr(195).chr(182) => 'o', chr(195).chr(185) => 'u',
chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
chr(195).chr(191) => 'y',
// Decompositions for Latin Extended-A
chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
// Euro Sign
chr(226).chr(130).chr(172) => 'E',
// GBP (Pound) Sign
chr(194).chr(163) => '');
$string = strtr($string, $chars);
} else {
// Assume ISO-8859-1 if not UTF-8
$chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
.chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
.chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
.chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
.chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
.chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
.chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
.chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
.chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
.chr(252).chr(253).chr(255);
$chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
$string = strtr($string, $chars['in'], $chars['out']);
$double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
$double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
$string = str_replace($double_chars['in'], $double_chars['out'], $string);
}
return $string;
}
/**
* Checks to see if a string is utf8 encoded.
*
* @author bmorel at ssi dot fr
*
* @param string $Str The string to be checked
* @return bool True if $Str fits a UTF-8 model, false otherwise.
*/
function seems_utf8($Str) { # by bmorel at ssi dot fr
$length = strlen($Str);
for ($i = 0; $i < $length; $i++) {
if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n = 1; # 110bbbbb
elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n = 2; # 1110bbbb
elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n = 3; # 11110bbb
elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n = 4; # 111110bb
elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n = 5; # 1111110b
else return false; # Does not match any model
for ($j = 0; $j < $n; $j++) { # n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($Str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
function utf8_uri_encode($utf8_string, $length = 0) {
$unicode = '';
$values = array();
$num_octets = 1;
$unicode_length = 0;
$string_length = strlen($utf8_string);
for ($i = 0; $i < $string_length; $i++) {
$value = ord($utf8_string[$i]);
if ($value < 128) {
if ($length && ($unicode_length >= $length))
break;
$unicode .= chr($value);
$unicode_length++;
} else {
if (count($values) == 0) $num_octets = ($value < 224) ? 2 : 3;
$values[] = $value;
if ($length && ($unicode_length + ($num_octets * 3)) > $length)
break;
if (count( $values ) == $num_octets) {
if ($num_octets == 3) {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]) . '%' . dechex($values[2]);
$unicode_length += 9;
} else {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]);
$unicode_length += 6;
}
$values = array();
$num_octets = 1;
}
}
}
return $unicode;
}
/**
* Sanitizes title, replacing whitespace with dashes.
*
* Limits the output to alphanumeric characters, underscore (_) and dash (-).
* Whitespace becomes a dash.
*
* @param string $title The title to be sanitized.
* @return string The sanitized title.
*/
function slugify($title) {
$title = strip_tags($title);
// Preserve escaped octets.
$title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
// Remove percent signs that are not part of an octet.
$title = str_replace('%', '', $title);
// Restore octets.
$title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
$title = remove_accents($title);
if (seems_utf8($title)) {
if (function_exists('mb_strtolower')) {
$title = mb_strtolower($title, 'UTF-8');
}
$title = utf8_uri_encode($title, 200);
}
$title = strtolower($title);
$title = preg_replace('/&.+?;/', '', $title); // kill entities
$title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
$title = preg_replace('/\s+/', '-', $title);
$title = preg_replace('|-+|', '-', $title);
$title = trim($title, '-');
return $title;
}
strtolower
only works on iso-8859-1 encoded strings. You could try with mb_strtolower
.
Or, if you have to mangle with multibyte-extensions, you might as well use iconv's transliteration support:
iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text);
Edit:
It seems I was a bit fast. You appear to use iso-8859-1, so your current strategy will work. You just need to write the regexp's properly. Eg.:
'/(ð|é|ê|è|ë)/'
not:
'/[ð|é|ê|è|ë]/'
You can use PHP strtr() function to get rid of accented characters :
$string = "Éric Cantona";
$accented_array = array('Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E','Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U','Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c','è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o','ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$required_str = strtr( $string, $accented_array );
As an alternative (a bit more complex in nature through), have a look at how wordpress does accent removal. Made some changes below to make it run independently without referencing wordpress functions.
function mbstring_binary_safe_encoding($reset = false)
{
static $encodings = array();
static $overloaded = null;
if (is_null($overloaded)) {
$overloaded = function_exists('mb_internal_encoding') && (ini_get('mbstring.func_overload') & 2);
}
if (false === $overloaded) {
return;
}
if (!$reset) {
$encoding = mb_internal_encoding();
array_push($encodings, $encoding);
mb_internal_encoding('ISO-8859-1');
}
if ($reset && $encodings) {
$encoding = array_pop($encodings);
mb_internal_encoding($encoding);
}
}
function seems_utf8($str)
{
mbstring_binary_safe_encoding();
$length = strlen($str);
mbstring_binary_safe_encoding(true);
for ($i = 0; $i < $length; $i++) {
$c = ord($str[$i]);
if ($c < 0x80) {
$n = 0;
}
// 0bbbbbbb
elseif (($c & 0xE0) == 0xC0) {
$n = 1;
}
// 110bbbbb
elseif (($c & 0xF0) == 0xE0) {
$n = 2;
}
// 1110bbbb
elseif (($c & 0xF8) == 0xF0) {
$n = 3;
}
// 11110bbb
elseif (($c & 0xFC) == 0xF8) {
$n = 4;
}
// 111110bb
elseif (($c & 0xFE) == 0xFC) {
$n = 5;
}
// 1111110b
else {
return false;
}
// Does not match any model
for ($j = 0; $j < $n; $j++) {
// n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
return false;
}
}
}
return true;
}
function remove_accents($string)
{
if (!preg_match('/[\x80-\xff]/', $string)) {
return $string;
}
if (seems_utf8($string)) {
$chars = array(
// Decompositions for Latin-1 Supplement
'ª' => 'a', 'º' => 'o',
'À' => 'A', 'Á' => 'A',
'Â' => 'A', 'Ã' => 'A',
'Ä' => 'A', 'Å' => 'A',
'Æ' => 'AE', 'Ç' => 'C',
'È' => 'E', 'É' => 'E',
'Ê' => 'E', 'Ë' => 'E',
'Ì' => 'I', 'Í' => 'I',
'Î' => 'I', 'Ï' => 'I',
'Ð' => 'D', 'Ñ' => 'N',
'Ò' => 'O', 'Ó' => 'O',
'Ô' => 'O', 'Õ' => 'O',
'Ö' => 'O', 'Ù' => 'U',
'Ú' => 'U', 'Û' => 'U',
'Ü' => 'U', 'Ý' => 'Y',
'Þ' => 'TH', 'ß' => 's',
'à' => 'a', 'á' => 'a',
'â' => 'a', 'ã' => 'a',
'ä' => 'a', 'å' => 'a',
'æ' => 'ae', 'ç' => 'c',
'è' => 'e', 'é' => 'e',
'ê' => 'e', 'ë' => 'e',
'ì' => 'i', 'í' => 'i',
'î' => 'i', 'ï' => 'i',
'ð' => 'd', 'ñ' => 'n',
'ò' => 'o', 'ó' => 'o',
'ô' => 'o', 'õ' => 'o',
'ö' => 'o', 'ø' => 'o',
'ù' => 'u', 'ú' => 'u',
'û' => 'u', 'ü' => 'u',
'ý' => 'y', 'þ' => 'th',
'ÿ' => 'y', 'Ø' => 'O',
// Decompositions for Latin Extended-A
'Ā' => 'A', 'ā' => 'a',
'Ă' => 'A', 'ă' => 'a',
'Ą' => 'A', 'ą' => 'a',
'Ć' => 'C', 'ć' => 'c',
'Ĉ' => 'C', 'ĉ' => 'c',
'Ċ' => 'C', 'ċ' => 'c',
'Č' => 'C', 'č' => 'c',
'Ď' => 'D', 'ď' => 'd',
'Đ' => 'D', 'đ' => 'd',
'Ē' => 'E', 'ē' => 'e',
'Ĕ' => 'E', 'ĕ' => 'e',
'Ė' => 'E', 'ė' => 'e',
'Ę' => 'E', 'ę' => 'e',
'Ě' => 'E', 'ě' => 'e',
'Ĝ' => 'G', 'ĝ' => 'g',
'Ğ' => 'G', 'ğ' => 'g',
'Ġ' => 'G', 'ġ' => 'g',
'Ģ' => 'G', 'ģ' => 'g',
'Ĥ' => 'H', 'ĥ' => 'h',
'Ħ' => 'H', 'ħ' => 'h',
'Ĩ' => 'I', 'ĩ' => 'i',
'Ī' => 'I', 'ī' => 'i',
'Ĭ' => 'I', 'ĭ' => 'i',
'Į' => 'I', 'į' => 'i',
'İ' => 'I', 'ı' => 'i',
'IJ' => 'IJ', 'ij' => 'ij',
'Ĵ' => 'J', 'ĵ' => 'j',
'Ķ' => 'K', 'ķ' => 'k',
'ĸ' => 'k', 'Ĺ' => 'L',
'ĺ' => 'l', 'Ļ' => 'L',
'ļ' => 'l', 'Ľ' => 'L',
'ľ' => 'l', 'Ŀ' => 'L',
'ŀ' => 'l', 'Ł' => 'L',
'ł' => 'l', 'Ń' => 'N',
'ń' => 'n', 'Ņ' => 'N',
'ņ' => 'n', 'Ň' => 'N',
'ň' => 'n', 'ʼn' => 'n',
'Ŋ' => 'N', 'ŋ' => 'n',
'Ō' => 'O', 'ō' => 'o',
'Ŏ' => 'O', 'ŏ' => 'o',
'Ő' => 'O', 'ő' => 'o',
'Œ' => 'OE', 'œ' => 'oe',
'Ŕ' => 'R', 'ŕ' => 'r',
'Ŗ' => 'R', 'ŗ' => 'r',
'Ř' => 'R', 'ř' => 'r',
'Ś' => 'S', 'ś' => 's',
'Ŝ' => 'S', 'ŝ' => 's',
'Ş' => 'S', 'ş' => 's',
'Š' => 'S', 'š' => 's',
'Ţ' => 'T', 'ţ' => 't',
'Ť' => 'T', 'ť' => 't',
'Ŧ' => 'T', 'ŧ' => 't',
'Ũ' => 'U', 'ũ' => 'u',
'Ū' => 'U', 'ū' => 'u',
'Ŭ' => 'U', 'ŭ' => 'u',
'Ů' => 'U', 'ů' => 'u',
'Ű' => 'U', 'ű' => 'u',
'Ų' => 'U', 'ų' => 'u',
'Ŵ' => 'W', 'ŵ' => 'w',
'Ŷ' => 'Y', 'ŷ' => 'y',
'Ÿ' => 'Y', 'Ź' => 'Z',
'ź' => 'z', 'Ż' => 'Z',
'ż' => 'z', 'Ž' => 'Z',
'ž' => 'z', 'ſ' => 's',
// Decompositions for Latin Extended-B
'Ș' => 'S', 'ș' => 's',
'Ț' => 'T', 'ț' => 't',
// Euro Sign
'€' => 'E',
// GBP (Pound) Sign
'£' => '',
// Vowels with diacritic (Vietnamese)
// unmarked
'Ơ' => 'O', 'ơ' => 'o',
'Ư' => 'U', 'ư' => 'u',
// grave accent
'Ầ' => 'A', 'ầ' => 'a',
'Ằ' => 'A', 'ằ' => 'a',
'Ề' => 'E', 'ề' => 'e',
'Ồ' => 'O', 'ồ' => 'o',
'Ờ' => 'O', 'ờ' => 'o',
'Ừ' => 'U', 'ừ' => 'u',
'Ỳ' => 'Y', 'ỳ' => 'y',
// hook
'Ả' => 'A', 'ả' => 'a',
'Ẩ' => 'A', 'ẩ' => 'a',
'Ẳ' => 'A', 'ẳ' => 'a',
'Ẻ' => 'E', 'ẻ' => 'e',
'Ể' => 'E', 'ể' => 'e',
'Ỉ' => 'I', 'ỉ' => 'i',
'Ỏ' => 'O', 'ỏ' => 'o',
'Ổ' => 'O', 'ổ' => 'o',
'Ở' => 'O', 'ở' => 'o',
'Ủ' => 'U', 'ủ' => 'u',
'Ử' => 'U', 'ử' => 'u',
'Ỷ' => 'Y', 'ỷ' => 'y',
// tilde
'Ẫ' => 'A', 'ẫ' => 'a',
'Ẵ' => 'A', 'ẵ' => 'a',
'Ẽ' => 'E', 'ẽ' => 'e',
'Ễ' => 'E', 'ễ' => 'e',
'Ỗ' => 'O', 'ỗ' => 'o',
'Ỡ' => 'O', 'ỡ' => 'o',
'Ữ' => 'U', 'ữ' => 'u',
'Ỹ' => 'Y', 'ỹ' => 'y',
// acute accent
'Ấ' => 'A', 'ấ' => 'a',
'Ắ' => 'A', 'ắ' => 'a',
'Ế' => 'E', 'ế' => 'e',
'Ố' => 'O', 'ố' => 'o',
'Ớ' => 'O', 'ớ' => 'o',
'Ứ' => 'U', 'ứ' => 'u',
// dot below
'Ạ' => 'A', 'ạ' => 'a',
'Ậ' => 'A', 'ậ' => 'a',
'Ặ' => 'A', 'ặ' => 'a',
'Ẹ' => 'E', 'ẹ' => 'e',
'Ệ' => 'E', 'ệ' => 'e',
'Ị' => 'I', 'ị' => 'i',
'Ọ' => 'O', 'ọ' => 'o',
'Ộ' => 'O', 'ộ' => 'o',
'Ợ' => 'O', 'ợ' => 'o',
'Ụ' => 'U', 'ụ' => 'u',
'Ự' => 'U', 'ự' => 'u',
'Ỵ' => 'Y', 'ỵ' => 'y',
// Vowels with diacritic (Chinese, Hanyu Pinyin)
'ɑ' => 'a',
// macron
'Ǖ' => 'U', 'ǖ' => 'u',
// acute accent
'Ǘ' => 'U', 'ǘ' => 'u',
// caron
'Ǎ' => 'A', 'ǎ' => 'a',
'Ǐ' => 'I', 'ǐ' => 'i',
'Ǒ' => 'O', 'ǒ' => 'o',
'Ǔ' => 'U', 'ǔ' => 'u',
'Ǚ' => 'U', 'ǚ' => 'u',
// grave accent
'Ǜ' => 'U', 'ǜ' => 'u',
);
$string = strtr($string, $chars);
} else {
$chars = array();
// Assume ISO-8859-1 if not UTF-8
$chars['in'] = "\x80\x83\x8a\x8e\x9a\x9e"
. "\x9f\xa2\xa5\xb5\xc0\xc1\xc2"
. "\xc3\xc4\xc5\xc7\xc8\xc9\xca"
. "\xcb\xcc\xcd\xce\xcf\xd1\xd2"
. "\xd3\xd4\xd5\xd6\xd8\xd9\xda"
. "\xdb\xdc\xdd\xe0\xe1\xe2\xe3"
. "\xe4\xe5\xe7\xe8\xe9\xea\xeb"
. "\xec\xed\xee\xef\xf1\xf2\xf3"
. "\xf4\xf5\xf6\xf8\xf9\xfa\xfb"
. "\xfc\xfd\xff";
$chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
$string = strtr($string, $chars['in'], $chars['out']);
$double_chars = array();
$double_chars['in'] = array("\x8c", "\x9c", "\xc6", "\xd0", "\xde", "\xdf", "\xe6", "\xf0", "\xfe");
$double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
$string = str_replace($double_chars['in'], $double_chars['out'], $string);
}
return $string;
}
I know, that question has been asked a long long time ago...
I was looking for a short and elegant solution, but couldn't find satisfaction for two reasons:
First, most of the existing solutions replace a list of characters by a list of other characters. Unfortunately, it require to use a specific encoding for the php script file itself which might be unwanted.
Second, using iconv seems to be a good way, but it's not enough as the result of a converted character could be one or two characters, or a Fatal Exception.
So I wrote that small function which does the job :
function replaceAccent($string, $replacement = '_')
{
$alnumPattern = '/^[a-zA-Z0-9 ]+$/';
if (preg_match($alnumPattern, $string)) {
return $string;
}
$ret = array_map(
function ($chr) use ($alnumPattern, $replacement) {
if (preg_match($alnumPattern, $chr)) {
return $chr;
} else {
$chr = @iconv('ISO-8859-1', 'ASCII//TRANSLIT', $chr);
if (strlen($chr) == 1) {
return $chr;
} elseif (strlen($chr) > 1) {
$ret = '';
foreach (str_split($chr) as $char2) {
if (preg_match($alnumPattern, $char2)) {
$ret .= $char2;
}
}
return $ret;
} else {
// replace whatever iconv fail to convert by something else
return $replacement;
}
}
},
str_split($string)
);
return implode($ret);
}
Vietnamese characters for those who need them
'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With