
PHP iconv translit for removing accents: not working as expected?

Consider this simple code:

echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');

it prints

 `e

instead of just

 e

Do you know what I am doing wrong?


Nothing changed after adding setlocale:

setlocale(LC_COLLATE, 'en_US.utf8');
echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');
dynamic asked Feb 06 '11 00:02



2 Answers

I use this standard function to turn a string into a valid URL slug, free of invalid URL characters. The magic seems to be in the line after the "// remove unwanted characters" comment: it strips whatever non-word characters (such as the stray backtick) the transliteration leaves behind.

This is taken from the Symfony framework documentation: http://www.symfony-project.org/jobeet/1_4/Doctrine/en/08 which in turn is taken from http://php.vrana.cz/vytvoreni-pratelskeho-url.php but I don't speak Czech ;-)

function slugify($text)
{
  // replace non letter or digits by -
  $text = preg_replace('#[^\\pL\d]+#u', '-', $text);

  // trim
  $text = trim($text, '-');

  // transliterate
  if (function_exists('iconv'))
  {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
  }

  // lowercase
  $text = strtolower($text);

  // remove unwanted characters
  $text = preg_replace('#[^-\w]+#', '', $text);

  if (empty($text))
  {
    return 'n-a';
  }

  return $text;
}

echo slugify('é'); // --> "e"
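
To make the role of that last step visible, here is a minimal sketch that isolates it. The helper names strip_translit_leftovers and ascii_fold are hypothetical (they are not part of Symfony or PHP), and what iconv emits for an accented letter depends on the iconv implementation and the locale, so no output is claimed for accented input.

```php
<?php
// Hypothetical helper: drop whatever transliteration left behind
// (for example the backtick in "`e"), keeping word characters and '-'.
function strip_translit_leftovers(string $ascii): string
{
    return preg_replace('#[^-\w]+#', '', $ascii) ?? '';
}

// Hypothetical helper: transliterate if iconv is available, then clean up.
function ascii_fold(string $text): string
{
    if (function_exists('iconv')) {
        // iconv() returns false on failure; keep the input in that case.
        $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
        if ($converted !== false) {
            $text = $converted;
        }
    }
    return strip_translit_leftovers($text);
}

echo strip_translit_leftovers('`e'), "\n"; // prints "e"
```

The cleanup step is what turns the question's "`e" into "e", whatever punctuation a given iconv build happens to produce.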
Hidde answered Nov 15 '22 15:11



Cf. @tchrist: with the PHP intl extension, you can decompose the string and then strip the combining marks.

http://fr2.php.net/manual/en/book.intl.php

preg_replace('/\pM*/u','',normalizer_normalize( $mystring, Normalizer::FORM_D));

eéèêëiîïoöôuùûüaâäÅ Ἥ ŐǟǠ ǺƶƈƉųŪŧȬƀ␢ĦŁȽŦ ƀǖ becomes

eeeeeiiiooouuuuaaaA Η OaA AƶƈƉuUŧOƀ␢ĦŁȽŦ ƀu
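
The one-liner above can be wrapped as follows. This is only a sketch: the function names strip_combining_marks and remove_accents are hypothetical, and the wrapper assumes that skipping the normalization step is acceptable when the intl extension is missing.

```php
<?php
// Remove every combining mark (Unicode category M) from a UTF-8 string.
function strip_combining_marks(string $text): string
{
    return preg_replace('/\pM+/u', '', $text) ?? $text;
}

// Hypothetical wrapper: decompose to NFD first (needs the intl extension),
// then strip the marks that the decomposition exposed.
function remove_accents(string $text): string
{
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
        if ($decomposed !== false) {
            $text = $decomposed;
        }
    }
    return strip_combining_marks($text);
}

echo strip_combining_marks("e\u{0301}"), "\n"; // "e" + U+0301 prints "e"
```

Without the decomposition step, a precomposed character such as U+00E9 contains no separate combining mark, so the regular expression alone would leave it untouched.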


As tchrist emphasises, not all Unicode characters are considered decomposable.

Extract from the Unicode charts:

U0080.pdf

00CF Ï LATIN CAPITAL LETTER I WITH DIAERESIS

≡ 0049 I 0308 ¨

NB: the symbol « ≡ » indicates an available decomposition.

00D0 Ð LATIN CAPITAL LETTER ETH

→ 00F0 ð latin small letter eth

→ 0110 Đ latin capital letter d with stroke

→ 0189 Ɖ latin capital letter african d

No decomposition is available here, strangely IMHO (we could consider the ASCII letter D an acceptable equivalent).

U0100.pdf

0110 Đ LATIN CAPITAL LETTER D WITH STROKE

→ 00D0 Ð latin capital letter eth

→ 0111 đ latin small letter d with stroke

→ 0189 Ɖ latin capital letter african d

Even stranger: this one is identified as LATIN CAPITAL LETTER D (with stroke), but is not decomposable as such! Perhaps a cooler solution would be to get the Unicode description of each character, compare it with the description of each ASCII character, and replace accordingly. Anyone? ;-]

cf http://unicode.org/Public/UNIDATA/UnicodeData.txt
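
The name-matching idea could be sketched like this. It is an untested-in-production illustration, not an established method: base_letter_from_name and fold_by_name are hypothetical names, the name pattern is an assumption about how Latin letter names are formed, and the lookup relies on IntlChar::charName from the intl extension (PHP 7 or later).

```php
<?php
// Hypothetical helper: map a Unicode character name such as
// "LATIN CAPITAL LETTER D WITH STROKE" to its ASCII base letter, or null
// when the name does not follow the "LETTER <X> [WITH ...]" pattern.
function base_letter_from_name(string $name): ?string
{
    if (!preg_match('/^LATIN (CAPITAL|SMALL) LETTER ([A-Z])(?: WITH .+)?$/', $name, $m)) {
        return null;
    }
    return $m[1] === 'SMALL' ? strtolower($m[2]) : $m[2];
}

// Hypothetical wrapper: replace each non-ASCII character whose name matches.
function fold_by_name(string $text): string
{
    if (!class_exists('IntlChar')) {
        return $text; // intl extension missing: leave the text unchanged
    }
    $out = '';
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $char) {
        // Single-byte characters are already ASCII and are kept as-is.
        $name = strlen($char) > 1 ? (string) IntlChar::charName($char) : '';
        $out .= base_letter_from_name($name) ?? $char;
    }
    return $out;
}

echo base_letter_from_name('LATIN CAPITAL LETTER D WITH STROKE'), "\n"; // prints "D"
```

With this pattern, 0110 (D WITH STROKE) maps to D, while 00D0 (ETH) and 0189 (AFRICAN D) stay unchanged because their names do not contain a bare base letter.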

eleg answered Nov 15 '22 14:11
