
Strange PHP UTF-8 Behaviour

I have the following test PHP code:

header('Content-type: text/html; charset=utf-8');

$text = 'Développeur Web';
var_dump($text);

$text = preg_replace('#[^\\pL\d]+#u', '-', $text);
var_dump($text);

$text = trim($text, '-');
var_dump($text);

$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
var_dump($text);

$text = strtolower($text);
var_dump($text);

$text = preg_replace('#[^-\w]+#', '', $text);
var_dump($text);

On my local machine it's working as expected:

string(16) "Développeur Web" 
string(16) "Développeur-Web" 
string(16) "Développeur-Web" 
string(16) "D'eveloppeur-Web" 
string(16) "d'eveloppeur-web" 
string(15) "developpeur-web" 

but on my live server it's behaving strangely:

string 'Développeur Web' (length=16)
string '-pp-' (length=4)
string 'pp' (length=2)
string 'pp' (length=2)
string 'pp' (length=2)
string 'pp' (length=2)

The local machine is Windows running PHP 5.2.4 and the live server is CentOS running PHP 5.2.10, so they aren't identical by any means. Not ideal, I know.

Has anyone experienced anything similar and can point me in the right direction? I'm assuming it's some kind of server or PHP configuration related to UTF-8 or locale.

Many thanks in advance

Peter Hough asked Nov 11 '10 11:11


1 Answer

Shouldn't it be

$text = preg_replace('#[^\pL\d]+#u', '-', $text);

on line 6? If the \ were escaped, the exclusion class would contain a literal \. The regex [^\\pL\d]+ would then match one or more characters that are not \, p, L, or a digit. This would explain why "Développeur Web" is reduced to "-pp-": everything up to the first p matches and is replaced by a -, and the same is true for everything after the second p.
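To make the reasoning above concrete, here is a minimal sketch of a pattern whose exclusion class really does contain a literal \. The variable name $literalBackslash is made up for this illustration, and the four backslashes are an assumption about how such a pattern would have to be written in a single-quoted PHP string. It only reproduces the "-pp-" result; it does not prove that this is what happens on the live server.

```php
<?php
// "Développeur Web" with the é written as explicit UTF-8 bytes.
$text = "D\xC3\xA9veloppeur Web";

// Four backslashes in a single-quoted PHP string become two in the regex,
// which PCRE reads as one literal backslash inside the character class.
// The class therefore excludes only \, p, L and digits.
$literalBackslash = preg_replace('#[^\\\\pL\d]+#', '-', $text);

var_dump($literalBackslash); // string(4) "-pp-"
```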

Perhaps the two machines differ in how an escaped \ is treated.

EDIT after OP comment:

Actually, escaping the \ is not the problem here: both versions are treated the same way. The actual problem seems to be that the PCRE version in use does not support Unicode properties, i.e. it wasn't compiled with --enable-unicode-properties.
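A rough way to check both points on a given machine is sketched below. The variable names $samePattern and $hasUnicodeProps are invented for this example, and the probe assumes that preg_match fails to compile a \p pattern when Unicode property (or UTF-8) support is missing, which is how PCRE normally reports it.

```php
<?php
// In a single-quoted string, '\\p' and '\p' both yield the two characters \p,
// so the two pattern literals are the same string.
$samePattern = ('#[^\\pL\d]+#u' === '#[^\pL\d]+#u');
var_dump($samePattern); // bool(true)

// Probe for \p support: if PCRE was built without Unicode properties (or
// without UTF-8 support), the pattern does not compile and preg_match
// returns false. The @ only hides the resulting warning.
$hasUnicodeProps = (@preg_match('/^\pL$/u', "\xC3\xA9") === 1);

echo 'PCRE version: ', PCRE_VERSION, "\n";
echo 'Unicode properties: ', $hasUnicodeProps ? 'yes' : 'no', "\n";
```

Running this on both the local machine and the live server should show whether they differ in PCRE version or in \p support.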

Stefan Gehrig answered Nov 18 '22 21:11