
Different regex output on 2 PHP systems?

Tags: regex, php, unicode

Given this test script:

<?php

echo setlocale(LC_ALL, '') . "\n";

$in = 'Città';

$var = preg_replace('/\s+$/', '', $in);

echo bin2hex($in) . "\n";
echo bin2hex($var) . "\n";

On PHP 5.5.3 on Ubuntu, I get:

en_GB.UTF-8
43697474c3a0
43697474c3a0

On PHP 5.5.9 on Mac (via MacPorts), I get:

en_GB.UTF-8
43697474c3a0
43697474c3

Is there any reason why the MacPorts build would treat the à character differently?

I'm aware that c3a0, when treated as two single-byte characters (ISO-8859-1 rather than UTF-8), is Ã followed by a non-breaking space. What I'm wondering is why one system appears to treat the two bytes as a single UTF-8 character even without the u modifier.

Tim Jones asked Nov 10 '22 09:11



1 Answer

Use the /u modifier:

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8.

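By default, the subject string is treated as a sequence of single-byte characters, but some of your characters are encoded as multiple bytes in UTF-8. The bytes 0xc3 0xa0 together encode a single code point (à, U+00E0), yet in byte mode \s can match the second byte, 0xa0, on its own: taken as a single byte it is a non-breaking space, which some builds count as whitespace. Which bytes count as whitespace in this mode can depend on the locale-derived character tables PCRE is using, which would explain why the two systems disagree.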
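To make the byte-level view concrete, here is a minimal sketch that inspects the encoding of à and runs \s against it in UTF-8 mode. The variable name $aGrave is only illustrative, and the bytes are written as escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// "à" (U+00E0) written as its two UTF-8 bytes.
$aGrave = "\xC3\xA0";

echo bin2hex($aGrave) . "\n";             // c3a0
echo strlen($aGrave) . "\n";              // 2 (strlen counts bytes, not characters)
echo bin2hex(substr($aGrave, 1)) . "\n";  // a0 (the byte that \s may match in byte mode)

// In UTF-8 mode the two bytes form one character, and that character is a letter.
var_dump(preg_match('/^.$/u', $aGrave));  // int(1)
var_dump(preg_match('/\s/u', $aGrave));   // int(0)

// In byte mode the answer for a lone 0xa0 can differ between systems,
// so no expected output is given here.
var_dump(preg_match('/\s/', "\xA0"));
```

Running the last line on both machines is a quick way to check whether the two builds disagree about 0xa0 being whitespace.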

With the modifier added, the call becomes:

$var = preg_replace('/\s+$/u', '', $in);

This enables UTF-8 mode for matching, so the result should be the same on both systems.
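As a rough sketch of how this could be wrapped up for reuse: the helper name rtrimUtf8 below is hypothetical (it is not from the question or from PHP itself), and falling back to the original string on invalid UTF-8 is an assumption about what the caller wants.

```php
<?php
// Hypothetical helper: strip trailing whitespace from a UTF-8 string.
function rtrimUtf8($s)
{
    // /u makes PCRE read the pattern and subject as UTF-8,
    // so \s is tested against whole characters rather than single bytes.
    $out = preg_replace('/\s+$/u', '', $s);

    // With /u, preg_replace() returns null when the subject is not valid UTF-8.
    // Assumption: in that case, hand the input back unchanged.
    return $out === null ? $s : $out;
}

$in = "Citt\xC3\xA0"; // "Città" as UTF-8 bytes

echo bin2hex(rtrimUtf8($in)) . "\n";            // 43697474c3a0
echo bin2hex(rtrimUtf8($in . " \t\n")) . "\n";  // 43697474c3a0
```

The null check matters because /u is strict about input: a string that is not valid UTF-8 makes the call fail rather than silently fall back to byte matching.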

Piskvor left the building answered Nov 14 '22 23:11
