Given this test script:
<?php
echo setlocale(LC_ALL, '') . "\n";
$in = 'Città';
$var = preg_replace('/\s+$/', '', $in);
echo bin2hex($in) . "\n";
echo bin2hex($var) . "\n";
With PHP 5.5.3 on Ubuntu, I get:
en_GB.UTF-8
43697474c3a0
43697474c3a0
With PHP 5.5.9 on a Mac (via MacPorts), I get:
en_GB.UTF-8
43697474c3a0
43697474c3
Is there any reason why the MacPorts build treats the à character differently?
I'm aware that c3a0, when treated as two single-byte characters (e.g. in ISO-8859-1), is Ã followed by a non-breaking space. What I'm wondering is why one system appears to treat the two bytes as UTF-8 even without the u modifier.
Use the /u modifier:
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8.
By default, the string is treated as a sequence of single-byte characters; the problem is that some of your characters are encoded as multiple bytes in UTF-8. While the byte pair 0xC3 0xA0 encodes a single codepoint (à, U+00E0), \s can match on its second byte, 0xA0, which on its own is a non-breaking space, and therefore whitespace. Whether \s actually matches 0xA0 depends on the character tables your PCRE build uses for the current locale, which is likely why the two systems behave differently.
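The byte-level mechanism can be reproduced outside PHP. This Python sketch (an illustration of the same effect, not the PHP/PCRE code paths themselves) shows that the UTF-8 encoding of 'Città' ends in 0xC3 0xA0, and that once those bytes are read as single-byte characters, the trailing 0xA0 becomes a non-breaking space that \s happily strips:

```python
import re

s = "Città"
b = s.encode("utf-8")
print(b.hex())  # 43697474c3a0 — 'à' is the two bytes 0xC3 0xA0

# Read the bytes as single-byte Latin-1 characters: the second byte of
# 'à' becomes U+00A0 (non-breaking space), which \s matches, so the
# regex eats half of the character:
latin1 = b.decode("latin-1")                      # 'CittÃ\xa0'
broken = re.sub(r"\s+$", "", latin1)
print(broken.encode("latin-1").hex())             # 43697474c3

# Matching on the properly decoded string leaves 'à' intact — the
# analogue of adding PHP's /u modifier:
intact = re.sub(r"\s+$", "", s)
print(intact.encode("utf-8").hex())               # 43697474c3a0
```

The truncated output 43697474c3 is exactly the hex the asker saw on the Mac, and the intact output matches the Ubuntu result.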
$var = preg_replace('/\s+$/u', '', $in);
enables UTF-8 mode for matching, and it should behave consistently on all systems.