Given this test script:
<?php
echo setlocale(LC_ALL, '') . "\n";
$in = 'Città';
$var = preg_replace('/\s+$/', '', $in);
echo bin2hex($in) . "\n";
echo bin2hex($var) . "\n";
With PHP 5.5.3 on Ubuntu, I get:
en_GB.UTF-8
43697474c3a0
43697474c3a0
With PHP 5.5.9 on a Mac (via MacPorts), I get:
en_GB.UTF-8
43697474c3a0
43697474c3
Is there any reason why the MacPorts build treats the à character differently?
I'm aware that c3a0, when treated as two single-byte characters (e.g. in ISO-8859-1), is Ã followed by a non-breaking space. What I'm wondering is why one system appears to treat the two bytes as UTF-8 even without the u modifier.
Use the /u modifier:
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8.
By default, the string is treated as a sequence of single-byte characters; the problem is that some of your characters are encoded as multiple bytes in UTF-8. While the byte pair 0xC3 0xA0 encodes a single codepoint (à, U+00E0), \s can match on its second byte, 0xA0, which on its own is a non-breaking space, and therefore whitespace. Whether \s actually matches 0xA0 depends on the character tables your PCRE build uses for the current locale, which is likely why the two systems behave differently.
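The byte-level mechanism can be reproduced outside PHP. This Python sketch (an illustration of the same effect, not the PHP/PCRE code paths themselves) shows that the UTF-8 encoding of 'Città' ends in 0xC3 0xA0, and that once those bytes are read as single-byte characters, the trailing 0xA0 becomes a non-breaking space that \s happily strips:

```python
import re

s = "Città"
b = s.encode("utf-8")
print(b.hex())  # 43697474c3a0 — 'à' is the two bytes 0xC3 0xA0

# Read the bytes as single-byte Latin-1 characters: the second byte of
# 'à' becomes U+00A0 (non-breaking space), which \s matches, so the
# regex eats half of the character:
latin1 = b.decode("latin-1")                      # 'CittÃ\xa0'
broken = re.sub(r"\s+$", "", latin1)
print(broken.encode("latin-1").hex())             # 43697474c3

# Matching on the properly decoded string leaves 'à' intact — the
# analogue of adding PHP's /u modifier:
intact = re.sub(r"\s+$", "", s)
print(intact.encode("utf-8").hex())               # 43697474c3a0
```

The truncated output 43697474c3 is exactly the hex the asker saw on the Mac, and the intact output matches the Ubuntu result.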
$var = preg_replace('/\s+$/u', '', $in);
enables UTF-8 mode for matching, and it should behave consistently on all systems.