non-breaking utf-8 0xc2a0 space and preg_replace strange behaviour

Tags: regex, php

My string contains a UTF-8 non-breaking space (bytes 0xC2 0xA0) and I want to replace it with something else.

When I use

$str=preg_replace('~\xc2\xa0~', 'X', $str);

it works OK.

But when I use

$str=preg_replace('~\x{C2A0}~siu', 'W', $str);

the non-breaking space is not found (and so not replaced).

Why? What is wrong with the second regexp?

As far as I can tell the \x{C2A0} format is correct, and I used the u flag.

DamirR asked Oct 11 '12 10:10


2 Answers

The PHP documentation about escape sequences is easy to misread on this point. Without the u flag, \xc2\xa0 matches the two raw bytes 0xC2 0xA0, which happen to be the UTF-8 encoding of the non-breaking space. With the u flag, \x{c2a0} is read as a Unicode code point, U+C2A0 (a Hangul syllable, encoded in UTF-8 as 0xEC 0x8A 0xA0), not as a byte sequence.

The non-breaking space is the code point U+00A0; C2 A0 is only its UTF-8 encoding. So if you try the pattern ~\x{00a0}~siu, it will work as expected.
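
To make the bytes-versus-code-points distinction concrete, here is a minimal sketch. It assumes PHP 7 or later (for the "\u{...}" string escape) and a test string invented for illustration; the variable names are not from the question.

```php
<?php
// "a", a UTF-8 non-breaking space (bytes C2 A0), then "b".
$str = "a\xc2\xa0b";

// No u flag: the pattern matches the two raw bytes C2 A0.
$bytes = preg_replace('~\xc2\xa0~', 'X', $str);

// u flag: \x{...} names a code point. U+C2A0 is not in the string,
// so nothing is replaced.
$wrong = preg_replace('~\x{C2A0}~u', 'W', $str);

// u flag with the real code point of the non-breaking space, U+00A0.
$right = preg_replace('~\x{00A0}~u', 'W', $str);

var_dump($bytes);          // string(3) "aXb"
var_dump($wrong === $str); // bool(true)
var_dump($right);          // string(3) "aWb"

// The UTF-8 encodings of the two code points differ:
echo bin2hex("\u{00A0}"), "\n"; // c2a0
echo bin2hex("\u{C2A0}"), "\n"; // ec8aa0
```

The last two lines show why the second pattern in the question finds nothing: U+C2A0 is a different character, with a different three-byte encoding.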

Newbo.O answered Oct 22 '22 14:10


I've aggregated the previous answers so people can copy and paste the following code and pick their favorite method:

$some_text_with_non_breaking_spaces = "\xc2\xa0\xc2\xa0some text with 2 non breaking spaces at the beginning";
echo 'Qty non-breaking space : ' . substr_count($some_text_with_non_breaking_spaces, "\xc2\xa0") . '<br>';
echo $some_text_with_non_breaking_spaces . '<br>';

# Method 1 : regular expression
$clean_text = preg_replace('~\x{00a0}~siu', ' ', $some_text_with_non_breaking_spaces);

# Method 2 : convert to hex -> replace -> convert back to binary
# (caution: 'c2a0' can also match across byte boundaries in the hex string)
$clean_text = hex2bin(str_replace('c2a0', '20', bin2hex($some_text_with_non_breaking_spaces)));

# Method 3 : my favorite
$clean_text = str_replace("\xc2\xa0", " ", $some_text_with_non_breaking_spaces);

echo 'Qty non-breaking space : ' . substr_count($clean_text, "\xc2\xa0"). '<br>';
echo $clean_text . '<br>';
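
A caveat on Method 2, as a small sketch: the replacement runs over hex digits rather than whole bytes, so 'c2a0' can match where no non-breaking space exists. The sample string below is made up for illustration and deliberately contains no non-breaking space.

```php
<?php
// No non-breaking space here, so a correct cleaner must return the text unchanged.
$text = "Email*\n";

// Hex round-trip (Method 2). bin2hex gives "456d61696c2a0a", and "c2a0"
// matches inside it, straddling the bytes 6c 2a 0a.
$viaHex = hex2bin(str_replace('c2a0', '20', bin2hex($text)));

// Byte-level replacement (Method 3) only matches the real byte pair C2 A0.
$viaBytes = str_replace("\xc2\xa0", ' ', $text);

var_dump($viaHex === $text);   // bool(false)
var_dump($viaBytes === $text); // bool(true)
echo bin2hex($viaHex), "\n";   // 456d6169620a
```

Here the hex approach silently turns "Email*\n" into "Emaib\n", which is one reason to prefer Method 1 or Method 3.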

hugsbrugs answered Oct 22 '22 14:10