
preg_replace isn't working for some words/characters

$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = preg_replace('/(?<=\b)موسیٰ(?=\b)/u', 'Musa', $str);
$str = preg_replace('/(?<=\b)سنا(?=\b)/u', 'suna', $str);
echo $str;

This fails to replace موسیٰ. It should give کس نے Musa کے بارے میں suna ہے؟ but instead gives کس نے موسیٰ کے بارے میں suna ہے؟.

This is happening for all words that end with a ٰ, like تعالیٰ . It works for words where ٰ is in the middle of the word (no words begin with a ٰ). Does this mean that \b just doesn't work with ٰ? Is it a bug?

asked Oct 29 '22 07:10 by twharmon


1 Answer

The reason is that a word boundary matches in the following positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.
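As a rough illustration of these three positions, the sketch below lists the byte offsets at which \b matches in a short ASCII string. The sample string and variable names are made up for this example and are not part of the question.

```php
<?php
// Hypothetical sample: two words separated by a comma and a space.
$subject = 'ab, cd';

// PREG_OFFSET_CAPTURE returns each match as [text, byte offset].
preg_match_all('/\b/', $subject, $matches, PREG_OFFSET_CAPTURE);

// Keep only the offsets of the (empty) boundary matches.
$offsets = array_column($matches[0], 1);

echo implode(', ', $offsets), "\n"; // 0, 2, 4, 6
```

Offset 0 is the start of the string before a word character, 6 is the end of the string after one, and 2 and 4 lie between a word character and a non-word character.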

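The "offending" character is U+0670 ARABIC LETTER SUPERSCRIPT ALEF, which belongs to \p{Mn} (the nonspacing mark Unicode category) and is thus a non-word character. After it, \b matches only if the next character belongs to \w (letter, digit, _). In موسیٰ it is followed by a space, which is also a non-word character, so there is no word boundary at that position and (?=\b) fails.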
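One way to check the category of U+0670 yourself is to test it against Unicode property classes. This is only a sketch with made-up variable names. The \w result is printed rather than assumed, because what \w accepts under /u may depend on the PCRE version your PHP build uses.

```php
<?php
// U+0670 ARABIC LETTER SUPERSCRIPT ALEF, written as an escape (PHP 7+).
$alef = "\u{0670}";

// Despite "LETTER" in its name, its general category is Mn, not L.
$isMark   = preg_match('/^\p{Mn}$/u', $alef); // 1: nonspacing mark
$isLetter = preg_match('/^\p{L}$/u', $alef);  // 0: not a letter

// Whether \w accepts it is left for you to observe on your own setup.
$isWord = preg_match('/^\w$/u', $alef);

var_dump($isMark, $isLetter, $isWord);
```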

Use unambiguous word boundaries instead, which match only if the search phrase is not immediately preceded or followed by a word character:

$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = preg_replace('/(?<!\w)موسیٰ(?!\w)/u', 'Musa', $str);
$str = preg_replace('/(?<!\w)سنا(?!\w)/u', 'suna', $str);
echo $str; // => کس نے Musa کے بارے میں suna ہے؟


(?<!\w) is a negative lookbehind that makes sure there is no word character immediately before the consuming pattern that follows it, and (?!\w) is a negative lookahead that makes sure there is no word character immediately after the consuming pattern that precedes it.
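If you need this for many words, the two lookarounds can be wrapped in a small helper. The function name replace_whole_word and the sample strings are hypothetical, so treat this as a sketch rather than a drop-in solution.

```php
<?php
// Hypothetical helper (not from the question): replaces $word only where
// no word character touches it on either side.
function replace_whole_word(string $word, string $replacement, string $subject): string
{
    // preg_quote() escapes regex metacharacters; '/' is the delimiter used here.
    $pattern = '/(?<!\w)' . preg_quote($word, '/') . '(?!\w)/u';

    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, $replacement, $subject) ?? $subject;
}

// "cat" inside "concat" and "cats" is left alone.
echo replace_whole_word('cat', 'dog', 'cat concat cats cat.'), "\n"; // dog concat cats dog.

// A word ending in U+0670, built from escapes: meem, waw, seen, yeh, superscript alef.
$musa = "\u{0645}\u{0648}\u{0633}\u{06CC}\u{0670}";
echo replace_whole_word($musa, 'Musa', "a $musa b"), "\n"; // a Musa b
```

Because the lookarounds only assert that no word character is adjacent, they still succeed when the word ends in a non-word character such as U+0670 and is followed by a space.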

answered Nov 15 '22 07:11 by Wiktor Stribiżew
