$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = preg_replace('/(?<=\b)موسیٰ(?=\b)/u', 'Musa', $str);
$str = preg_replace('/(?<=\b)سنا(?=\b)/u', 'suna', $str);
echo $str;
This fails to replace موسیٰ. It should give کس نے Musa کے بارے میں suna ہے؟ but instead gives کس نے موسیٰ کے بارے میں suna ہے؟.
This is happening for all words that end with a ٰ, like تعالیٰ. It works for words where ٰ is in the middle of the word (no words begin with a ٰ). Does this mean that \b just doesn't work with ٰ? Is it a bug?
The reason is that a word boundary matches in the following positions:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
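As a rough illustration of these three rules, the sketch below lists every offset at which \b matches in a short ASCII string. The sample string 'ab, cd' is made up for this example and does not come from the question.

```php
<?php
// Hypothetical sample string, chosen only to illustrate the three rules.
$sample = 'ab, cd';

// Collect the byte offset of every position where \b matches.
// Each match is an empty string, so only the offsets are interesting.
preg_match_all('/\b/', $sample, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

echo implode(',', $offsets), "\n"; // prints 0,2,4,6
```

Offsets 0 and 6 are the start and end of the string next to a word character, 2 and 4 are transitions between a word and a non-word character, and offset 3 (between the comma and the space) is not a boundary because both neighbours are non-word characters.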
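The "offending" symbol is U+0670 ARABIC LETTER SUPERSCRIPT ALEF, which belongs to \p{Mn} (the nonspacing mark Unicode category) and is thus a non-word symbol. The trailing \b would match only if it were preceded by a char belonging to \w (letter, digit, _). Here it sits between ٰ and a space, both non-word characters, so there is no word boundary at that position and the whole pattern fails to match.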
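To check how a given PHP build classifies each character, a small sketch like the following may help. The function name describe_chars is hypothetical (it is not part of PHP). The script prints its findings rather than assuming them, since whether \w accepts a nonspacing mark may differ between PCRE versions.

```php
<?php
// Hypothetical helper (not part of PHP): for each code point in $text, report
// whether it is a nonspacing mark (\p{Mn}) and whether \w accepts it.
function describe_chars(string $text): array
{
    // Splitting on the empty pattern with /u yields one entry per code point.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $rows = [];
    foreach ($chars as $ch) {
        $rows[] = [
            // json_encode() escapes non-ASCII characters as \uXXXX by default.
            'char' => trim(json_encode($ch), '"'),
            'mark' => preg_match('/^\p{Mn}$/u', $ch) === 1,
            'word' => preg_match('/^\w$/u', $ch) === 1,
        ];
    }
    return $rows;
}

// The word from the question that fails to match.
foreach (describe_chars('موسیٰ') as $row) {
    printf("%s  Mn=%s  \\w=%s\n",
        $row['char'],
        $row['mark'] ? 'yes' : 'no',
        $row['word'] ? 'yes' : 'no');
}
```

If the last line of output reports Mn=yes together with \w=no, the trailing \b in the original pattern cannot match before a space.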
Use unambiguous word boundaries instead, so that the search phrase matches only if it is not preceded or followed by word chars:
$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = preg_replace('/(?<!\w)موسیٰ(?!\w)/u', 'Musa', $str);
$str = preg_replace('/(?<!\w)سنا(?!\w)/u', 'suna', $str);
echo $str; // => کس نے Musa کے بارے میں suna ہے؟
The (?<!\w) is a negative lookbehind making sure there is no word char immediately before the subsequent consuming pattern, and (?!\w) is a negative lookahead that makes sure there is no word char immediately after the preceding consuming pattern.
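When several words need replacing, the two lookarounds can be wrapped in a small helper. This is only a sketch: the function name replace_whole_word is hypothetical, and the use of preg_quote() and preg_replace_callback() is an addition that does not appear in the original answer.

```php
<?php
// Hypothetical helper (not part of PHP): replaces $word only where it is not
// directly preceded or followed by a \w character.
function replace_whole_word(string $subject, string $word, string $replacement): string
{
    // preg_quote() escapes any regex metacharacters in the search word.
    $pattern = '/(?<!\w)' . preg_quote($word, '/') . '(?!\w)/u';

    // A callback keeps the replacement literal, so "$1" or "\\1" in it is
    // not interpreted as a backreference.
    $result = preg_replace_callback($pattern, fn () => $replacement, $subject);

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $subject;
}

$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
foreach (['موسیٰ' => 'Musa', 'سنا' => 'suna'] as $word => $replacement) {
    $str = replace_whole_word($str, $word, $replacement);
}
echo $str, "\n"; // => کس نے Musa کے بارے میں suna ہے؟
```

Because the lookarounds only test the characters outside the search word, the result does not depend on whether the word itself ends in a letter or in a mark such as ٰ.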