Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.
Short answer: yes.
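The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length. There are three different positions that qualify as word boundaries: before the first character in the string, if the first character is a word character; after the last character in the string, if the last character is a word character; and between two characters in the string, where one is a word character and the other is not.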
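To make the zero-length nature of \b concrete, here is a minimal sketch that lists the byte offsets at which \b matches. The sample string and variable names are made up for illustration, and the subject is plain ASCII so the default byte-oriented matching is enough.

```php
<?php
// Hypothetical ASCII sample: two words separated by one space.
$subject = 'hi you';

// PREG_OFFSET_CAPTURE returns each match as [matched text, byte offset].
// Every \b match is an empty string, so only the offsets are interesting.
preg_match_all('/\b/', $subject, $m, PREG_OFFSET_CAPTURE);

// Collect just the offsets of the boundary positions.
$offsets = array_column($m[0], 1);

print_r($offsets);
```

The offsets should fall at the start of the string, on either side of the space, and at the end of the string. No characters are consumed by any of the matches.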
Try [\pL_] - see the reference at http://php.net/manual/en/regexp.reference.unicode.php
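As a rough sketch of how [\pL_] could be used for tokenizing, the snippet below pulls runs of letters and underscores out of a mixed-script string. The sample text and variable names are assumptions for illustration. It also assumes the input is valid UTF-8, which is why the pattern carries the u modifier; without it PCRE works on raw bytes.

```php
<?php
// Hypothetical sample text mixing Latin, Hebrew, and Cyrillic words.
$text = 'Hello שלום мир foo_bar 42';

// \pL matches any Unicode letter; the u modifier tells PCRE to treat
// both the pattern and the subject as UTF-8 rather than single bytes.
preg_match_all('/[\pL_]+/u', $text, $matches);

// $matches[0] holds the full matches, i.e. the tokens.
$tokens = $matches[0];

print_r($tokens);
```

With this input the tokens are Hello, שלום, мир and foo_bar; the digits are skipped because \pL covers letters only. If digits or combining marks (for example Hebrew vowel points) should count as part of a token, adding \pN or \pM to the character class is one option.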