 

Regex for word characters in any language

Tags: regex, php

Testing the PHP regex engine, I see that it treats only [0-9A-Za-z_] as word characters. Letters from non-ASCII scripts, such as Hebrew, are not matched by \w. Is there a PHP or Perl regex escape sequence that matches a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!

Note that this is not for security filtering but rather for tokenizing a text.

dotancohen asked Sep 27 '12 16:09





1 Answer

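Try [\pL_], which matches any Unicode letter or an underscore. For UTF-8 input, add the u modifier to the pattern so that the engine reads the subject as characters rather than bytes. See the reference at

http://php.net/manual/en/regexp.reference.unicode.php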
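
A minimal sketch of how the suggestion above could be used for tokenizing, assuming the input is valid UTF-8. The function name tokenize_words is hypothetical, and adding \p{N} (Unicode numbers) alongside \p{L} is an assumption made here so that the class behaves more like \w; drop it if digits should not count as word characters.

```php
<?php
// Hypothetical helper (not from the original answer): extract word tokens
// from UTF-8 text in any script.
function tokenize_words(string $text): array
{
    // \p{L} matches any Unicode letter, \p{N} any Unicode number.
    // The trailing "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('/[\p{L}\p{N}_]+/u', $text, $matches);

    // preg_match_all() returns false on failure, e.g. malformed UTF-8.
    if ($count === false) {
        return [];
    }

    // $matches[0] holds the full-pattern matches in order of appearance.
    return $matches[0];
}

print_r(tokenize_words('Hello שלום, мир_2 world!'));
```

For the sample string this should yield four tokens: Hello, שלום, мир_2 and world. Without the u modifier the subject is scanned byte by byte, so multibyte letters would not be matched as intended.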

spiralx answered Oct 13 '22 21:10
