I want to use a word boundary in a regex to match some Unicode text. Python's regex treats non-ASCII letters as non-word characters, so \b matches right next to them, as shown here:
>>> re.search(r"\by\b","üyü")
<_sre.SRE_Match object at 0x02819E58>
>>> re.search(r"\by\b","ğyğ")
<_sre.SRE_Match object at 0x028250C8>
>>> re.search(r"\by\b","uyu")
>>>
What should I do to make \b treat Unicode letters as word characters, so that it no longer matches next to them?
A word boundary is a zero-width test between two characters. To pass the test, there must be a word character on one side, and a non-word character on the other side. It does not matter which side each character appears on, but there must be one of each.
The word boundary \b matches positions where one side is a word character (usually a letter, digit, or underscore, though the exact set varies across engines) and the other side is not a word character (for instance, the beginning of the string or a space character).
A word boundary, in most regex dialects, is a position between \w and \W (a non-word character), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_] in the ASCII-only definition). So, in the string "-12", it matches before the 1 and after the 2.
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”.
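To make the definitions above concrete, here is a minimal sketch that lists the positions where \b matches. It assumes Python 3, and the helper name boundary_positions is made up for this illustration; it is not part of the re module.

```python
import re

def boundary_positions(text, flags=0):
    """Return the indices at which the zero-width \\b assertion matches."""
    # re.finditer yields one empty match per position where \b succeeds.
    return [m.start() for m in re.finditer(r"\b", text, flags)]

# "-12": \b matches before the 1 (index 1) and after the 2 (index 3).
print(boundary_positions("-12"))

# "uyu": all letters are word characters, so only the string edges match.
print(boundary_positions("uyu"))

# With ASCII-only word characters, the non-ASCII letters around "y" count
# as non-word characters, so \b matches on both sides of the "y".
print(boundary_positions("\u00fcy\u00fc", re.ASCII))  # the text is "üyü"
```

The last call reproduces the situation in the question: when only [0-9A-Za-z_] counts as a word character, the y in "üyü" is a one-letter "word" of its own.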
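Use the re.UNICODE flag, which makes \w, \W, \b and \B follow the Unicode character properties, so ü and ğ count as word characters: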
>>> re.search(r"\by\b","üyü", re.UNICODE)
>>>
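For readers on Python 3, here is a hedged sketch of the same check. It assumes str (not bytes) patterns, which are Unicode-aware by default in Python 3, so re.UNICODE is redundant there and re.ASCII is what brings back the behaviour from the question. The variable names are only for illustration.

```python
import re

pattern = r"\by\b"
text = "\u00fcy\u00fc"  # the string "üyü", written with escapes to avoid encoding issues

# Default in Python 3: ü is a word character, so there is no boundary around y.
default_match = re.search(pattern, text)

# Passing re.UNICODE explicitly gives the same result as the default.
unicode_match = re.search(pattern, text, re.UNICODE)

# re.ASCII restricts word characters to [0-9A-Za-z_], so y is matched again.
ascii_match = re.search(pattern, text, re.ASCII)

print(default_match, unicode_match, ascii_match)
```

If the text arrives as bytes rather than str, decoding it to str before matching is usually the safer route, since Unicode-aware matching applies to str patterns.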