Word boundary to use in unicode text for Python regex

I want to use a word boundary in a regex to match some Unicode text. Python's regex treats Unicode letters as non-word characters, so \b matches right next to them, as shown here:

>>> re.search(r"\by\b","üyü")
<_sre.SRE_Match object at 0x02819E58>

>>> re.search(r"\by\b","ğyğ")
<_sre.SRE_Match object at 0x028250C8>

>>> re.search(r"\by\b","uyu")
>>>

What should I do so that \b treats Unicode letters as word characters and no longer matches next to them?

Mert Nuhoglu asked Oct 15 '13 07:10


1 Answer

Use re.UNICODE, which makes \w (and therefore \b) treat Unicode letters as word characters:

>>> re.search(r"\by\b","üyü", re.UNICODE)
>>> 
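
For readers on Python 3, here is a minimal sketch of the same check. It assumes Python 3, where str patterns are already Unicode-aware (re.UNICODE is the default) and re.ASCII restores the ASCII-only behaviour seen in the question. The helper name has_whole_word is hypothetical and not part of the re module. On Python 2 the re.UNICODE flag is generally paired with unicode strings such as u"üyü".

```python
import re


def has_whole_word(word, text, ascii_only=False):
    """Return True if `word` occurs in `text` delimited by word boundaries.

    Hypothetical helper for illustration only.
    """
    # In Python 3, str patterns use Unicode rules for \w and \b by default;
    # re.ASCII limits \w to [a-zA-Z0-9_], as in the question's transcript.
    flags = re.ASCII if ascii_only else 0
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags) is not None


# Unicode-aware (default): "ü" and "ğ" are word characters, so the inner
# "y" is not delimited by word boundaries.
print(has_whole_word("y", "üyü"))
print(has_whole_word("y", "ğyğ"))

# ASCII-only: "ü" is a non-word character, so \b matches around "y".
print(has_whole_word("y", "üyü", ascii_only=True))
```

With the default flags the first two calls report no match, while the ascii_only=True call reproduces the unwanted match from the question.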
Michael Brennan answered Nov 14 '22 22:11