
How to write a word boundary inside a character class in Python without losing its meaning? I wish to add underscore (_) to the definition of word boundary (\b)

Tags:

python

regex

I am aware that the definition of a word boundary is (?<!\w)(?=\w)|(?<=\w)(?!\w), and I wish to (optionally) include the underscore in that definition as well.
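As a quick sanity check of that definition, the sketch below compares the positions where \b matches with the positions where the lookaround expansion matches. The sample string is made up for illustration, and the check covers only this one input, not a general proof.

```python
import re

# Hypothetical sample string, chosen only for illustration.
s = "some_word here"

# Positions where the built-in word boundary matches.
builtin_positions = [m.start() for m in re.finditer(r"\b", s)]

# Positions where the lookaround-based definition matches.
expanded_positions = [
    m.start() for m in re.finditer(r"(?<!\w)(?=\w)|(?<=\w)(?!\w)", s)
]

print(builtin_positions)
print(expanded_positions)
```

Note that neither list contains position 4 or 5 (around the underscore), because _ counts as a word character.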

One way of doing it is to simply modify the definition, so that the new one would be (_)?((?<!\w)(?=\w)|(?<=\w)(?!\w)), but I don't wish to use such a long expression.

An easy approach would be to write the word boundary inside a character class; adding an underscore to the class would then be trivial, just like [\b-]. The problem is that \b inside a character class, i.e. [\b], means the backspace character, not a word boundary.
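To illustrate the problem, here is a minimal sketch using the underscore variant of such a class, [\b_]. The input strings are invented for the example. Inside the brackets, \b is read as the backspace character (\x08), so the class matches a literal backspace or underscore and never a zero-width boundary.

```python
import re

# Inside a character class, \b means backspace (\x08), not a word boundary.
char_class = re.compile(r"[\b_]")

# This (non-raw) string contains an underscore and a real backspace character.
with_backspace = "foo_bar\bbaz"
hits = char_class.findall(with_backspace)
print(hits)

# A string with plenty of word boundaries but no backspace or underscore.
no_hits = char_class.findall("foo bar baz")
print(no_hits)
```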

Please tell me the solution, i.e. how to put \b inside a character class without losing its original meaning.

Aakash Goel asked Oct 29 '22 14:10


1 Answer

You may use lookarounds:

(?:\b|(?<=_))word(?=\b|_)
^^^^^^^^^^^^^    ^^^^^^^^

Here, (?:\b|(?<=_)) is a non-capturing group matching either a word boundary or a position preceded by _, and (?=\b|_) is a positive lookahead requiring either a word boundary or a _ character at the current position.

Unfortunately, Python re won't allow (?<=\b|_), because a lookbehind pattern must be of fixed width (otherwise you get a "look-behind requires fixed-width pattern" error). Here \b is zero-width while _ is one character wide, so the two alternatives differ in length.
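A small sketch of that limitation, assuming the standard-library re module (the third-party regex package may behave differently). The variable names are only illustrative.

```python
import re

# Try to compile the variable-width lookbehind; re is expected to reject it.
try:
    re.compile(r"(?<=\b|_)word")
    compiled = True
except re.error as exc:
    compiled = False
    print("re rejected the pattern:", exc)

# Workaround: keep the alternation outside the lookbehind, so the only
# lookbehind left, (?<=_), has a fixed width of one character.
workaround = re.compile(r"(?:\b|(?<=_))word")
print(bool(workaround.search("some_word")))
```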

A Python demo:

import re

# Match "word" delimited on each side by a word boundary or an underscore
rx = r"(?:\b|(?<=_))word(?=\b|_)"
s = "some_word_here and a word there"
print(re.findall(rx, s))  # ['word', 'word']

An alternative solution is to use custom word boundaries, (?<![^\W_]) on the left and (?![^\W_]) on the right:

rx = r"(?<![^\W_])word(?![^\W_])"

Here, [^\W_] matches any word character other than _ (that is, a letter or digit). The (?<![^\W_]) negative lookbehind fails the match if such a character immediately precedes the current position, so it requires the start of the string, a non-word character, or _ before the search word. Likewise, the (?![^\W_]) negative lookahead fails the match if the next character is a letter or digit, so it requires the end of the string, a non-word character, or _ after the search word.
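As a rough check, the sketch below runs both patterns from this answer over a handful of made-up sample strings. On these inputs they agree, though a few samples are not a proof of equivalence.

```python
import re

# Both patterns are taken from the answer; the sample strings are invented.
lookaround = re.compile(r"(?:\b|(?<=_))word(?=\b|_)")
custom = re.compile(r"(?<![^\W_])word(?![^\W_])")

samples = ["some_word_here", "a word there", "swordfish", "word2", "_word_"]

# Map each sample to (lookaround matched?, custom matched?).
results = {
    s: (bool(lookaround.search(s)), bool(custom.search(s))) for s in samples
}

for s, (a, b) in results.items():
    print(f"{s!r}: lookaround={a}, custom={b}")
```

"swordfish" and "word2" are rejected by both patterns because a letter or digit touches the search word, while the underscore-delimited samples are accepted.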

Wiktor Stribiżew answered Nov 15 '22 05:11