I am aware that a word boundary is defined as (?<!\w)(?=\w)|(?<=\w)(?!\w)
and I wish to optionally include the underscore in that definition too.
One way of doing it is to simply modify the definition,
so that the new one would be (_)?((?<!\w)(?=\w)|(?<=\w)(?!\w))
, but I don't wish to use such a long expression.
An easier approach could be this:
if I could write a word boundary inside a character class, then adding an underscore would be very easy, just like [\b_]
, but the problem is that putting \b
inside a character class, i.e. [\b]
, means a backspace character, not a word boundary.
Please tell me how to put \b
inside a character class without losing its original meaning.
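For reference, the backspace behaviour described above can be checked with a small sketch in Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

# Inside a character class, \b is the backspace character (\x08).
in_class = re.findall(r"[\b]", "a\x08b")

# Outside a character class, \b is a zero-width word boundary,
# so it matches empty strings at the start and end of "ab".
outside_class = re.findall(r"\b", "ab")

print(in_class)       # ['\x08']
print(outside_class)  # ['', '']
```

A character class always matches exactly one character, so a zero-width assertion such as a word boundary cannot be expressed inside it.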
You may use lookarounds:
(?:\b|(?<=_))word(?=\b|_)
^^^^^^^^^^^^^    ^^^^^^^^
Here, (?:\b|(?<=_))
is a non-capturing group matching either a word boundary or a location preceded by _
, and (?=\b|_)
is a positive lookahead matching either a word boundary or a _
symbol.
Unfortunately, Python re
won't allow the shorter (?<=\b|_)
because a lookbehind pattern must be of fixed width, and \b is zero-width while _ is one character wide (otherwise, you will get a look-behind requires fixed-width pattern
error).
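A minimal sketch of that limitation, assuming the standard library re module on current CPython versions. The helper name compiles is hypothetical and only there to capture the error message.

```python
import re

def compiles(pattern):
    """Return (True, None) if the pattern compiles, else (False, error text)."""
    try:
        re.compile(pattern)
        return True, None
    except re.error as exc:
        return False, str(exc)

# \b and _ have different widths, so they cannot share one lookbehind.
ok_mixed, msg_mixed = compiles(r"(?<=\b|_)word")

# Keeping \b outside and only _ inside the lookbehind compiles fine.
ok_split, msg_split = compiles(r"(?:\b|(?<=_))word")

print(ok_mixed, msg_mixed)
print(ok_split, msg_split)
```

The third-party regex package is more permissive about lookbehinds, but that is outside what re supports.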
A Python demo:
import re
rx = r"(?:\b|(?<=_))word(?=\b|_)"
s = "some_word_here and a word there"
print(re.findall(rx, s))  # ['word', 'word']
An alternative solution is to use custom word boundaries like (?<![^\W_])
/ (?![^\W_])
:
rx = r"(?<![^\W_])word(?![^\W_])"
The (?<![^\W_])
negative lookbehind fails the match if the character immediately before the search word is a letter or digit, that is, a word char other than _
(so it requires the start of string, a non-word char, or _
before the search word), and the (?![^\W_])
negative lookahead fails the match if the character immediately after the search word is a letter or digit
(that is, it requires the end of string, a non-word char, or _
after it).
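As a quick sanity check, the sketch below runs both patterns from this answer over a few made-up sample strings and compares the results. The sample list and variable names are assumptions for illustration only.

```python
import re

# The two patterns from this answer.
lookaround = re.compile(r"(?:\b|(?<=_))word(?=\b|_)")
custom = re.compile(r"(?<![^\W_])word(?![^\W_])")

samples = [
    "some_word_here",  # underscores on both sides
    "a word there",    # ordinary word boundaries
    "_word_",
    "swordfish",       # letters on both sides
    "word2",           # digit right after
    "2word",           # digit right before
    "word",            # whole string
]

# Map each sample to (lookaround found a match, custom found a match).
results = {
    s: (bool(lookaround.search(s)), bool(custom.search(s))) for s in samples
}

for s, (a, b) in results.items():
    print(f"{s!r:18} lookaround={a} custom={b}")
```

On these samples the two patterns agree: both match when the word is delimited by a string edge, a non-word char, or an underscore, and neither matches when a letter or digit is adjacent.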