
How to make word boundary \b not match on dashes

Tags: python, regex

I simplified my code to the specific problem I am having.

import re
pattern = re.compile(r'\bword\b')
result = pattern.sub(lambda x: "match", "-word- word")

I am getting

'-match- match'

but I want

'-word- match'

edit:

Or for the string "word -word-"

I want

"match -word-"
alpalalpal asked Sep 25 '16 08:09



2 Answers

What you need is a negative lookbehind.

import re
pattern = re.compile(r'(?<!-)\bword\b')
result = pattern.sub(lambda x: "match", "-word- word")  # '-word- match'

To cite the documentation:

(?<!...) Matches if the current position in the string is not preceded by a match for ....

So this only matches if the word boundary \b is not preceded by a hyphen (-).

If you also need to rule out a hyphen directly after the word, add a negative lookahead, which looks like this: (?!-). The complete regular expression then becomes: (?<!-)\bword(?!-)\b
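As a quick check, here is a minimal sketch that runs both patterns against the strings from the question. The input "word- word" is not from the question; it is an assumed extra case added only to show where the lookahead makes a difference.

```python
import re

# Lookbehind only: rejects a hyphen immediately before the word.
before_only = re.compile(r'(?<!-)\bword\b')
# Lookbehind plus lookahead: also rejects a hyphen immediately after the word.
both_sides = re.compile(r'(?<!-)\bword(?!-)\b')

print(before_only.sub("match", "-word- word"))  # -word- match
print(before_only.sub("match", "word -word-"))  # match -word-
print(before_only.sub("match", "word- word"))   # match- match
print(both_sides.sub("match", "word- word"))    # word- match
```

The lookbehind alone is enough for both strings in the question, because in each of them the unwanted occurrence has a hyphen in front of it.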

Matthias answered Oct 01 '22 10:10


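\b matches at the boundary between a word character (\w, roughly [a-zA-Z0-9_]) and anything else, so a dash creates a word boundary just as a space does. To require whitespace, or the start or end of the string, on both sides instead, surround word with negative lookarounds that reject any non-whitespace character before or after it: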

re.compile(r'(?<!\S)word(?!\S)')
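A short sketch of how this pattern behaves on the strings from the question. The input "word, word" is an assumed extra case, added to show that this approach is stricter than \b: any adjacent non-whitespace character, not only a dash, blocks the match.

```python
import re

# Match "word" only when it is delimited by whitespace or the string edges.
pattern = re.compile(r'(?<!\S)word(?!\S)')

print(pattern.sub("match", "-word- word"))  # -word- match
print(pattern.sub("match", "word -word-"))  # match -word-
print(pattern.sub("match", "word, word"))   # word, match
```

If "word" followed by a comma or period should still count as a match, the dash-specific lookarounds are probably the better fit.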
revo answered Oct 01 '22 10:10