 

Two word boundaries (\b) to isolate a single word

Tags: python, regex

I am trying to match the word that appears immediately after a number - in the sentence below, it is the word "meters".

The tower is 100 meters tall.

Here's the pattern that I tried which didn't work:

\d+\s*(\b.+\b)

But this one did:

\d+\s*(\w+)

The first incorrect pattern matched this:

The tower is 100 meters tall. (matched text: "100 meters tall"; captured group: "meters tall")

I didn't want the word "tall" to be matched. I expected the following behavior:

\d+ match one or more occurrences of a digit
\s* match zero or more whitespace characters
( start new capturing group
\b find the word/non-word boundary
.+ match 1 or more of everything except new line
\b find the next word/non-word boundary
) stop capturing group

The problem is I don't know tiddly-twat about regex, and I am very much a noob as a noob can be. I am practicing by making my own problems and trying to solve them - this is one of them. Why didn't the match stop at the second break (\b)?


This is Python flavored
Here's the regex101 test link of the above regex.

Renae Lider asked Jan 29 '26 22:01



2 Answers

It didn't stop because + is greedy by default; you want +? for a non-greedy match.

A concise explanation: * and + are greedy quantifiers, meaning they match as much as they can while still allowing the remainder of the regular expression to match.

To make these quantifiers non-greedy, follow them with ?. Then *? still means "zero or more" and +? still means "one or more", but each matches as few characters as possible.
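To illustrate the difference, here is a minimal sketch using Python 3's standard re module on the sentence from the question. The variable names are only illustrative; the two patterns are the asker's original and the same pattern with +? swapped in.

```python
import re

sentence = "The tower is 100 meters tall."

# Greedy: .+ grabs as much as possible, then backtracks to the LAST
# position where \b can still match (between "l" and ".").
greedy = re.search(r"\d+\s*(\b.+\b)", sentence)

# Lazy: .+? grows one character at a time and stops at the FIRST
# position where \b matches (between "s" and the following space).
lazy = re.search(r"\d+\s*(\b.+?\b)", sentence)

print(greedy.group(1))  # meters tall
print(lazy.group(1))    # meters
```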

Also, a word boundary \b matches a position where one side is a word character (a letter, digit, or underscore; in Python 3 this includes Unicode letters and digits by default) and the other side is not a word character, or is the start or end of the string. I wouldn't put \b around . unless you're sure what lies between the boundaries.
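As a rough illustration of where those boundaries fall, the sketch below lists every position at which \b matches in the tail of the question's sentence. It assumes Python 3's re module; the names text and positions are made up for this example.

```python
import re

text = "100 meters tall."

# \b is zero-width, so each match is an empty string; we only care
# about the index at which it occurs.
positions = [m.start() for m in re.finditer(r"\b", text)]

print(positions)  # [0, 3, 4, 10, 11, 15]

# Show the text on each side of every boundary.
for pos in positions:
    print(repr(text[:pos]), "|", repr(text[pos:]))
```

There are several boundaries after "meters" (at 10, 11, and 15), so a greedy .+ is free to run on to the last one, which is why "tall" ended up in the match.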

hwnd answered Feb 01 '26 14:02



It matches both words because . matches (nearly) any character, including the space, and because + is greedy, so it matches as much as it can. If you use \w instead of ., it works, because \w matches only word characters (a-zA-Z0-9_).
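A small sketch of that point, assuming Python 3's re module and the question's sentence (the helper variable names are hypothetical): . accepts a space while \w does not, so \w+ has to stop at the end of the first word.

```python
import re

sentence = "The tower is 100 meters tall."

# . matches a space, \w does not.
dot_matches_space = re.fullmatch(r".", " ") is not None
word_matches_space = re.fullmatch(r"\w", " ") is not None
print(dot_matches_space, word_matches_space)  # True False

# With \w+ the capture cannot cross the space after "meters".
match = re.search(r"\d+\s*(\w+)", sentence)
print(match.group(1))  # meters
```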

m.cekiera answered Feb 01 '26 12:02



