
Regex word boundary expressions

Tags: c#, regex

Say for example I have the following string "one two(three) (three) four five" and I want to replace "(three)" with "(four)" but not within words. How would I do it?

Basically I want to do a regex replace and end up with the following string:

"one two(three) (four) four five" 

I have tried the following regex but it doesn't work:

@"\b\(three\)\b" 

Basically I am writing some search and replace code and am giving the user the usual options to match case, match whole word etc. In this instance the user has chosen to match whole words but I don't know what the text being searched for will be.

CroweMan asked Aug 12 '10 13:08



2 Answers

Your problem stems from a misunderstanding of what \b actually means. Admittedly, it is not obvious.

The reason \b\(three\)\b doesn’t match the threes in your input string is the following:

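  • \b means: the boundary between a word character and a non-word character (the start and end of the string count as non-word characters for this purpose).
  • Letters (e.g. a-z), digits and the underscore are considered word characters.
  • Punctuation marks such as ( and whitespace are considered non-word characters.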

Here is your input string again, stretched out a bit, and I’ve marked the places where \b matches:

     o n e   t w o ( t h r e e )   ( t h r e e )   f o u r   f i v e
    ↑     ↑ ↑     ↑ ↑         ↑     ↑         ↑   ↑       ↑ ↑       ↑

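As you can see here, there is a \b between “two” and “(three)”, but not before the second “(three)”, because the space and the ( are both non-word characters. Likewise, there is no \b after either ), because each one is followed by a space.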
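To make this concrete, here is a minimal sketch that runs the pattern from the question against two inputs. It assumes C# 9 or later (top-level statements); the second input, "two(three)four", is made up for illustration and is not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// The asker's pattern: \b on both sides of the literal "(three)".
string pattern = @"\b\(three\)\b";

// No match: each ")" is followed by a space, so there is no boundary after it.
bool inQuestion = Regex.IsMatch("one two(three) (three) four five", pattern);

// Match: "(" is preceded by the letter "o" and ")" is followed by the letter "f".
bool glued = Regex.IsMatch("two(three)four", pattern);

Console.WriteLine(inQuestion); // False
Console.WriteLine(glued);      // True
```

In other words, with punctuation at both ends the pattern only matches when the text is glued to word characters on both sides, which is the opposite of what a whole-word search is meant to do.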

The moral of the story? “Whole-word search” doesn’t really make much sense if what you’re searching for is not just a word (a string of letters). Since you have punctuation characters (parentheses) in your search string, it is not as such a “word”. If you searched for a word consisting only of word characters, then \b would do what you expect.

You can, of course, use a different regex that matches the string only if it is surrounded by whitespace or occurs at the beginning or end of the string:

(^|\s)\(three\)(\s|$) 

However, the problem with this is, of course, that if you search for “three” (without the parentheses), it won’t find the one in “(three)” because it doesn’t have spaces around it, even though it is actually a whole word.
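If you do go with the whitespace-based pattern for a replace, keep in mind that it also matches the whitespace on either side. A rough sketch of two ways to deal with that, assuming C# 9 or later (top-level statements): put the captured whitespace back with $1 and $2, or use lookarounds. The lookaround pattern is my own variant, not something the pattern above requires.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "one two(three) (three) four five";

// The whitespace-based pattern also matches the surrounding whitespace, so a
// plain replacement would swallow it; $1 and $2 put the captured text back.
string withGroups = Regex.Replace(input, @"(^|\s)\(three\)(\s|$)", "$1(four)$2");

// Lookaround variant (an assumption, not from the pattern above): require
// "no non-whitespace character" on either side without consuming anything.
string withLookarounds = Regex.Replace(input, @"(?<!\S)\(three\)(?!\S)", "(four)");

Console.WriteLine(withGroups);      // one two(three) (four) four five
Console.WriteLine(withLookarounds); // one two(three) (four) four five
```

The two differ when occurrences are separated by a single space, as in "(three) (three)": the group-based pattern consumes that space with the first match and so misses the second one, whereas the lookaround version finds both.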

I think most text editors (including Visual Studio) will use \b only if your search string actually starts and/or ends with a word character:

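    // Escape the search text so that characters like ( and ) are matched literally.
    var pattern = Regex.Escape(searchString);
    // Require a word boundary only on the sides where the search text has a word character.
    if (Regex.IsMatch(searchString, @"^\w"))
        pattern = @"\b" + pattern;
    if (Regex.IsMatch(searchString, @"\w$"))
        pattern = pattern + @"\b";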

That way they will find “(three)” even if you select “whole words only”.
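Put together as a search-and-replace routine with the usual options, that might look roughly like the following. This is only a sketch, assuming C# 9 or later (top-level statements); ReplaceText and its parameter names are hypothetical, not taken from any editor's actual implementation.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces searchString with replacement, optionally as
// a whole word and optionally case-sensitively.
static string ReplaceText(string text, string searchString, string replacement,
                          bool wholeWord, bool matchCase)
{
    if (string.IsNullOrEmpty(searchString)) return text;

    string pattern = Regex.Escape(searchString);
    if (wholeWord)
    {
        // Add \b only on the sides where the search text has a word character.
        if (Regex.IsMatch(searchString, @"^\w")) pattern = @"\b" + pattern;
        if (Regex.IsMatch(searchString, @"\w$")) pattern += @"\b";
    }

    RegexOptions options = matchCase ? RegexOptions.None : RegexOptions.IgnoreCase;
    // "$" is special in replacement strings, so double it to keep it literal.
    return Regex.Replace(text, pattern, replacement.Replace("$", "$$"), options);
}

string input = "one two(three) (three) four five";

Console.WriteLine(ReplaceText(input, "three", "3", wholeWord: true, matchCase: true));
// one two(3) (3) four five
Console.WriteLine(ReplaceText(input, "(three)", "(four)", wholeWord: true, matchCase: true));
// one two(four) (four) four five
Console.WriteLine(ReplaceText("threefold three", "three", "3", wholeWord: true, matchCase: true));
// threefold 3
```

Note that for "(three)" no boundary check is applied at all, so the occurrence in "two(three)" is replaced as well. If that one has to be left alone, a whitespace-based check is needed on the sides where the search text begins or ends with punctuation.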

Timwi answered Sep 16 '22 14:09


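Here is a simple snippet you may be interested in. Note that it only works as a whole-word match when the search text begins and ends with a word character:

    // Escape the search text so regex metacharacters are matched literally.
    string pattern = @"\b" + Regex.Escape(find) + @"\b";
    // Regex.Replace returns the new string; it does not modify stringToSearch.
    string result = Regex.Replace(stringToSearch, pattern, replace, RegexOptions.IgnoreCase);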

Source code: snip2code - C#: Replace an exact word in a sentence

Dominique Terrs answered Sep 19 '22 14:09