Match a whole word in a string using dynamic regex

Tags:

python

regex

I want to check whether a word occurs in a sentence using regex. Words are separated by spaces, but may have punctuation on either side. If the word is in the middle of the string, the following pattern works: it prevents part-words from matching and allows punctuation on either side of the word.

match_middle_words = r" [^a-zA-Z\d ]{0,}" + word + r"[^a-zA-Z\d ]{0,} "

This won't, however, match the first or last word, since there is no leading/trailing space. For these cases, I have also been using:

match_starting_word = r"^[^a-zA-Z\d]{0,}" + word + r"[^a-zA-Z\d ]{0,} "
match_end_word = r" [^a-zA-Z\d ]{0,}" + word + r"[^a-zA-Z\d]{0,}$"

and then combining them with

match_string = match_middle_words + "|" + match_starting_word + "|" + match_end_word

Is there a simple way to avoid needing three match terms? Specifically, is there a way of specifying 'either a space or the start of the string' (i.e. "^") and, similarly, 'either a space or the end of the string' (i.e. "$")?

kyrenia asked May 01 '15 22:05




1 Answer

Why not use a word boundary?

match_string = r'\b' + word + r'\b'
match_string = r'\b{}\b'.format(word)
match_string = rf'\b{word}\b'          # f-string, Python 3.6+ required

If you have a list of words (say, in a words variable) to be matched as a whole word, use

match_string = r'\b(?:{})\b'.format('|'.join(words))
match_string = rf'\b(?:{"|".join(words)})\b'         # f-string, Python 3.6+ required

This ensures the word matches only when it is not adjacent to other word characters (letters, digits, or underscores), so punctuation on either side is still allowed. Note that \b also matches at the start and end of the string, so there is no need for three alternatives.
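As a quick check that a single \b pattern covers the start, middle, and end cases from the question, here is a minimal sketch. The helper name contains_word and the sample sentences are hypothetical, and it assumes the words contain only letters, digits, or underscores.

```python
import re

def contains_word(word, sentence):
    """Return True if `word` occurs in `sentence` as a whole word.

    Hypothetical helper; assumes `word` contains only word characters.
    """
    pattern = r'\b' + word + r'\b'
    return re.search(pattern, sentence) is not None

print(contains_word("cat", "cat sat here"))      # True: word at the start
print(contains_word("cat", "the (cat), sat"))    # True: punctuation on both sides
print(contains_word("cat", "here sat the cat"))  # True: word at the end
print(contains_word("cat", "concatenate"))       # False: part-word only

# The same idea with a list of words, joined into one alternation.
words = ["cat", "dog"]
multi_pattern = r'\b(?:{})\b'.format('|'.join(words))
found = re.findall(multi_pattern, "A dog, a cat, and a dogma.")
print(found)  # ['dog', 'cat']
```

The non-capturing group (?:...) keeps both \b anchors applied to every alternative, not just the first and last one.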

Sample code:

import re
strn = "word hereword word, there word"
search = "word"
print(re.findall(r"\b" + search + r"\b", strn))

And we found our 3 matches:

['word', 'word', 'word']

NOTE ON "WORD" BOUNDARIES

When the "words" can contain arbitrary characters, you should re.escape them before inserting them into the regex pattern:

match_string = r'\b{}\b'.format(re.escape(word)) # a single escaped "word" string passed
match_string = r'\b(?:{})\b'.format("|".join(map(re.escape, words))) # words list is escaped
match_string = rf'\b(?:{"|".join(map(re.escape, words))})\b' # Same as above for Python 3.6+
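To illustrate why escaping matters, here is a small sketch with a made-up search term and sentence. Without re.escape, the dot in "1.5" is treated as "any character" and produces a false match.

```python
import re

text = "version 1x5 and version 1.5"  # hypothetical sample text
word = "1.5"

# Unescaped: "." is a regex metacharacter, so "1x5" matches too.
unescaped = re.findall(r'\b{}\b'.format(word), text)

# Escaped: "." is matched literally.
escaped = re.findall(r'\b{}\b'.format(re.escape(word)), text)

print(unescaped)  # ['1x5', '1.5']
print(escaped)    # ['1.5']
```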

If the words to be matched as whole words may start or end with special (non-word) characters, \b won't work reliably, because it requires a word character on one side of the boundary. Use unambiguous word boundaries instead:

match_string = r'(?<!\w){}(?!\w)'.format(re.escape(word))
match_string = r'(?<!\w)(?:{})(?!\w)'.format("|".join(map(re.escape, words))) 
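A short sketch of the difference, using "C++" as a hypothetical search term that ends in non-word characters. With \b the trailing boundary can never be satisfied between "+" and a space, whereas the lookarounds only require that no word character is adjacent.

```python
import re

text = "I use C++ and C daily"  # hypothetical sample text
escaped = re.escape("C++")

# \b after "+" needs a word character next, but a space follows: no match.
with_b = re.findall(r'\b{}\b'.format(escaped), text)

# Lookarounds only forbid an adjacent word character, so "C++" is found.
with_lookarounds = re.findall(r'(?<!\w){}(?!\w)'.format(escaped), text)

print(with_b)            # []
print(with_lookarounds)  # ['C++']
```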

If the word boundaries are whitespace chars or start/end of string, use whitespace boundaries, (?<!\S)...(?!\S):

match_string = r'(?<!\S){}(?!\S)'.format(re.escape(word))
match_string = r'(?<!\S)(?:{})(?!\S)'.format("|".join(map(re.escape, words))) 
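For comparison, here is a sketch that runs both boundary styles on the sample string from the answer and records where each match starts. Whitespace boundaries are stricter: "word," is rejected because a comma, not whitespace, follows the word. For the question as asked, where punctuation next to the word should be allowed, that makes \b or the \w lookarounds the closer fit.

```python
import re

strn = "word hereword word, there word"
search = re.escape("word")

# Start offsets of matches using plain word boundaries.
word_boundary_hits = [
    m.start() for m in re.finditer(r'\b{}\b'.format(search), strn)
]

# Start offsets using whitespace boundaries: the neighbours must be
# whitespace or the start/end of the string.
whitespace_hits = [
    m.start() for m in re.finditer(r'(?<!\S){}(?!\S)'.format(search), strn)
]

print(word_boundary_hits)  # [0, 14, 26]
print(whitespace_hits)     # [0, 26]
```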
Wiktor Stribiżew answered Nov 03 '22 13:11