
Optional dot in regex

Say I want to replace all the matches of Mr. and Mr with Mister.

I am using the following regex: \bMr(\.)?\b to match either Mr. or just Mr. Then, I use the re.sub() method to do the replacement.

What is puzzling me is that it is replacing Mr. with Mister.. Why is this keeping the dot . at the end? It looks like it is not matching the Mr\. case but just Mr.

import re
s="a rMr. Nobody Mr. Nobody is Mr Nobody and Mra Nobody."
re.sub(r"\bMr(\.)?\b","Mister", s)

Returns:

'a rMr. Nobody Mister. Nobody is Mister Nobody and Mra Nobody.'

I also tried with the following, but also without luck:

re.sub(r"\b(Mr\.|Mr)\b","Mister", s)

My desired output is:

'a rMr. Nobody Mister Nobody is Mister Nobody and Mra Nobody.'
                     ^                              ^
                     no dot            this should be kept as it is
asked Nov 13 '14 11:11 by fedorqui 'SO stop harming'



2 Answers

I think you want to match 'Mr' followed by either a '.' or a word boundary:

r"\bMr(?:\.|\b)"

In use:

>>> import re
>>> re.sub(r"\bMr(?:\.|\b)", "Mister", "a rMr. Nobody Mr. Nobody is Mr Nobody and Mra Nobody.")
'a rMr. Nobody Mister Nobody is Mister Nobody and Mra Nobody.'
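
As a quick check, here is a minimal sketch that runs the pattern above against the question's sample string. It also tries a second pattern, `\bMr\b\.?`, which is not from this thread; it is only an assumed alternative that asserts the boundary right after `Mr` and then consumes an optional dot. The variable names are illustrative.

```python
import re

s = "a rMr. Nobody Mr. Nobody is Mr Nobody and Mra Nobody."

# Pattern from the answer above: "Mr" followed by a literal dot OR a word boundary.
answer_pattern = re.compile(r"\bMr(?:\.|\b)")

# Assumed alternative (not from the thread): require the boundary
# directly after "Mr", then swallow a dot if one follows.
alt_pattern = re.compile(r"\bMr\b\.?")

result_answer = answer_pattern.sub("Mister", s)
result_alt = alt_pattern.sub("Mister", s)

print(result_answer)
print(result_alt)
```

Both calls should leave `rMr.` and `Mra` untouched, because neither has a word boundary in the required place, while `Mr.` and `Mr` become `Mister`.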
answered Sep 30 '22 07:09 by jonrsharpe


re.sub(r"\bMr\.|\bMr\b","Mister", s)

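Try this. You need to remove the \b after the dot.

Output: 'a rMr. Nobody Mister Nobody is Mister Nobody and Mra Nobody.'

\bMr(\.)?\b does not work because there is no word boundary between the . and the following space: both are non-word characters. The trailing \b therefore fails after Mr., the engine backtracks so that the optional (\.)? matches nothing, and \b then succeeds between r and the dot, which leaves the dot unreplaced.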

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.
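
The three cases above can be seen directly by listing the offsets where `\b` matches in a short string. This is only an illustrative sketch; the sample text `"Mr. Nobody"` and the variable names are chosen here for the demonstration.

```python
import re

text = "Mr. Nobody"

# Offsets at which the zero-width \b assertion succeeds.
# 0: before "M", 2: between "r" and ".", 4: between " " and "N", 10: after "y".
# Offset 3 (between "." and " ") is missing: both neighbours are non-word characters.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)

# The question's original pattern: the trailing \b cannot match after "Mr.",
# so the optional group ends up matching nothing and the dot is left out.
match = re.search(r"\bMr(\.)?\b", text)
print(repr(match.group(0)), match.group(1))
```

The match covers only `Mr`, and the optional group is unset, which is why the original substitution produced `Mister.`.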
answered Sep 30 '22 05:09 by vks