
unexpected result for python re.sub() with non-capturing character

Tags:

python

regex

I cannot understand the following output:

import re 

re.sub(r'(?:\s)ff','fast-forward',' ff')
'fast-forward'

According to the documentation:

Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.

So why is the whitespace included in the matched occurrence, and then replaced, even though I put it in a non-capturing group?

I would like to have the following output:

' fast-forward'
plalanne asked Mar 07 '23 00:03



1 Answer

The non-capturing group still matches and consumes the matched text. Consuming means that the matched text becomes part of the overall match value (the substring that re.sub() replaces) and that the regex engine's position advances past it. So, (?:\s) puts the whitespace into the match value, and the whole match, whitespace included, is replaced with fast-forward. "Non-capturing" only means the whitespace is not stored as a separate numbered group.
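
To see the consumption directly, one can inspect the match object. This is a minimal sketch using re.search() with the same pattern and input as in the question; the variable name m is only for illustration:

```python
import re

# Same pattern and input as in the question.
m = re.search(r'(?:\s)ff', ' ff')

# The overall match (group 0) includes the leading whitespace,
# so re.sub() replaces the whitespace along with "ff".
print(repr(m.group(0)))  # ' ff'

# The group is non-capturing, so no numbered groups are stored.
print(m.groups())  # ()
```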

You want to use a look-behind to check for a pattern without consuming it:

re.sub(r'(?<=\s)ff','fast-forward',' ff')


An alternative is to wrap the part of the pattern you need to keep in a capturing group and refer to it with a backreference in the replacement:

re.sub(r'(\s)ff',r'\1fast-forward',' ff')
         ^  ^      ^^ 

Here, (\s) saves the whitespace in Group 1, and \1 in the replacement inserts it back into the result.
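
As a quick check that both fixes give the same result, here is a small sketch on a longer sample string. The sample text and the variable names are made up for illustration and do not come from the question:

```python
import re

text = 'git ff and diff'  # hypothetical sample input

# Non-capturing group: the whitespace is part of the match and is lost.
lost = re.sub(r'(?:\s)ff', 'fast-forward', text)

# Lookbehind: the whitespace is checked but not consumed.
lookbehind = re.sub(r'(?<=\s)ff', 'fast-forward', text)

# Capturing group plus backreference: the whitespace is consumed,
# then written back by \1.
backref = re.sub(r'(\s)ff', r'\1fast-forward', text)

print(repr(lost))        # 'gitfast-forward and diff'
print(repr(lookbehind))  # 'git fast-forward and diff'
print(repr(backref))     # 'git fast-forward and diff'
```

In all three cases the "ff" inside "diff" is left alone, because it is not preceded by whitespace.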

See the Python demo:

import re 
print('"{}"'.format(re.sub(r'(?<=\s)ff','fast-forward',' ff')))
# => " fast-forward"
Wiktor Stribiżew answered Mar 10 '23 11:03
