
Why does re.findall() find more matches than re.sub()?

Tags: python, regex

Consider the following:

>>> import re
>>> a = "first:second"
>>> re.findall("[^:]*", a)
['first', '', 'second', '']
>>> re.sub("[^:]*", r"(\g<0>)", a)
'(first):(second)'

re.sub()'s behavior makes more sense initially, but I can also understand re.findall()'s behavior. After all, you can match an empty string between first and : that consists only of non-colon characters (exactly zero of them), but why isn't re.sub() behaving the same way?

Shouldn't the result of the last command be (first)():(second)()?

Tim Pietzcker asked May 04 '13 06:05



1 Answer

Your pattern uses the * quantifier, which allows it to match the empty string:

'first'   -> matched
':'       -> not in the character class, but because * allows zero
             repetitions, an empty string is matched just before it --> ''
'second'  -> matched
'$'       -> at the end of the string an empty string can still be matched,
             so one more empty match is found --> ''
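To make the positions of those four matches visible, here is a small sketch using re.finditer(), which reports where each match starts and ends. The variable name spans is only for illustration.

```python
import re

a = "first:second"

# Record (start, end, matched text) for every match of the pattern.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"[^:]*", a)]

print(spans)
# [(0, 5, 'first'), (5, 5, ''), (6, 12, 'second'), (12, 12, '')]
```

The two empty matches sit at positions 5 and 12, which is exactly where the preceding non-empty match ends.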

Quoting the documentation for re.findall():

Empty matches are included in the result unless they touch the beginning of another match.

The reason you don't see empty matches in sub results is explained in the documentation for re.sub():

Empty matches for the pattern are replaced only when not adjacent to a previous match.

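Try this:

print(re.sub('(?:Choucroute garnie)*', '#', 'ornithorynque'))

And now this:

print(re.sub('(?:nithorynque)*', '#', 'ornithorynque'))

The second result contains no consecutive # characters: the empty match at the end of the string touches the preceding match of nithorynque, so it is not replaced. (From Python 3.7 on, re.sub() also replaces an empty match that is adjacent to a previous non-empty match, so there the output ends in ##.)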
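The rule quoted from the re.sub() documentation can be sketched on top of re.finditer(). This is only an illustration of the quoted rule, not how the re module is implemented; sub_sketch and its skip_adjacent_empty flag are hypothetical names made up for this example.

```python
import re

def sub_sketch(pattern, repl, string, skip_adjacent_empty=True):
    """Hypothetical helper: rebuild a substitution from re.finditer()."""
    pieces = []
    last_end = 0   # end position of the most recently replaced match
    replaced = 0   # number of replacements made so far
    for m in re.finditer(pattern, string):
        is_empty = m.start() == m.end()
        # The quoted rule: leave an empty match alone when it sits
        # directly after the previously replaced match.
        if skip_adjacent_empty and is_empty and replaced and m.start() == last_end:
            continue
        pieces.append(string[last_end:m.start()])  # text between matches
        pieces.append(m.expand(repl))              # expands \g<0> and friends
        last_end = m.end()
        replaced += 1
    pieces.append(string[last_end:])               # remaining tail
    return "".join(pieces)

print(sub_sketch("[^:]*", r"(\g<0>)", "first:second"))
# (first):(second)
print(sub_sketch("[^:]*", r"(\g<0>)", "first:second", skip_adjacent_empty=False))
# (first)():(second)()
```

With the flag on, the sketch reproduces the output shown in the question. With it off, every match that findall() reports is replaced, which gives the (first)():(second)() the question expected.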

Casimir et Hippolyte answered Oct 01 '22 05:10