
Regular expressions: (.*), (.*?) and .* [duplicate]

Tags: python, regex

Could someone explain to me the difference between these 3 blocks:

1 -> (.*)
2 -> (.*?)
3 -> .*

As I understand it, ? makes the preceding character optional, so why is it used here? And why not put parentheses around the .* at the end as well?

This comes from here: http://www.tutorialspoint.com/python/python_reg_expressions.htm

1st example: searchObj = re.search( r'(.*) are (.*?) .*', line, re.M|re.I)
John Doe asked Jan 10 '15 21:01


1 Answer

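.* matches zero or more of any character (newlines are included only if re.DOTALL is used). It is greedy: it matches as much as it can.

(.*) matches the same text and also stores it in a capture group.

(.*?) does both: the ? placed after * makes the .* non-greedy, so it matches as little as it can while still letting the overall pattern match, and the parentheses make it a capture group as well. Placed after a quantifier like this, ? does not mean "optional".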
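
To make the two meanings of ? concrete, here is a minimal sketch. The sample strings ('color', 'colour', '<a><b>') are made up for illustration and do not come from the question.

```python
import re

# After a plain character, ? means "optional" (zero or one occurrence).
optional = re.compile(r'colou?r')
print(bool(optional.fullmatch('color')))   # the u is absent
print(bool(optional.fullmatch('colour')))  # the u is present

# After a quantifier such as *, ? makes that quantifier non-greedy instead.
text = '<a><b>'  # made-up sample string
greedy = re.search(r'<(.*)>', text).group(1)
lazy = re.search(r'<(.*?)>', text).group(1)
print(greedy)  # a><b
print(lazy)    # a
```

The greedy version runs to the last > it can find, while the non-greedy one stops at the first.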

For example:

>>> import re
>>> txt = ''' foo
... bar
... baz '''
>>> for found in re.finditer('(.*)', txt):
...     print(found.groups())
... 
(' foo',)
('',)
('bar',)
('',)
('baz ',)
('',)
>>> for found in re.finditer('.*', txt):
...     print(found.groups())
... 
()
()
()
()
()
()
>>> for found in re.finditer('.*', txt, re.DOTALL):
...     print(found.groups())
... 
()
()
>>> for found in re.finditer('(.*)', txt, re.DOTALL):
...     print(found.groups())
... 
(' foo\nbar\nbaz ',)
('',)
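
The empty tuples printed for .* do not mean that nothing matched: groups() only lists capture groups, and a pattern without parentheses has none. The matched text is still available as group(0). A small sketch of this, reusing the same text as the session above (the variable names no_group and with_group are just for illustration):

```python
import re

txt = ' foo\nbar\nbaz '  # same text as in the session above

# No parentheses: groups() is empty, but group(0) holds the whole match.
no_group = [(m.group(0), m.groups()) for m in re.finditer(r'.*', txt, re.DOTALL)]
print(no_group)

# With parentheses: the same text is also available as group(1).
with_group = [(m.group(0), m.group(1)) for m in re.finditer(r'(.*)', txt, re.DOTALL)]
print(with_group)
```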

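And since ? makes the quantifier match as little as possible, and nothing has to follow it in this pattern, every match is an empty string (this output is from Python 2; Python 3.7 and later also report a one-character match between the empty ones):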

>>> for found in re.finditer('(.*?)', txt, re.DOTALL):
...     print(found.groups())
... 
('',)
('',)
('',)
('',)
('',)
('',)
('',)
('',)
('',)
('',)
('',)
('',)
('',)
('',)
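
Applied to the pattern from the question, the non-greedy group becomes useful because something has to follow it. The sketch below assumes a sample line ('Cats are smarter than dogs'), which is not given in the question, and compares the original pattern with a variant whose second group is greedy.

```python
import re

line = 'Cats are smarter than dogs'  # assumed sample input, not from the question

# Pattern from the question: the second group is non-greedy.
lazy = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)
print(lazy.group(1))  # Cats
print(lazy.group(2))  # smarter

# Same pattern with a greedy second group, for comparison.
greedy = re.search(r'(.*) are (.*) .*', line, re.M | re.I)
print(greedy.group(2))  # smarter than
```

The trailing .* has no parentheses because it only needs to consume the rest of the line; nothing reads that text back afterwards, so there is no reason to capture it.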
Russia Must Remove Putin answered Sep 22 '22 22:09