
Why does re.findall() give me different results than re.finditer() in Python?

Tags:

python

regex

I wrote up this regular expression:

p = re.compile(r'''
\[\[            #the first [[
[^:]*?          #no :s are allowed
.*?             #a bunch of chars
(
\|              #either go until a |
|\]\]           #or the last ]]
)
                ''', re.VERBOSE)

I want to use re.findall to get all the matching sections of some string. I wrote some test code, but it gives me bizarre results.

This code

g = p.finditer('   [[Imae|Lol]]     [[sdfef]]')
for elem in g:
    print(elem.span())   # (start, end) of the whole match
    print(elem.group())  # text of the whole match

gives me this output:

(3, 10)
[[Imae|
(20, 29)
[[sdfef]] 

Makes perfect sense, right? But when I do this:

h = p.findall('   [[Imae|Lol]]     [[sdfef]]')
for elem in h:
    print(elem)

the output is this:

|
]]  

Why isn't findall() printing out the same results as finditer()?

Aaron Brown asked May 27 '11 21:05


4 Answers

findall() returns a list of the captured groups whenever the pattern contains any. The parentheses in your regex define a group that findall() thinks you want, but you don't want groups. (?:...) is a non-capturing group. Change your regex to:

p = re.compile(r'''
\[\[            #the first [[
[^:]*?          #no :s are allowed
.*?             #a bunch of chars
(?:             #non-capturing group
\|              #either go until a |
|\]\]           #or the last ]]
)
                ''', re.VERBOSE)
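As a quick check of the difference, here is a minimal sketch in Python 3 syntax that runs findall() with the capturing and the non-capturing version of the same pattern. The patterns are compact, one-line equivalents of the verbose regex above, and the variable names are only illustrative.

```python
import re

text = '   [[Imae|Lol]]     [[sdfef]]'

# One capturing group: findall() returns only what the group matched.
capturing = re.compile(r'\[\[[^:]*?.*?(\||\]\])')
group_only = capturing.findall(text)
print(group_only)    # ['|', ']]']

# Non-capturing group: the pattern has no groups, so findall()
# returns the full text of each match.
non_capturing = re.compile(r'\[\[[^:]*?.*?(?:\||\]\])')
full_matches = non_capturing.findall(text)
print(full_matches)  # ['[[Imae|', '[[sdfef]]']
```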
sverre answered Oct 05 '22 22:10


When you give re.findall() a regex with groups (parenthesized expressions) in it, it returns what those groups matched instead of the whole match. Here, you've only got one group, and it's the | or ]] at the end. In the code where you use re.finditer(), on the other hand, elem.group() with no argument asks for no group in particular, so it gives you the entire match.

You can get re.findall() to do what you want by putting parentheses around the whole regex, or just around the part you're actually trying to extract. Assuming you're trying to parse wiki links, that would be the "bunch of chars" on line 4 of the snippet (the .*? part). For example,

p = re.compile(r'''
\[\[            #the first [[
[^:]*?          #no :s are allowed
(.*?)           #a bunch of chars
(
\|              #either go until a |
|\]\]           #or the last ]]
)
                ''', re.VERBOSE)

p.findall('   [[Imae|Lol]]     [[sdfef]]')

returns a list of tuples, one element per group, because the pattern now has two capturing groups:

[('Imae', '|'), ('sdfef', ']]')]
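If a flat list of strings is wanted rather than tuples, one option is to keep exactly one capturing group and make the trailing alternation non-capturing. The sketch below (Python 3 syntax, compact one-line patterns, illustrative variable names) shows both variants mentioned above: parentheses around the whole regex, and parentheses around only the link text.

```python
import re

text = '   [[Imae|Lol]]     [[sdfef]]'

# Parentheses around the whole regex; the inner alternation is
# non-capturing, so the pattern has exactly one group.
whole = re.compile(r'(\[\[[^:]*?.*?(?:\||\]\]))')
whole_matches = whole.findall(text)
print(whole_matches)  # ['[[Imae|', '[[sdfef]]']

# Parentheses around only the "bunch of chars" part.
inner = re.compile(r'\[\[[^:]*?(.*?)(?:\||\]\])')
inner_matches = inner.findall(text)
print(inner_matches)  # ['Imae', 'sdfef']
```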
rspeer answered Oct 05 '22 23:10


I think the key bit from the findall() documentation is this:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

Your regex has a group around the pipe or closing ]] here:

(
\|              #either go until a |
|\]\]           #or the last ]]
)

finditer() has no such clause: it always yields match objects, and calling group() on one with no arguments returns the entire match.
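To make the documented rule concrete, here is a rough sketch (Python 3 syntax) that rebuilds findall()'s result from finditer(). The helper name findall_via_finditer is hypothetical, not part of the re module; it only mirrors the rule quoted above, using '' for groups that did not take part in a match.

```python
import re

def findall_via_finditer(pattern, text):
    """Hypothetical helper: mimic findall()'s return rule with finditer()."""
    results = []
    for m in pattern.finditer(text):
        if pattern.groups == 0:
            results.append(m.group(0))               # no groups: whole match
        elif pattern.groups == 1:
            results.append(m.groups(default='')[0])  # one group: that group
        else:
            results.append(m.groups(default=''))     # several: tuple of groups
    return results

p = re.compile(r'\[\[[^:]*?.*?(\||\]\])')  # compact form of the question's regex
text = '   [[Imae|Lol]]     [[sdfef]]'

emulated = findall_via_finditer(p, text)
builtin = p.findall(text)
print(emulated == builtin)  # True

# The whole match is always available from a match object as group(0).
full = [m.group(0) for m in p.finditer(text)]
print(full)  # ['[[Imae|', '[[sdfef]]']
```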

CanSpice answered Oct 05 '22 23:10


They don't return the same thing. Some snippets from the docs:

findall returns a list of strings. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

finditer returns an iterator yielding MatchObject instances.

Steven Rumbalski answered Oct 06 '22 00:10