I wrote up this regular expression:
import re

p = re.compile(r'''
\[\[ #the first [[
[^:]*? #no :s are allowed
.*? #a bunch of chars
(
\| #either go until a |
|\]\] #or the last ]]
)
''', re.VERBOSE)
I want to use re.findall() to get all the matching sections of some string. I wrote some test code, but it gives me bizarre results.
This code
g = p.finditer('   [[Imae|Lol]]     [[sdfef]]')
print(g)
for elem in g:
    print(elem.span())
    print(elem.group())
gives me this output:
(3, 10)
[[Imae|
(20, 29)
[[sdfef]]
Makes perfect sense, right? But when I do this:
h = p.findall('   [[Imae|Lol]]     [[sdfef]]')
for elem in h:
    print(elem)
the output is this:
|
]]
Why isn't findall() returning the same results as finditer()?
findall() returns a list of the matching groups. The parentheses in your regex define a capturing group, so findall() returns what that group matched instead of the whole match, which is not what you want here. (?:...) is a non-capturing group. Change your regex to:
p = re.compile(r'''
\[\[ #the first [[
[^:]*? #no :s are allowed
.*? #a bunch of chars
(?: #non-capturing group
\| #either go until a |
|\]\] #or the last ]]
)
''', re.VERBOSE)
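As a quick check, here is a minimal sketch (Python 3 syntax, using the question's sample links) of how findall() behaves once the trailing alternation is non-capturing. The variable names `text` and `matches` are only illustrative.

```python
import re

# Same pattern as in the question, but the trailing alternation
# is a non-capturing group, so the pattern has no capturing groups.
p = re.compile(r'''
    \[\[        # the first [[
    [^:]*?      # lazily matches non-colon characters
    .*?         # a bunch of chars
    (?:         # non-capturing group
        \|      # either go until a |
      | \]\]    # or the last ]]
    )
''', re.VERBOSE)

text = ' [[Imae|Lol]] [[sdfef]]'
matches = p.findall(text)
print(matches)  # ['[[Imae|', '[[sdfef]]']
```

With no capturing groups in the pattern, findall() falls back to returning the full text of each match, which is the same text finditer() gives through group().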
When you give re.findall() a regex with groups (parenthesized expressions) in it, it returns what the groups matched. Here, you've only got one group, and it's the | or ]] at the end. On the other hand, in the code where you use re.finditer(), you call elem.group() with no argument, which asks for no group in particular, so it gives you the entire match.
You can get re.findall() to do what you want by putting parentheses around the whole regex -- or just around the part you're actually trying to extract. Assuming you're trying to parse wiki links, that would be the "bunch of chars" in line 4. For example,
p = re.compile(r'''
\[\[ #the first [[
[^:]*? #no :s are allowed
(.*?) #a bunch of chars
(
\| #either go until a |
|\]\] #or the last ]]
)
''', re.VERBOSE)
p.findall(' [[Imae|Lol]] [[sdfef]]')
returns:
[('Imae', '|'), ('sdfef', ']]')]
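If only the link target is wanted, one possible variation is to keep the capturing group around the "bunch of chars" and make the terminator non-capturing. This is a sketch in Python 3 syntax that assumes the goal is just the target text; `targets` is an illustrative name.

```python
import re

# Only the link target is captured; the terminator is non-capturing,
# so the pattern has exactly one capturing group.
p = re.compile(r'''
    \[\[        # the first [[
    [^:]*?      # lazily matches non-colon characters
    (.*?)       # capture only the link target
    (?:         # non-capturing terminator
        \|      # either go until a |
      | \]\]    # or the last ]]
    )
''', re.VERBOSE)

targets = p.findall(' [[Imae|Lol]] [[sdfef]]')
print(targets)  # ['Imae', 'sdfef']
```

With exactly one capturing group, findall() returns a flat list of strings instead of a list of tuples.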
I think the key bit from the findall() documentation is this:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Your regex has a group around the pipe or closing ]] here:
(
\| #either go until a |
|\]\] #or the last ]]
)
finditer() doesn't appear to have any such clause.
They don't return the same thing. Some snippets from the docs:
findall returns a list of strings. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
finditer returns an iterator yielding MatchObject instances.
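To make the difference concrete, the following sketch (Python 3 syntax, with illustrative variable names) runs the question's original pattern through both functions and shows which call on each match object corresponds to what findall() returns.

```python
import re

# The pattern from the question, with its single capturing group.
p = re.compile(r'''
    \[\[        # the first [[
    [^:]*?      # lazily matches non-colon characters
    .*?         # a bunch of chars
    (           # capturing group
        \|      # either go until a |
      | \]\]    # or the last ]]
    )
''', re.VERBOSE)

text = ' [[Imae|Lol]] [[sdfef]]'

groups = p.findall(text)                           # what group 1 matched
whole = [m.group() for m in p.finditer(text)]      # the entire match
group_one = [m.group(1) for m in p.finditer(text)] # same as findall() here

print(groups)     # ['|', ']]']
print(whole)      # ['[[Imae|', '[[sdfef]]']
print(group_one)  # ['|', ']]']
```

So finditer() can reproduce either result, depending on whether group() or group(1) is called on each match.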