
Python re.match only captures first and last group - am I misunderstanding something?

I'm working on a little Python script that is supposed to match a series of authors, and I'm using the re module for that. I came across something unexpected, which I have been able to reduce to the following very simple example:

>>> import re
>>> s = "$word1$, $word2$, $word3$, $word4$"
>>> word = r'\$(word\d)\$'
>>> m = re.match(word+'(?:, ' + word + r')*', s)
>>> m.groups()
('word1', 'word4')

So I'm defining a 'basic' regexp that matches the main parts of my input, with some recognizable features (in this case the $ signs), and then I try to match one word plus an optional list of additional words.

I'd have expected that m.groups() would've displayed:

>>> m.groups()
('word1', 'word2', 'word3', 'word4')

But apparently I'm doing something wrong. I'd like to know why this solution does not work and how to change it so that I get the result I'm looking for. BTW, this is with Python 2.6.6 on a Linux machine, in case that matters.

Jakob van Bethlehem asked Jun 11 '12 08:06


2 Answers

Although your regex matches every $word#$, the second capture group is overwritten on each repetition, so it ends up holding only the last item matched.

Let's take a look at the parsed pattern, which re.compile prints when given the re.DEBUG flag:

>>> expr = r"\$(word\d)\$(?:, \$(word\d)\$)*"
>>> c = re.compile(expr, re.DEBUG)
literal 36
subpattern 1
  literal 119
  literal 111
  literal 114
  literal 100
  in
    category category_digit
literal 36
max_repeat 0 65535
  subpattern None
    literal 44
    literal 32
    literal 36
    subpattern 2
      literal 119
      literal 111
      literal 114
      literal 100
      in
        category category_digit
    literal 36

As you can see, there are only 2 capture groups: subpattern 1 and subpattern 2. Every time another $word#$ is found, subpattern 2 gets overwritten.
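The overwriting can also be checked directly on the match object. The following is a minimal sketch using the same pattern and input as above (written for a current Python 3, not the 2.6.6 mentioned in the question); it only inspects the match and does not change the behavior:

```python
import re

s = "$word1$, $word2$, $word3$, $word4$"
pattern = re.compile(r"\$(word\d)\$(?:, \$(word\d)\$)*")

m = pattern.match(s)

# The number of groups is fixed by the pattern text, not by how many
# times the repeated part runs.
print(pattern.groups)   # 2

# The whole list is matched...
print(m.group(0) == s)  # True

# ...but group 2 only remembers the final repetition.
print(m.group(1))       # word1
print(m.group(2))       # word4

# span(2) points at the last word, near the end of the string.
print(m.span(2))
```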

As for a potential solution, I would recommend using re.findall() instead of re.match():

>>> s = "$word1$, $word2$, $word3$, $word4$"
>>> authors = re.findall(r"\$(\w+)\$", s)
>>> authors
['word1', 'word2', 'word3', 'word4']
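Note that re.findall() on its own does not check that the whole string has the expected list format. If that matters, one possible approach is to validate with the original pattern first and then extract with findall(). This is only a sketch: the name parse_authors and the choice to raise ValueError are assumptions made for illustration, not part of the question.

```python
import re

WORD = r"\$(\w+)\$"
# One word followed by zero or more ", word" items, up to the end of the string.
LIST_RE = re.compile(WORD + r"(?:, " + WORD + r")*\Z")


def parse_authors(s):
    """Return all words in s, or raise ValueError if s is not a valid list.

    Hypothetical helper: validates the overall shape with LIST_RE,
    then collects every item with re.findall().
    """
    if LIST_RE.match(s) is None:
        raise ValueError("not a comma-separated list of $word$ items: %r" % s)
    return re.findall(WORD, s)


print(parse_authors("$word1$, $word2$, $word3$, $word4$"))
```

The validation step keeps the structure check the original pattern was meant to provide, while findall() does the collecting that a repeated group cannot do.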
Joel Cornett answered Sep 29 '22 06:09


There are only two capture groups in your regexp. Try re.findall(word, s) instead.

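Repeated captures are supported by the third-party regex module, whose match objects provide a captures() method that returns every value a group matched.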
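A rough sketch of what that could look like, assuming the regex package is installed (it is not part of the standard library). The helper name all_words is made up for this example, and the fallback to the standard re module is an added convenience so the snippet still runs without regex:

```python
import re

try:
    import regex  # third-party package; assumed to be installed separately
except ImportError:
    regex = None

PATTERN = r"\$(word\d)\$(?:, \$(word\d)\$)*"


def all_words(s):
    """Return every word matched by PATTERN at the start of s, or []."""
    if regex is not None:
        m = regex.match(PATTERN, s)
        if m is None:
            return []
        # captures(n) lists every value group n matched, not only the last.
        return m.captures(1) + m.captures(2)
    # Fallback without regex: match with re, then re-scan the matched text.
    m = re.match(PATTERN, s)
    if m is None:
        return []
    return re.findall(r"\$(word\d)\$", m.group(0))


print(all_words("$word1$, $word2$, $word3$, $word4$"))
```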

jfs answered Sep 29 '22 08:09