Regular expression group capture with multiple matches

Tags: python, regex

Quick regular expression question.
I'm trying to capture multiple instances of a capture group in Python (I don't think it's Python-specific), but each subsequent capture seems to overwrite the previous one.

In this over-simplified example, I'm essentially trying to split a string:

import re

x = 'abcdef'
r = re.compile(r'(\w){6}')
m = r.match(x)
m.groups()     # = ('f',) ?!?
I want to get ('a', 'b', 'c', 'd', 'e', 'f'), but because each later capture overwrites the earlier one, I get ('f',).

Is this how regex is supposed to behave? Is there a way to do what I want without having to repeat the syntax six times?

Thanks in advance!
Andrew

Andrew Klofas asked Apr 08 '11 16:04

4 Answers

You can't use groups for this, I'm afraid. A repeated group keeps only its last match, and I believe most regex engines work this way. A possible solution is to use findall() or similar.

import re
r = re.compile(r'\w')
r.findall(x)
# ['a', 'b', 'c', 'd', 'e', 'f']
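
To make the difference concrete, here is a minimal sketch that runs both approaches on the string from the question. The variable names last_only and letters are only illustrative.

```python
import re

x = 'abcdef'

# A repeated group keeps only the text from its final iteration.
m = re.match(r'(\w){6}', x)
last_only = m.groups()

# Matching one character at a time and collecting every match
# gives one list entry per character instead.
letters = re.findall(r'\w', x)

print(last_only)  # ('f',)
print(letters)    # ['a', 'b', 'c', 'd', 'e', 'f']
```

The group is not needed at all in the findall() version, since each match is already a single character.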

sverre answered Oct 08 '22 18:10

The regex module can do this.

> m = regex.match(r'(\w){6}', "abcdef")
> m.captures(1)
['a', 'b', 'c', 'd', 'e', 'f']

Also works with named captures:

> m = regex.match(r'(?P<letter>\w){6}', "abcdef")
> m.capturesdict()
{'letter': ['a', 'b', 'c', 'd', 'e', 'f']}

The regex module was proposed as a replacement for the standard 're' module. It is a separate third-party package that aims to be a backwards-compatible drop-in replacement, with many more features and capabilities.

rjh answered Oct 08 '22 20:10

To find all matches in a given string, use re.findall(regex, string). Also, if you want to obtain every letter here, your regex should be either '(\w){1}' or just '(\w)'.

See:

import re

r = re.compile(r'(\w)')
l = re.findall(r, x)

l == ['a', 'b', 'c', 'd', 'e', 'f']
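
One detail worth knowing when using findall() with groups: what it returns depends on how many groups the pattern has. The sketch below uses two-character patterns that are not from the question, purely to show the three cases.

```python
import re

x = 'abcdef'

# No group: findall returns the whole of each match.
whole = re.findall(r'\w\w', x)

# One group: findall returns only that group's text for each match.
one_group = re.findall(r'(\w)\w', x)

# Several groups: findall returns one tuple per match.
two_groups = re.findall(r'(\w)(\w)', x)

print(whole)       # ['ab', 'cd', 'ef']
print(one_group)   # ['a', 'c', 'e']
print(two_groups)  # [('a', 'b'), ('c', 'd'), ('e', 'f')]
```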

pajton answered Oct 08 '22 18:10

I suppose your question is a simplified version of what you actually need.

So here is a slightly more complex example:

import re

pat = re.compile('[UI][bd][ae]')

ch = 'UbaUdeIbaIbeIdaIdeUdeUdaUdeUbeIda'

print([mat.group() for mat in pat.finditer(ch)])

Result:

['Uba', 'Ude', 'Iba', 'Ibe', 'Ida', 'Ide', 'Ude', 'Uda', 'Ude', 'Ube', 'Ida']
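
If the individual pieces of each match are needed as well, one option is to wrap each character class in its own group and read the groups from every match object. This is only a sketch: the grouped pattern and the shortened input string are my own variation on the example, not part of it.

```python
import re

# Same character classes as above, each wrapped in a capture group.
pat = re.compile(r'([UI])([bd])([ae])')
ch = 'UbaUdeIbaIbe'  # shortened input, chosen only for illustration

# Each match object carries its own groups, so nothing is overwritten.
parts = [mat.groups() for mat in pat.finditer(ch)]

print(parts)
# [('U', 'b', 'a'), ('U', 'd', 'e'), ('I', 'b', 'a'), ('I', 'b', 'e')]
```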

eyquem answered Oct 08 '22 20:10