
Regex: How do I capture a group after an optional capturing group using regular expressions?

Suppose I have the following strings:

s1=u'--FE(-)---'
s2=u'--FEM(-)---'
s3=u'--FEE(--)-'

and I want to capture F, E, E, M and the content of the parentheses in separate groups.

I have tried the following regular expression:

u'^.-([F])([EF]*)([E]+)[^FEM]?(M*)?(\\(.*\\))?.*$'

This expression gives the following groups and spans for the different strings:

s1 -> 'F',(2,3)  ,  '',(3,3)   ,  'E',(3,4)  ,  '',(5,5)   ,  None,(-1,-1)
s2 -> 'F',(2,3)  ,  '',(3,3)   ,  'E',(3,4)  ,  'M',(4,5)  ,  '(-)',(5,8)
s3 -> 'F',(2,3)  ,  'E',(3,4)  ,  'E',(4,5)  ,  '',(6,6)   ,  None,(-1,-1)

For s2, I get the desired behaviour, a match for the content of the parentheses, but for s1 and s3 I don't.

How do I create a regular expression that will match the content of the parentheses even if I don't have a proper match for the group containing 'M's?

EDIT:

The answer by DWilches resolved the initial issue using the regular expression

'^.-(F)([EF]*)(E+)[^FEM]??(M*)(\(.*\)).*?$'

However, the parentheses group is also optional. The following short Python script clarifies the problem:

s1=u'--FE(-)---'
s2=u'--FEM(-)--'
s3=u'--FEE(--)-'
s4=u'--FEE-M(---)--'
s5=u'--FE-M-(-)-'
s6=u'--FEM--'
s7=u'--FE-M--'

ll=[s1,s2,s3,s4,s5,s6,s7]

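import re
# rr1 requires the parenthesised group; rr2 makes it optional.
# Raw strings keep the backslashes in \( and \) intact.
rr1=re.compile(r'^.-(F)([EF]*)(E+)[^FEM]??(M*)[^FEM]??(\(.*\)).*?$')
rr2=re.compile(r'^.-(F)([EF]*)(E+)[^FEM]??(M*)[^FEM]??(\(.*\))?.*?$')

for s in ll:
    b=rr1.search(s)  # replace rr1 with rr2 to get the second output below
    print(s)
    if b:
        print(" '%s' '%s' '%s' '%s' '%s' " % (b.group(1), b.group(2), b.group(3), b.group(4), b.group(5)))
    else:
        print('No match')
    print('######')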

For rr1, the output is:

--FE(-)---
 'F' '' 'E' '' '(-)' 
######
--FEM(-)--
 'F' '' 'E' 'M' '(-)' 
######
--FEE(--)-
 'F' 'E' 'E' '' '(--)' 
######
--FEE-M(---)--
 'F' 'E' 'E' 'M' '(---)' 
######
--FE-M-(-)-
 'F' '' 'E' 'M' '(-)' 
######
--FEM--
No match
######
--FE-M--
No match
######

This is correct for the first five strings, but not for the last two, since rr1 requires the parentheses.

rr2, which adds ? after (\(.*\)), yields the following output instead:

--FE(-)---
 'F' '' 'E' '' '(-)' 
######
--FEM(-)--
 'F' '' 'E' 'M' '(-)' 
######
--FEE(--)-
 'F' 'E' 'E' '' '(--)' 
######
--FEE-M(---)--
 'F' 'E' 'E' '' 'None' 
######
--FE-M-(-)-
 'F' '' 'E' '' 'None' 
######
--FEM--
 'F' '' 'E' 'M' 'None' 
######
--FE-M--
 'F' '' 'E' '' 'None' 
######

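This is correct for s1, s2, s3 and s6. For s4, s5 and s7 the M is lost, and for s4 and s5 the content of the parentheses is lost as well.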

Some modification is needed to yield the desired output: getting the M if it exists and the content of the parentheses if the parentheses exist.

Erlend Aune asked Oct 21 '22 19:10



1 Answer

It seems you need to use non-greedy quantifiers:

^.-(F)([EF]*)(E+)[^FEM]??(M*)(\\(.*\\))?.*?$

Note that I added a ? at the end of the last .*, and I also changed [^FEM]? to [^FEM]??.

In the first of your samples, the problem was that your [^FEM]? was eating up the ( while the last .* was eating up the rest, starting at -), thus not leaving anything for (\\(.*\\))?

(I also removed some square brackets around single letters, but that was only to make the regex shorter.)
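
The difference between the greedy and the lazy optional can be checked with a short Python 3 sketch. The names s1, old_rx and new_rx are only illustrative; the two patterns are the one from the question and the one above, written as raw strings.

```python
import re

s1 = '--FE(-)---'

# Pattern from the question: the greedy [^FEM]? grabs the '(' first,
# so the optional parentheses group has nothing left to match.
old_rx = re.compile(r'^.-([F])([EF]*)([E]+)[^FEM]?(M*)?(\(.*\))?.*$')

# Pattern from this answer: the lazy [^FEM]?? first tries to match nothing,
# which leaves the '(' available for the parentheses group.
new_rx = re.compile(r'^.-(F)([EF]*)(E+)[^FEM]??(M*)(\(.*\))?.*?$')

print(old_rx.match(s1).group(5))  # None
print(new_rx.match(s1).group(5))  # (-)
```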

With this regex I obtain the following results:

--FE(-)---    ->     'F'    ''     'E'    ''     '(-)'
--FEM(-)---   ->     'F'    ''     'E'    'M'    '(-)'
--FEE(--)-    ->     'F'    'E'    'E'    ''     '(--)'

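BTW: I would also remove the ? at the end of (\\(.*\\))? when the parentheses are always present, because as long as the group is optional the engine is free to skip it and let the following .*? consume that part of the string. Note that without the ? the parentheses become mandatory, so a string that lacks them will no longer match.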
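
For the extended list of strings in the question's EDIT, where both the M and the parentheses are optional, one possible approach is sketched below. It is not part of the original answer and has only been checked against those seven strings. The idea is to keep the separators greedy but exclude ( from their character class, so that a separator can never swallow the opening parenthesis. The names rx, samples and results are illustrative.

```python
import re

# Assumption: at most one separator character (anything except F, E, M
# and '(') appears before the M's, and at most one after them.
rx = re.compile(r'^.-(F)([EF]*)(E+)[^FEM(]?(M*)[^FEM(]?(\(.*\))?.*$')

samples = ['--FE(-)---', '--FEM(-)--', '--FEE(--)-', '--FEE-M(---)--',
           '--FE-M-(-)-', '--FEM--', '--FE-M--']

# Map each sample string to its tuple of five captured groups.
results = {s: rx.match(s).groups() for s in samples}
for s, groups in results.items():
    print(s, groups)
```

With this pattern, --FE-M-(-)- gives ('F', '', 'E', 'M', '(-)') and --FE-M-- gives ('F', '', 'E', 'M', None), so the M is captured when it exists and the last group is None when there are no parentheses.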

Daniel answered Oct 23 '22 10:10
