Suppose I have the following strings:
s1=u'--FE(-)---'
s2=u'--FEM(-)---'
s3=u'--FEE(--)-'
and I want to match F,E,E,M and the content of the parentheses in different groups.
I have tried the following regular expression:
u'^.-([F])([EF]*)([E]+)[^FEM]?(M*)?(\\(.*\\))?.*$'
This expression gives the following groups and spans for the different strings:
s1 -> 'F',(2,3) , '',(3,3) , 'E',(3,4) , '',(5,5) , None,(-1,-1)
s2 -> 'F',(2,3) , '',(3,3) , 'E',(3,4) , 'M',(4,5) , '(-)',(5,8)
s3 -> 'F',(2,3) , 'E',(3,4) , 'E',(4,5) , '',(6,6) , None,(-1,-1)
For s2, I get the desired behaviour (the contents of the parentheses are matched), but for s1 and s3 I don't.
How do I create a regular expression that will match the content of the parentheses even if I don't have a proper match for the group containing 'M's?
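For anyone who wants to reproduce the groups and spans listed above, here is a minimal sketch. It assumes Python 3 and writes the pattern as a raw string; the helper name groups_with_spans is hypothetical and only used for illustration.

```python
import re

# Pattern from the question, written as a raw string (assumes Python 3).
pattern = re.compile(r'^.-([F])([EF]*)([E]+)[^FEM]?(M*)?(\(.*\))?.*$')


def groups_with_spans(s):
    """Hypothetical helper: return [(text, span), ...] for groups 1..5,
    or None if the pattern does not match at all."""
    m = pattern.search(s)
    if m is None:
        return None
    # m.span(i) is (-1, -1) for a group that did not take part in the match.
    return [(m.group(i), m.span(i)) for i in range(1, 6)]


for s in ['--FE(-)---', '--FEM(-)---', '--FEE(--)-']:
    print(s, groups_with_spans(s))
```

A group that did not participate shows up as None with span (-1, -1), which is what happens to the parentheses group for s1 and s3.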
EDIT:
The answer by DWilches resolved the initial issue using the regular expression
'^.-(F)([EF]*)(E+)[^FEM]??(M*)(\(.*\)).*?$'
However, the parentheses group is also optional. The following short Python script clarifies the problem:
import re

s1=u'--FE(-)---'
s2=u'--FEM(-)--'
s3=u'--FEE(--)-'
s4=u'--FEE-M(---)--'
s5=u'--FE-M-(-)-'
s6=u'--FEM--'
s7=u'--FE-M--'
ll=[s1,s2,s3,s4,s5,s6,s7]

# rr1 requires the parentheses group; rr2 makes it optional
rr1=re.compile(u'^.-(F)([EF]*)(E+)[^FEM]??(M*)[^FEM]??(\(.*\)).*?$')
rr2=re.compile(u'^.-(F)([EF]*)(E+)[^FEM]??(M*)[^FEM]??(\(.*\))?.*?$')

for s in ll:
    b=rr1.search(s)  # swap in rr2 to get the second output
    print(s)
    if b:
        print(" '%s' '%s' '%s' '%s' '%s' " % (b.group(1), b.group(2), b.group(3), b.group(4), b.group(5)))
    else:
        print('No match')
    print('######')
For rr1, the output is:
--FE(-)---
'F' '' 'E' '' '(-)'
######
--FEM(-)--
'F' '' 'E' 'M' '(-)'
######
--FEE(--)-
'F' 'E' 'E' '' '(--)'
######
--FEE-M(---)--
'F' 'E' 'E' 'M' '(---)'
######
--FE-M-(-)-
'F' '' 'E' 'M' '(-)'
######
--FEM--
No match
######
--FE-M--
No match
######
This is fine for the first five strings, but not for the last two, since rr1 requires the parentheses.
rr2, however, which adds ? to (\(.*\)), yields the following output:
--FE(-)---
'F' '' 'E' '' '(-)'
######
--FEM(-)--
'F' '' 'E' 'M' '(-)'
######
--FEE(--)-
'F' 'E' 'E' '' '(--)'
######
--FEE-M(---)--
'F' 'E' 'E' '' 'None'
######
--FE-M-(-)-
'F' '' 'E' '' 'None'
######
--FEM--
'F' '' 'E' 'M' 'None'
######
--FE-M--
'F' '' 'E' '' 'None'
######
This is OK for s1, s2, s3 and s6, but s4, s5 and s7 lose the M, and s4 and s5 also lose the parentheses.
Some modification is needed to yield the desired output: getting the M if it exists and the content of the parentheses if the parentheses exist.
It seems you need to use non-greedy operators:
^.-(F)([EF]*)(E+)[^FEM]??(M*)(\\(.*\\))?.*?$
Note that I added a ? after the last .* and also changed [^FEM]? to [^FEM]??.
In the first of your samples, the problem was that the last .* was eating up -) while your [^FEM]? was eating up (, thus not leaving anything for (\\(.*\\))?
(I also removed some square brackets around single letters, but that was more to have a shorter regex)
With this regex I obtain the following results:
--FE(-)--- -> 'F' '' 'E' '' '(-)'
--FEM(-)--- -> 'F' '' 'E' 'M' '(-)'
--FEE(--)- -> 'F' 'E' 'E' '' '(--)'
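To check the difference yourself, here is a small sketch that runs the question's pattern and the non-greedy pattern side by side on the three original strings. It assumes Python 3 and raw strings; the names greedy, lazy and paren_group are made up for this example.

```python
import re

# Pattern from the question (greedy optional separator).
greedy = re.compile(r'^.-([F])([EF]*)([E]+)[^FEM]?(M*)?(\(.*\))?.*$')
# Pattern from this answer (lazy optional separator, lazy tail).
lazy = re.compile(r'^.-(F)([EF]*)(E+)[^FEM]??(M*)(\(.*\))?.*?$')


def paren_group(pattern, s):
    """Hypothetical helper: return group 5 (the parentheses), or None."""
    m = pattern.search(s)
    return m.group(5) if m else None


for s in ['--FE(-)---', '--FEM(-)---', '--FEE(--)-']:
    print(s, repr(paren_group(greedy, s)), repr(paren_group(lazy, s)))
```

With the greedy [^FEM]? the opening parenthesis is consumed before group 5 is tried, so the group stays None for the strings without an M; with [^FEM]?? the engine first tries to skip the separator, which leaves the ( available.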
BTW: I would also remove the ? at the end of (\\(.*\\))?, because even if you don't put it there, a string that doesn't match that part will be consumed by the following .*?.
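For the extended case in the question's edit, where both the M and the parentheses are optional, one possible approach (a sketch, not something tested beyond the seven sample strings) is to keep the optional separators greedy but exclude ( from them, so they can never swallow the opening parenthesis. This assumes at most one separator character on each side of the M, as in the samples, and Python 3; the names rr3 and extract_groups are hypothetical.

```python
import re

# Assumption: the optional separators are single characters that are never
# 'F', 'E', 'M' or '(' (true for the sample strings in the question).
rr3 = re.compile(r'^.-(F)([EF]*)(E+)[^FEM(]?(M*)[^FEM(]?(\(.*\))?.*$')


def extract_groups(s):
    """Hypothetical helper: return the five groups as a tuple, or None."""
    m = rr3.search(s)
    return m.groups() if m else None


samples = ['--FE(-)---', '--FEM(-)--', '--FEE(--)-', '--FEE-M(---)--',
           '--FE-M-(-)-', '--FEM--', '--FE-M--']
for s in samples:
    print(s, extract_groups(s))
```

Because the separators are greedy here, a - in front of the M is consumed and the M is then captured, while the excluded ( is always left for the parentheses group.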