I am searching an XML file generated from Ms word for some phrases. The thing is that any phrase can be interrupted with some XML tags, that can come between words, or even inside words, as you can see in the example:
</w:rPr><w:t> To i</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:lang w:fareast="JA"/></w:rPr><w:t>ncrease knowledge of and acquired skills for implementing social policies with a view to strengthening the capacity of developing countries at the national and community level.</w:t></w:r></w:p>
So my approach to handle this problem was to simply reduce all XML tags into clusters of # characters of the same length, so that when I can find any phrase, the regex would ignore all the XML tags between each two characters.
What I need basically is the span of this phrase within the actual xml document, so I will use this span into later processing with the xml document, I cannot use clones.
This approach works remarkablly, but some phrases cause catastropic backtracking, such as the following example, so I need someone to point out where does the backtracking come from, or suggest a better solution to the problem.
================================
Here is an example:
I have this text where there are some clusters of # characters within it (which I want to keep), and the spaces are also unpredictable, such as the following:
Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expectedaccomplishment (c)#######
In order to match the following phrase:
Relationship to the strategic framework for the period 2014-2015: programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)
I came up with this regex to accommodate the unpredictable # and space characters:
u'R#*e#*l#*a#*t#*i#*o#*n#*s#*h#*i#*p#*\\s*#*t#*o#*\\s*#*t#*h#*e#*\\s*#*s#*t#*r#*a#*t#*e#*g#*i#*c#*\\s*#*f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r#*\\s*#*t#*h#*e#*\\s*#*p#*e#*r#*i#*o#*d#*\\s*#*2#*0#*1#*4#*\\-#*2#*0#*1#*5#*:#*\\s*#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*7#*\\,#*\\s*#*E#*c#*o#*n#*o#*m#*i#*c#*\\s*#*a#*n#*d#*\\s*#*S#*o#*c#*i#*a#*l#*\\s*#*A#*f#*f#*a#*i#*r#*s#*\\,#*\\s*#*s#*u#*b#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*3#*\\,#*\\s*#*e#*x#*p#*e#*c#*t#*e#*d#*\\s*#*a#*c#*c#*o#*m#*p#*l#*i#*s#*h#*m#*e#*n#*t#*\\s*#*\\(#*c#*\\)'
And it works fine in all the other phrases that I want to match, but this one has a problem leading to some catastrophic backtracking, can anyone spot it?
The original text is separated with xml tags, so to make it simpler for the regex, I replaced the tags with these # clusters, here is the original text:
</w:rPr><w:t>Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t> for the period 2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)</w:t>
Since the situation is that complex - don't use regex, just go through your line symbol by symbol:
etalone = "String to find"
etalone_length = len(etalone)
counter = 0
for symbol in your_line:
if symbol == etalone[counter]:
counter += 1
if counter == etalone_length:
print("String matches")
break
elif symbol != " " and sybmol != "#":
# Bad char found
print("Does not match!")
else: # exited 'for' before full etalone matched
print("Does not match!")
I just figured out, that above will not, actually, work if the very first symbol we match is not the one we're looking for. How about this instead:
a#b##ca#d#f and the line we're looking for is adf then we would start matching from the second a symbol.a in the original line. Set counter = break as span end)If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With