Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python regex catastrophic backtracking

I am searching an XML file generated from Ms word for some phrases. The thing is that any phrase can be interrupted with some XML tags, that can come between words, or even inside words, as you can see in the example:

</w:rPr><w:t> To i</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:lang w:fareast="JA"/></w:rPr><w:t>ncrease knowledge of and acquired skills for implementing social policies with a view to strengthening the capacity of developing countries at the national and community level.</w:t></w:r></w:p>

So my approach to handle this problem was to simply reduce all XML tags into clusters of # characters of the same length, so that when I can find any phrase, the regex would ignore all the XML tags between each two characters.

What I need basically is the span of this phrase within the actual xml document, so I will use this span into later processing with the xml document, I cannot use clones.

This approach works remarkablly, but some phrases cause catastropic backtracking, such as the following example, so I need someone to point out where does the backtracking come from, or suggest a better solution to the problem.

================================

Here is an example:

I have this text where there are some clusters of # characters within it (which I want to keep), and the spaces are also unpredictable, such as the following:

Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected

accomplishment (c)#######

In order to match the following phrase:

Relationship to the strategic framework for the period 2014-2015: programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

I came up with this regex to accommodate the unpredictable # and space characters:

u'R#*e#*l#*a#*t#*i#*o#*n#*s#*h#*i#*p#*\\s*#*t#*o#*\\s*#*t#*h#*e#*\\s*#*s#*t#*r#*a#*t#*e#*g#*i#*c#*\\s*#*f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r#*\\s*#*t#*h#*e#*\\s*#*p#*e#*r#*i#*o#*d#*\\s*#*2#*0#*1#*4#*\\-#*2#*0#*1#*5#*:#*\\s*#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*7#*\\,#*\\s*#*E#*c#*o#*n#*o#*m#*i#*c#*\\s*#*a#*n#*d#*\\s*#*S#*o#*c#*i#*a#*l#*\\s*#*A#*f#*f#*a#*i#*r#*s#*\\,#*\\s*#*s#*u#*b#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*3#*\\,#*\\s*#*e#*x#*p#*e#*c#*t#*e#*d#*\\s*#*a#*c#*c#*o#*m#*p#*l#*i#*s#*h#*m#*e#*n#*t#*\\s*#*\\(#*c#*\\)'

And it works fine in all the other phrases that I want to match, but this one has a problem leading to some catastrophic backtracking, can anyone spot it?

The original text is separated with xml tags, so to make it simpler for the regex, I replaced the tags with these # clusters, here is the original text:

</w:rPr><w:t>Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t> for the period 2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)</w:t>

like image 590
hmghaly Avatar asked Apr 16 '26 09:04

hmghaly


1 Answers

Since the situation is that complex - don't use regex, just go through your line symbol by symbol:

etalone = "String to find"
etalone_length = len(etalone)
counter = 0
for symbol in your_line:
    if symbol == etalone[counter]:
        counter += 1
        if counter == etalone_length:
            print("String matches")
            break
    elif symbol != " " and sybmol != "#":
        # Bad char found
        print("Does not match!")
else:  # exited 'for' before full etalone matched
    print("Does not match!")

I just figured out, that above will not, actually, work if the very first symbol we match is not the one we're looking for. How about this instead:

  1. Clone your string
  2. Remove "#" from clone
  3. Match against pattern
  4. If pattern matches - get the location of matched result
  5. By that location - find which exact occurrence of the first symbol was matched. Like if full line is a#b##ca#d#f and the line we're looking for is adf then we would start matching from the second a symbol.
  6. Find nth occurrence of symbol a in the original line. Set counter =
  7. Use above algorithm (storing as span start and counter before break as span end)
like image 106
mishik Avatar answered Apr 17 '26 21:04

mishik



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!