Match double hyphens in comments of malformed XML

Tags:

regex

pcre

autoit

I have to parse XML files that violate the "no double hyphens in comments" rule of the XML standard, which makes MSXML complain. I am looking for a way to delete the offending hyphens.

I am using StringRegExpReplace(). I tried the following regular expressions:

<!--(.*)--> : correctly gets comments
<!--(-*)--> : fails to be a correct regex (also tried escaping and using \x2D)

Given the right pattern, I would call:

StringRegExpReplace($xml_string,$correct_pattern,"") ;replace with nothing

How can I match the extra hyphens within an XML comment while leaving the rest of the text alone?

Pirvu Paul Daniel asked Jan 23 '15 21:01


1 Answer

You can use this pattern:

(?|\G(?!\A)(?|-{2,}+([^->][^-]*)|(-[^-]+)|-+(?=-->)|-->[^<]*(*SKIP)(*FAIL))|[^<]*<+(?>[^<]+<+)*?(?:!--\K|[^<]*\z\K(*ACCEPT))(?|-*+([^->][^-]*)|-+(?=-->)|-?+([^-]+)|-->[^<]*(*SKIP)(*FAIL)()))

details:

(?| 
    \G(?!\A) # contiguous with the previous match (inside a comment)

    (?|
        -{2,}+([^->][^-]*) # duplicate hyphens, not part of the closing sequence
      |
         (-[^-]+)          # preserve isolated hyphens 
      |
         -+ (?=-->)        # hyphens before closing sequence, break contiguity
      |
         -->[^<]*          # closing sequence, go to next <
         (*SKIP)(*FAIL)    # break contiguity
    )
  |
    [^<]*<+ # reach the next < (outside comment)
    (?> [^<]+ <+ )*?       # next < until !-- or the end of the string 
    (?: !-- \K | [^<]*\z\K (*ACCEPT) ) # new comment or end of the string
    (?|
        -*+ ([^->][^-]*)   # possible hyphens not followed by >
      |
        -+ (?=-->)         # hyphens before closing sequence, break contiguity
      |
        -?+ ([^-]+)        # one hyphen followed by >
      |
        -->[^<]*           # closing sequence, go to next <
        (*SKIP)(*FAIL) ()  # break contiguity (note: "()" avoids a mysterious bug
    )                      # in regex101, you can remove it)
)

With this replacement: \1


The \G anchor ensures that matches are contiguous. Contiguity is broken in two ways:

  • a lookahead (?=-->)
  • the backtracking control verbs (*SKIP)(*FAIL), which force the pattern to fail and prevent the characters already matched from being retried.

So when contiguity is broken, or at the beginning of the string, the first main branch fails (because of \G(?!\A)) and the second branch is used.

\K discards everything matched to its left from the match result.

(*ACCEPT) makes the pattern succeed immediately and unconditionally.

This pattern makes heavy use of the branch reset feature (?|...(..)...|...(..)...|...), so all capturing groups share the same number (in other words, there is only one group, group 1).

Note: even though this pattern is long, it needs few steps to obtain a match. The impact of non-greedy quantifiers is reduced as much as possible, and the alternatives are ordered and written to be as efficient as possible. One of the goals is to reduce the total number of matches needed to process a string.
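
For comparison, here is a minimal sketch of the same cleanup done in two steps rather than with a single PCRE pattern. It is written in Python, whose built-in re module has no \G, \K, branch reset or backtracking control verbs, so it is a different technique from the pattern above: each comment is located first, then its body is edited in a callback. The function name strip_extra_hyphens and the sample string are only illustrative, and the sketch assumes that the first --> after a <!-- is the end of that comment.

```python
import re

# Assumption: a comment runs from "<!--" to the first following "-->".
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def strip_extra_hyphens(xml_text):
    """Delete offending hyphens inside comments; text outside comments is untouched."""

    def clean(match):
        # Drop every run of two or more hyphens inside the comment body.
        body = re.sub(r"-{2,}", "", match.group(1))
        # A body must not end with a hyphen either, or the close would read "--->".
        body = body.rstrip("-")
        return "<!--" + body + "-->"

    return COMMENT.sub(clean, xml_text)


sample = "<a><!-- x -- y - z ---></a>"
print(strip_extra_hyphens(sample))  # <a><!-- x  y - z --></a>
```

Isolated hyphens are kept, as in the pattern above. This version trades the single-pass efficiency of the PCRE pattern for readability, since every comment body is scanned a second time.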

Casimir et Hippolyte answered Nov 15 '22 06:11