Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex: Match a string up to a pattern (Which may not exist) [duplicate]

I'm trying to parse an XML document using a regex tokenizer in Python (It's a finite set, so regex is just fine!), and I'm having trouble matching comments properly.

The format of these comments is in the form <!--This is a comment--> where the comment itself can contain all sorts of non-alphanumeric characters (including '-')

I want to match them in such a way that I break the comment up into the following tokens:

<!--

This is a comment

-->

The beginning marker is easy enough to get, and I successfully grab that with another regex, but the comment regex itself is getting too greedy and grabbing the -- from the end marker. I want this regex to grab strings that are not necessarily enclosed in a comment as well, though, so it should also be able to take <Tag>This is text</Tag> and correctly return This is text.

This is the regex I am currently using for the text:

[^<>]+(?!-->)

The end result ends up being This is a comment--, when I just want This is a comment so that my other regex can grab the -->. This regex does work for normal tags, however, due to the existence of the '<' on the end tag and properly returns This is text from my previous example.

I know I must not be using negative lookahead properly. Any ideas on what I am doing wrong here? I've tried [^<>]+(?=-->), but then that doesn't match anything that's not a comment of this form (like the normal tags). I figured that the (?!-->) would stop the matching when it sees that pattern, but it doesn't seem to work like that and just continues matching until it sees the ending '>'.

Posting a segment of code for context:

xml_scanner = re.Scanner([
    (r"  ",             lambda scanner,token:("INDENT", token)),
    (r"<[A-Za-z\d._]+(?!\/)>",  lambda scanner,token:("BEGINTAG", token)),
    (r"<\/[A-Za-z\d._]+(?!\/)>",  lambda scanner,token:("ENDTAG", token)),
    (r"<[A-Za-z\d._]+\/>",      lambda scanner,token:("INLINETAG", token)),
    (r"<!--",               lambda scanner,token:("BEGINCOMMENT", token)),
    (r"-->",                lambda scanner,token:("ENDCOMMENT", token)),
    (r"[^<>]+(?!-->)",         lambda scanner,token:("DATA", token)),
    (r"\r$", None),
])

for line in database_file:
    results, remainder = xml_scanner.scan(line)

This is the only thing the script is doing at the moment.

like image 301
Austin Avatar asked Dec 16 '22 09:12

Austin


2 Answers

Your problem is that in [^<>]+(?!-->), the lookahead assertion is only tried after This is a comment-- has been matched, and of course it succeeds because there is no --> at the position where the regex engine has ended up.

Therefore, you have to check the lookahead at every position of the string. The proper idiom for this is generally:

(?:(?!-->).)*

or, in your case

(?:(?!-->)[^<>])*

This matches any number of characters except angle brackets until either --> occurs or the end of the string is reached.

like image 125
Tim Pietzcker Avatar answered May 16 '23 08:05

Tim Pietzcker


You need to use a positive lookahead, to ensure that this pattern is ahead of your match.

[^<>]+(?=-->)

If you want to make it more universal, so that it matches content of other tags too, you can use an alternation inside the lookahead assertion:

[^<>]+(?=-->|</)

See it here on Regexr

This regex will stop, if there is "-->" or "</" ahead.

like image 34
stema Avatar answered May 16 '23 09:05

stema