I'm trying to parse an XML document using a regex tokenizer in Python (It's a finite set, so regex is just fine!), and I'm having trouble matching comments properly.
The format of these comments is in the form <!--This is a comment-->
where the comment itself can contain all sorts of non-alphanumeric characters (including '-')
I want to match them in such a way that I break the comment up into the following tokens:
<!--
This is a comment
-->
The beginning marker is easy enough to get, and I successfully grab that with another regex, but the comment regex itself is getting too greedy and grabbing the --
from the end marker. I want this regex to grab strings that are not necessarily enclosed in a comment as well, though, so it should also be able to take <Tag>This is text</Tag>
and correctly return This is text
.
This is the regex I am currently using for the text:
[^<>]+(?!-->)
The end result ends up being This is a comment--
, when I just want This is a comment
so that my other regex can grab the -->
. This regex does work for normal tags, however, due to the existence of the '<' on the end tag and properly returns This is text
from my previous example.
I know I must not be using negative lookahead properly. Any ideas on what I am doing wrong here? I've tried [^<>]+(?=-->)
, but then that doesn't match anything that's not a comment of this form (like the normal tags). I figured that the (?!-->)
would stop the matching when it sees that pattern, but it doesn't seem to work like that and just continues matching until it sees the ending '>'.
Posting a segment of code for context:
xml_scanner = re.Scanner([
(r" ", lambda scanner,token:("INDENT", token)),
(r"<[A-Za-z\d._]+(?!\/)>", lambda scanner,token:("BEGINTAG", token)),
(r"<\/[A-Za-z\d._]+(?!\/)>", lambda scanner,token:("ENDTAG", token)),
(r"<[A-Za-z\d._]+\/>", lambda scanner,token:("INLINETAG", token)),
(r"<!--", lambda scanner,token:("BEGINCOMMENT", token)),
(r"-->", lambda scanner,token:("ENDCOMMENT", token)),
(r"[^<>]+(?!-->)", lambda scanner,token:("DATA", token)),
(r"\r$", None),
])
for line in database_file:
results, remainder = xml_scanner.scan(line)
This is the only thing the script is doing at the moment.
Your problem is that in [^<>]+(?!-->)
, the lookahead assertion is only tried after This is a comment--
has been matched, and of course it succeeds because there is no -->
at the position where the regex engine has ended up.
Therefore, you have to check the lookahead at every position of the string. The proper idiom for this is generally:
(?:(?!-->).)*
or, in your case
(?:(?!-->)[^<>])*
This matches any number of characters except angle brackets until either -->
occurs or the end of the string is reached.
You need to use a positive lookahead, to ensure that this pattern is ahead of your match.
[^<>]+(?=-->)
If you want to make it more universal, so that it matches content of other tags too, you can use an alternation inside the lookahead assertion:
[^<>]+(?=-->|</)
See it here on Regexr
This regex will stop, if there is "-->
" or "</
" ahead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With