I am looking for a regex to strip the following DOCTYPE declarations from a set of XML documents:
<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"
"http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">
This is a very common question on Stack Overflow and elsewhere, but none of the answers I found can actually deal with both cases.
My naive approach of <!DOCTYPE((.|\n|\r)*?)(\"|])> correctly matches the second case but fails on the first one: it stops at the first "> and leaves %mathent; ]> unmatched. If I try to make the regex greedy, it tries to consume the whole document instead.
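The failure described above can be reproduced with a short sketch. Python is an assumption here (the question does not name a language), and the names NAIVE_RE, first_case and second_case are made up for illustration:

```python
import re

# The naive pattern from the question, copied verbatim.
NAIVE_RE = re.compile(r'<!DOCTYPE((.|\n|\r)*?)(\"|])>')

# The two declarations from the question (hypothetical variable names).
first_case = '<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>'
second_case = (
    '<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"\n'
    '"http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">'
)

# The lazy quantifier stops at the first "> it finds, which in the first
# case is the end of the nested ENTITY declaration, not of the DOCTYPE.
print(repr(NAIVE_RE.sub("", first_case)))   # ' %mathent; ]>'
print(repr(NAIVE_RE.sub("", second_case)))  # ''
```

The second declaration is removed completely, while the first leaves the tail of the internal subset behind.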
Complete test cases:
EDIT: Fixed the comment match, thanks TheFiddler
Well, you could use something like the following (not entirely beautiful) pattern:
<!DOCTYPE[^>[]*(\[[^]]*\])?>
It matches <!DOCTYPE and everything up to a > or [, followed by an optional section surrounded by [], followed by a final >.
A JSfiddle to test with.
More detail:
<!DOCTYPE -- matches the string <!DOCTYPE
[^>[]* -- matches anything up to a > or [
(\[[^]]*\])? -- matches an optional section surrounded by []
> -- matches the string >
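As a rough check, here is a minimal Python sketch that applies the pattern to both declarations from the question. Python and the names DOCTYPE_RE and strip_doctype are assumptions for illustration. The brackets inside the character classes are escaped, which means the same thing in Python and avoids differences between regex flavours in how an unescaped [ or ] inside a class is parsed.

```python
import re

# Same pattern as above, with the brackets inside the classes escaped.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*(\[[^\]]*\])?>")


def strip_doctype(xml_text: str) -> str:
    """Remove every DOCTYPE declaration that DOCTYPE_RE matches."""
    return DOCTYPE_RE.sub("", xml_text)


# Declaration with an internal subset in [...]
internal_subset = (
    '<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>\n'
    "<refentry/>"
)

# Declaration with a public identifier spanning two lines
public_id = (
    '<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"\n'
    '"http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">\n'
    "<book/>"
)

print(repr(strip_doctype(internal_subset)))  # '\n<refentry/>'
print(repr(strip_doctype(public_id)))        # '\n<book/>'
```

Note that the negated classes also match newlines, so no multiline flag is needed. The pattern would still be confused by a ] inside the internal subset or a > inside a quoted identifier. For such documents a real XML parser is probably the safer tool.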