
regex to remove doctype

I am looking for a regex to strip the following doctype declarations from a set of xml documents:

<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"
          "http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">

This is a very common question on Stack Overflow and elsewhere, but none of the answers I found can deal with both cases.

My naive approach of <!DOCTYPE((.|\n|\r)*?)(\"|])> correctly matches the second case but fails on the first one: it stops at the first "> and leaves %mathent; ]> unmatched. If I try to make the regex greedier, it consumes the whole document instead.
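To make the failure concrete, here is a minimal Python sketch of the naive pattern applied to both declarations. The trailing <refentry/> and <book/> elements are placeholders added for illustration and are not part of the original test cases; the "]" is escaped only for readability, which should not change what the pattern matches.

```python
import re

# The naive pattern from the question ("]" escaped only for readability).
NAIVE = re.compile(r'<!DOCTYPE((.|\n|\r)*?)("|\])>')

# Hypothetical miniature documents built around the two declarations.
first = ('<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>'
         '\n<refentry/>')
second = ('<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"\n'
          '          "http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">'
          '\n<book/>')

naive_first = NAIVE.sub('', first)    # lazy match ends at the first "> it sees
naive_second = NAIVE.sub('', second)  # the only "> is the real end of the DOCTYPE

print(repr(naive_first))
print(repr(naive_second))
```

In the first document the lazy quantifier stops at the "> that closes the entity declaration, so the tail of the internal subset is left behind.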

Complete test cases:

  • first
  • second
The Fiddler asked Mar 29 '14 16:03


1 Answer

EDIT: Fixed the comment match, thanks TheFiddler

Well, you could use something like this (not entirely beautiful) pattern:

<!DOCTYPE[^>[]*(\[[^]]*\])?>

It matches <!DOCTYPE and everything up to a > or [, followed by an optional section surrounded by [], followed by a final >.

A JSfiddle to test with.

More detail:

<!DOCTYPE     -- matches the string <!DOCTYPE
[^>[]*        -- matches anything up to a > or [
(\[[^]]*\])?  -- matches an optional section surrounded by []
>             -- matches the string >
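As a rough check, the pattern can be tried in Python against both declarations from the question. This is only a sketch: the helper name strip_doctype and the trailing <refentry/> and <book/> elements are made up for illustration, and the brackets inside the character classes are escaped because some engines (JavaScript, for instance) read an unescaped [^] as a complete class of its own.

```python
import re

# Same pattern as above, with the brackets inside the classes escaped.
DOCTYPE = re.compile(r'<!DOCTYPE[^>\[]*(\[[^\]]*\])?>')

def strip_doctype(xml: str) -> str:
    """Remove every DOCTYPE declaration that the pattern matches."""
    return DOCTYPE.sub('', xml)

# Hypothetical miniature documents built around the two declarations.
first = ('<!DOCTYPE refentry [ <!ENTITY % mathent SYSTEM "math.ent"> %mathent; ]>'
         '\n<refentry/>')
second = ('<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook MathML Module V1.1b1//EN"\n'
          '          "http://www.oasis-open.org/docbook/xml/mathml/1.1CR1/dbmathml.dtd">'
          '\n<book/>')

stripped_first = strip_doctype(first)
stripped_second = strip_doctype(second)

print(repr(stripped_first))
print(repr(stripped_second))
```

One caveat: because the optional section ends at the first ], an internal subset that itself contains a ] (inside a quoted entity value, say) would presumably not be matched as a whole.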
Joachim Isaksson answered Nov 12 '22 01:11