Error Tolerant HTML/XML/SGML parsing in PHP

Question

I have a bunch of legacy documents that are HTML-like: they look like HTML, but contain additional made-up tags that aren't part of HTML.

<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>

I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.

My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags and refuse to parse the string/file.

$oDom = new DOMDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
//gives us
DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....

The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).
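A minimal sketch of that pre-processing idea, using plain string replacement rather than regular expressions. The function name replaceFakeTags and the $fakeTags list are hypothetical; it assumes the made-up tag names are known in advance, are written in lowercase, and carry no attributes. It uses a class instead of an id so that repeated tags don't produce duplicate ids.

```php
<?php
// Hypothetical list of made-up tag names; the real documents may use others.
$fakeTags = ['pseud-template'];

// Swap each known fake tag for a <span> that records the original tag name.
function replaceFakeTags(string $html, array $fakeTags): string
{
    foreach ($fakeTags as $tag) {
        // Only matches exact, attribute-free tags such as <pseud-template>.
        $html = str_replace("<$tag>", "<span class=\"$tag\">", $html);
        $html = str_replace("</$tag>", '</span>', $html);
    }
    return $html;
}

$html = '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>';
echo replaceFakeTags($html, $fakeTags), "\n";
```

Tags with attributes or different casing would slip through unchanged, which is part of why this approach feels fragile.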

Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?

(If it's not obvious, I don't consider regular expressions a valid solution here.)

Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DOMDocument's loadHTML method in the first place.

troelskn · Accepted Answer

You can suppress the warnings by calling libxml_use_internal_errors while loading the document. The parser still builds the tree, made-up tags included; only the warnings are silenced. E.g.:

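// Buffer libxml warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_clear_errors(); // discard the buffered warnings
libxml_use_internal_errors($previous); // restore the earlier setting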

If, for some reason, you need access to the warnings, call libxml_get_errors before clearing them.
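As a rough sketch of how the pieces fit together, the following loads the example string, reads whatever warnings libxml buffered, and then checks that the made-up element is present in the resulting tree. The variable names are illustrative only, and which warnings (if any) are reported depends on the installed libxml2 version.

```php
<?php
$html = '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>';

// Buffer libxml warnings instead of emitting them as PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the buffered LibXMLError objects, then reset the buffer and the setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Report any warnings the parser raised.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The unknown element is still reachable through the DOM.
foreach ($doc->getElementsByTagName('pseud-template') as $node) {
    echo $node->nodeName, ' => ', $node->textContent, "\n";
}
```

Since the fake tags survive parsing, their content can be read with the usual DOM methods, with no pre-processing step.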

Paul Dixon · Answer

I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? It might be worth a look: if you can get the document to be well formed, maybe you could load it as a regular XML file with DOMDocument.

Tags: html, php, parsing, xml

Alan Storm
