
Error Tolerant HTML/XML/SGML parsing in PHP

I have a bunch of legacy documents that are HTML-like. That is, they look like HTML, but contain additional made-up tags that aren't part of HTML:

<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>

I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.

My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags and refuse to parse the string/file.

$oDom = new DOMDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
// gives us:
// DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....

The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).

Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?

(If it's not obvious, I don't consider regular expressions a valid solution here.)

Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DOMDocument's loadHTML method in the first place.

Alan Storm asked Sep 15 '08 20:09


2 Answers

You can suppress the warnings with libxml_use_internal_errors() while loading the document. E.g.:

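libxml_use_internal_errors(true);  // collect libxml errors internally instead of raising PHP warnings
$doc = new DOMDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false); // switch back to normal error reporting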

If, for some reason, you need access to the warnings, use libxml_get_errors().
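To make that concrete, here is a minimal sketch of reading the collected warnings and then pulling the fake tag back out of the parsed document. The helper name loadLenientHtml is made up for illustration. It is not part of PHP or of the original answer, and the exact messages libxml reports may differ between libxml2 versions.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer): parse
// lenient HTML and return the document plus whatever libxml reported.
function loadLenientHtml(string $html): array
{
    // Remember the previous setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Copy the collected LibXMLError objects, then empty libxml's buffer.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc, $errors];
}

[$doc, $errors] = loadLenientHtml(
    '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>'
);

// Inspect any warnings libxml recorded during parsing.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The unknown element is kept in the tree, so it can be queried like any other.
foreach ($doc->getElementsByTagName('pseud-template') as $node) {
    echo $node->textContent, "\n";
}
```

Restoring the previous value instead of hard-coding false keeps the helper from changing error handling for the surrounding code.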

troelskn answered Oct 05 '22 20:10


I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? It might be worth a look: if you can get the document to be well formed, maybe you could load it as a regular XML file with DOMDocument.
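One way this first pass might look, as a rough sketch: HTML Tidy has a new-inline-tags configuration option for declaring custom inline tags, which may address the concern that Tidy would otherwise discard them. The helper name tidyToXml is hypothetical. The sketch assumes PHP's tidy extension is installed and returns null when it is not. The exact markup Tidy emits is not shown here because it can vary by Tidy version.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer): run the
// markup through Tidy, declaring the made-up tags so they are not discarded.
function tidyToXml(string $html, array $customInlineTags): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not available
    }

    $config = [
        'new-inline-tags'  => implode(',', $customInlineTags), // declare the fake tags
        'output-xml'       => true, // ask Tidy for well-formed XML output
        'numeric-entities' => true, // avoid named entities that plain XML does not define
    ];

    $xml = tidy_repair_string($html, $config, 'utf8');

    return $xml === false ? null : $xml;
}

$xml = tidyToXml(
    '<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>',
    ['pseud-template']
);

if ($xml !== null) {
    // With well-formed output, the stricter XML loader can be used.
    $doc = new DOMDocument();
    if ($doc->loadXML($xml)) {
        foreach ($doc->getElementsByTagName('pseud-template') as $node) {
            echo $node->textContent, "\n";
        }
    }
}
```

Block-level or empty custom tags would need the related new-blocklevel-tags and new-empty-tags options instead.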

Paul Dixon answered Oct 05 '22 19:10