
Looking for a PHP script that can clean up bad HTML

I am in the process of writing a PHP command line script to convert hundreds of HTML snippets into Markdown using the Markdownify library. However, I have come across a situation where some of my HTML is not structured well enough to be used with Markdownify. So I first need to send my HTML through some library that can clean it up and add optional closing tags, etc. I will be working with partial blocks of HTML, not complete HTML documents, so the HTML that is returned must be partial (and not include the doctype, etc).

Do you know of a PHP script that can convert HTML to XHTML?

Solution:

Utilize the PHP DOMDocument class. It will format your HTML even if it is broken. Then you can extract the cleaned up HTML:

libxml_use_internal_errors(true); //use this to prevent warning messages from displaying because of the bad HTML

$doc = new DOMDocument();
$doc->loadHTML($badHtml);
$goodHtml = $doc->saveHTML();

This will return a full HTML document (with the cleaned up version in the body tag), even though I passed it a partial block of HTML, so I can extract the cleaned up partial with this regex:

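// ereg_replace() was removed in PHP 7, so preg_replace() is used here; the "s" flag lets "." match newlines
$goodHtmlPartial = trim(preg_replace('~^.*<body[^>]*>(.*)</body>.*$~is', '$1', $goodHtml));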
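
The regex step can probably be avoided by serializing only the children of the body element. The following is a minimal sketch, not a definitive implementation: the function name `repairHtmlFragment` is hypothetical, and the `<?xml encoding="utf-8" ?>` prefix is a commonly used workaround (an assumption here, not something from the original post) so that UTF-8 input is not read as ISO-8859-1.

```php
<?php
// Hypothetical helper: repairs an HTML fragment with DOMDocument and returns
// it without the <!DOCTYPE>, <html>, and <body> wrapper that loadHTML() adds.
function repairHtmlFragment(string $badHtml): string
{
    if (trim($badHtml) === '') {
        return ''; // nothing to repair
    }

    // Collect parser warnings internally instead of displaying them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Assumption: the input is UTF-8; the prefix hints the encoding to libxml.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $badHtml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> separately, so no wrapper tags appear.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return trim($out);
}

// Usage: the unclosed <b> and <p> tags get closed by the parser.
echo repairHtmlFragment('<p>Hello <b>world'), "\n";
```

Passing a node to `saveHTML()` serializes just that node, so nothing has to be stripped from the output afterwards.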
Andrew asked Dec 08 '10 00:12



1 Answer

Is there any reason not to use tidy?

http://php.net/manual/en/book.tidy.php

It can clean up your HTML and return only the body section.

$tidy = tidy_repair_string($content, array(
    'indent'           => true,
    'output-html'      => true,
    'wrap'             => 80,
    'show-body-only'   => true,
    'clean'            => true,
    'input-encoding'   => 'utf8',
    'output-encoding'  => 'utf8',
    'logical-emphasis' => false,
    'bare'             => true,
));
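
Since the question asks for XHTML specifically, a variant of the call above with `output-xhtml` in place of `output-html` may be closer to what is needed. This is only a sketch: the wrapper name `tidyFragmentToXhtml` is hypothetical, it assumes the tidy extension is installed and the input is UTF-8, and the option set is trimmed to the few that matter for fragments.

```php
<?php
// Hypothetical wrapper around tidy_repair_string() for partial HTML blocks.
function tidyFragmentToXhtml(string $html): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not installed.');
    }

    $config = array(
        'output-xhtml'   => true, // emit XHTML-style markup, e.g. <br />
        'show-body-only' => true, // omit doctype, <html>, <head>, and <body>
        'wrap'           => 0,    // 0 disables line wrapping
    );

    // The third argument sets the input and output encoding (assumed UTF-8).
    $result = tidy_repair_string($html, $config, 'utf8');

    return trim((string) $result);
}
```

Disabling wrapping is a design choice for the Markdown conversion step, where inserted line breaks could otherwise end up in the converted text.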
Yisrael Dov answered Nov 12 '22 16:11
