I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn't know proper HTML, so they have often written things like:
<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>
I tried using HTML::TreeBuilder to parse this HTML, but after parsing it and dumping the resulting tree, all the elements between <div class="highlight">...</div> are gone. I'm left with just <div class="highlight"></div>.
The editors have also often done things like:
<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>
Parsing this with HTML::TreeBuilder results in an empty <div class="article"></div> again.
Any ideas how to approach this broken HTML and actually make sense out of it?
HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The parser turns the tokenized input into a document tree.
Parse errors are errors in HTML syntax only. Conformance checkers also verify that a document obeys the other conformance requirements of the HTML specification.
I would first run it through HTML::Tidy:
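#!/usr/bin/env perl
use strict; use warnings;
use HTML::Tidy;
# Non-interpolating heredoc, so characters such as @ in the HTML stay literal
my $html = <<'EO_HTML';
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML
# Default configuration; clean() returns the repaired document as a string
my $tidy = HTML::Tidy->new;
print $tidy->clean( $html );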
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta name="generator" content="tidyp for Windows (v1.04), see www.w3.org">
<title></title>
</head>
<body>
<div class="highlight">
<p>Note that ...</p>
</div>
</body>
</html>
You can control the output by setting various configuration options.
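As a rough sketch of what that can look like: HTML::Tidy's constructor accepts a hashref of tidy options, written with underscores instead of hyphens. The three options picked here (tidy_mark, show_body_only, wrap) are just my choice for illustration, and the exact output depends on the tidy library your HTML::Tidy was built against, so treat this as a starting point rather than a recipe.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Tidy;

my $html = '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>';

# Option names are tidy's own, with hyphens written as underscores.
my $tidy = HTML::Tidy->new({
    tidy_mark      => 0,   # leave out the <meta name="generator"> element
    show_body_only => 1,   # emit only what ends up inside <body>
    wrap           => 0,   # do not wrap long lines
});

my $clean = $tidy->clean($html);
print $clean;
```

With show_body_only set, the result should be a fragment without the DOCTYPE and <html>/<head> wrapper, which is convenient if the articles are stored as fragments in the database.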
Then, feed the cleaned HTML through a parser.
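The tidy-then-parse idea might be sketched like this. It assumes HTML::TreeBuilder is the parser you want to keep using; the look_down() criteria simply match the div from the question, and I have not checked it against every kind of breakage your editors may have produced.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Tidy;
use HTML::TreeBuilder;

my $html = '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>';

# Step 1: let tidy repair the markup.
my $tidy  = HTML::Tidy->new({ tidy_mark => 0 });
my $clean = $tidy->clean($html);

# Step 2: build a tree from the repaired markup.
my $tree = HTML::TreeBuilder->new_from_content($clean);

# Collect the text of every <div class="highlight"> that survived.
my @found = map { $_->as_text }
            $tree->look_down( _tag => 'div', class => 'highlight' );
print "$_\n" for @found;

# HTML::Element trees should be deleted explicitly when done.
$tree->delete;
```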
Otherwise, you can try building a tree one step at a time using HTML::TokeParser::Simple or even just HTML::Parser, but I believe that way lies insanity.
Keep in mind that a parser that tries to build a tree representation will be stricter than a stream parser that just recognizes various elements as it sees them.
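To illustrate the stream approach, here is a minimal sketch using HTML::TokeParser::Simple on the second example from the question. It only pulls out the visible text and skips whatever sits inside <style>; the $in_style flag is my own bookkeeping, not something the module provides.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = '<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>';

my $parser = HTML::TokeParser::Simple->new( string => $html );

my $in_style = 0;    # true while we are between <style> and </style>
my $text     = '';

while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag('style') ) { $in_style = 1; next }
    if ( $token->is_end_tag('style') )   { $in_style = 0; next }
    next if $in_style;                   # drop the stylesheet content
    $text .= $token->as_is if $token->is_text;
}

print "$text\n";
```

Because nothing here tries to decide where elements belong in a tree, stray <html> or <body> tags in the middle of an article just show up as ordinary tokens that you can ignore.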
You can try Marpa::HTML, a high-level HTML parser that allows extremely liberal parsing. It can parse even invalid HTML using a technique its author calls "Ruby Slippers": Marpa::HTML adds the elements that should be there.
For an example of reformatting, prettifying, and making invalid HTML valid, see the blog post "How to Parse HTML" by Jeffrey Kegler, the author of the Marpa parser and Marpa::HTML.