
How to parse invalid HTML with Perl?

I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn't know proper HTML, so they often wrote things like:

<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>

I tried using HTML::TreeBuilder to parse this HTML but after parsing it and dumping the resulting tree, all the elements between <div class="highlight">...</div> are gone. I'm left with just <div class="highlight"></div>.

The editors have also often done things like:

<div class="article"><style>@font-face {   font-family: "Cambria"; }</style>Article starts here</div>

Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.

Any ideas how to approach this broken HTML and actually make sense out of it?

bodacydo asked Jul 04 '12 21:07




2 Answers

I would first run it through HTML::Tidy:

#!/usr/bin/env perl

use strict; use warnings;
use HTML::Tidy;

my $html = <<EO_HTML;
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML

my $tidy = HTML::Tidy->new;

print $tidy->clean( $html );

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta name="generator" content="tidyp for Windows (v1.04), see www.w3.org">
<title></title>
</head>
<body>
<div class="highlight">
<p>Note that ...</p>
</div>
</body>
</html>

You can control the output by setting various configuration options.
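
As a rough sketch of what such configuration might look like: HTML::Tidy's constructor accepts a hash reference, and keys other than config_file are handed to the underlying tidy library as options. The three options below are standard tidy settings written with underscores; whether each one is honored depends on the tidy library your HTML::Tidy was built against, so treat the selection as an assumption to verify locally.

```perl
use strict; use warnings;
use HTML::Tidy;

# Options are passed through to the tidy library; availability may
# vary with the tidy/tidyp version installed (assumption: a version
# that supports all three).
my $tidy = HTML::Tidy->new({
    tidy_mark      => 0,   # do not add the <meta name="generator"> element
    show_body_only => 1,   # emit only the contents of <body>
    wrap           => 0,   # do not wrap long lines
});

my $html = '<div class="highlight"><html><head></head>'
         . '<body><p>Note that ...</p></html></div>';

# clean() returns the repaired markup as a string.
my $cleaned = $tidy->clean($html);
print $cleaned;
```

With show_body_only set, the result should be a fragment rather than a full document, which may be more convenient if the cleaned markup goes back into the article database.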

Then, feed the cleaned HTML through a parser.
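
A minimal sketch of that two-step pipeline, assuming HTML::TreeBuilder is the parser of choice (as in the question). The look_down query and the variable names are only illustrative; adapt them to whatever you need to extract.

```perl
use strict; use warnings;
use HTML::Tidy;
use HTML::TreeBuilder;

my $html = <<'EO_HTML';
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML

# Step 1: let tidy repair the markup.
my $tidy    = HTML::Tidy->new({ tidy_mark => 0 });
my $cleaned = $tidy->clean($html);

# Step 2: build a tree from the repaired document.
my $tree = HTML::TreeBuilder->new_from_content($cleaned);

# Find the div by tag name and class, then extract its text.
my $div  = $tree->look_down(_tag => 'div', class => 'highlight');
my $text = $div ? $div->as_trimmed_text : '';
print "$text\n";

# Free the tree explicitly; older HTML::Tree versions require this.
$tree->delete;
```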

Otherwise, you can try building a tree one step at a time using HTML::TokeParser::Simple or even just HTML::Parser, but I believe that way lies insanity.

Keep in mind that a parser that tries to build a tree representation will be stricter than a stream parser that just recognizes various elements as it sees them.
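
To illustrate the difference, here is a small sketch that runs the second snippet from the question through HTML::TokeParser::Simple. It builds no tree; it only walks the token stream and collects the text that appears outside <style>. The $in_style flag and the @text array are names made up for this example.

```perl
use strict; use warnings;
use HTML::TokeParser::Simple;

my $html = '<div class="article"><style>@font-face { font-family: "Cambria"; }'
         . '</style>Article starts here</div>';

my $parser = HTML::TokeParser::Simple->new(string => $html);

my @text;          # text chunks found outside <style>
my $in_style = 0;  # true while between <style> and </style>

while (my $token = $parser->get_token) {
    if    ($token->is_start_tag('style')) { $in_style = 1 }
    elsif ($token->is_end_tag('style'))   { $in_style = 0 }
    elsif ($token->is_text && !$in_style) { push @text, $token->as_is }
}

print join('', @text), "\n";
```

Because the stream parser never has to decide where an element belongs in a document tree, misplaced tags such as <style> inside a <div> are simply reported as they are encountered.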

Sinan Ünür answered Oct 21 '22 21:10


You can try Marpa::HTML, a high-level HTML parser that allows extremely liberal parsing. It can parse even invalid HTML using a technique its author calls "Ruby Slippers": Marpa::HTML adds the elements that should be there but are missing.

For an example of reformatting, prettifying, and validating a sample of invalid HTML, see the "How to Parse HTML" blog post by Jeffrey Kegler, the author of the Marpa parser and Marpa::HTML.

Jakub Narębski answered Oct 21 '22 22:10