How to parse invalid HTML with Perl?

Tags:

html

parsing

html-parsing

perl

I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn't know proper HTML, so they often wrote things like:

<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>

I tried using HTML::TreeBuilder to parse this HTML, but after parsing it and dumping the resulting tree, all the elements inside <div class="highlight">...</div> are gone. I'm left with just <div class="highlight"></div>.

The editors have also often done things like:

<div class="article"><style>@font-face {   font-family: "Cambria"; }</style>Article starts here</div>

Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.

Any ideas how to approach this broken HTML and actually make sense out of it?

345

asked Jul 04 '12 21:07

bodacydo

2 Answers

I would first run it through HTML::Tidy:

#!/usr/bin/env perl

use strict; use warnings;
use HTML::Tidy;

my $html = <<EO_HTML;
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML

my $tidy = HTML::Tidy->new;

print $tidy->clean( $html );

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta name="generator" content="tidyp for Windows (v1.04), see www.w3.org">
<title></title>
</head>
<body>
<div class="highlight">
<p>Note that ...</p>
</div>
</body>
</html>

You can control the output by setting various configuration options.
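
As a minimal sketch of what setting options can look like: HTML::Tidy->new accepts a hashref, and its documentation shows tidy options such as tidy_mark and output_xhtml being passed this way. Which options suit your articles is an assumption you would need to check against your own data.

```perl
#!/usr/bin/env perl

use strict; use warnings;
use HTML::Tidy;

my $html = <<'EO_HTML';
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML

# Options are passed as a hashref to the constructor.
my $tidy = HTML::Tidy->new( {
    tidy_mark    => 0,   # do not add the <meta name="generator"> element
    output_xhtml => 1,   # emit XHTML instead of HTML 4.01
} );

my $clean = $tidy->clean( $html );

print $clean;
```

With tidy_mark turned off, the generator meta element seen in the output above should no longer appear.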

Then, feed the cleaned HTML through a parser.
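
For instance, the two steps could be chained roughly like this. It is only a sketch: it assumes HTML::Tidy produces the structure shown in the output above, and it uses HTML::TreeBuilder (the parser from the question) to pull the text back out of the div.

```perl
#!/usr/bin/env perl

use strict; use warnings;
use HTML::Tidy;
use HTML::TreeBuilder;

my $html = <<'EO_HTML';
<div class="highlight"><html><head></head>
<body><p>Note that ...</p></html>
</div>
EO_HTML

# Step 1: let tidy repair the markup.
my $tidy  = HTML::Tidy->new( { tidy_mark => 0 } );
my $clean = $tidy->clean( $html );

# Step 2: build a tree from the repaired markup.
my $tree = HTML::TreeBuilder->new_from_content( $clean );

# Find the div by tag name and class attribute, then extract its text.
my $div  = $tree->look_down( _tag => 'div', class => 'highlight' );
my $text = $div ? $div->as_trimmed_text : '';

print "$text\n";

$tree->delete;   # free the tree explicitly
```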

Otherwise, you can try building a tree one step at a time using HTML::TokeParser::Simple or even just HTML::Parser, but I believe that way lies insanity.

Keep in mind that a parser that tries to build a tree representation will be stricter than a stream parser that just recognizes various elements as it sees them.
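
To illustrate the stream-parser side of that trade-off, here is a rough sketch using HTML::Parser directly. The extract_text helper is hypothetical (it is not part of any module); it collects text events as they arrive and asks the parser to skip style and script elements, without ever building a tree.

```perl
#!/usr/bin/env perl

use strict; use warnings;
use HTML::Parser;

# Hypothetical helper: return the visible text of an HTML fragment.
sub extract_text {
    my ($html) = @_;

    my @chunks;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Collect decoded text as the parser encounters it.
        text_h      => [ sub { push @chunks, $_[0] }, 'dtext' ],
    );

    # Suppress these elements together with their contents.
    $parser->ignore_elements( qw(style script) );

    $parser->parse( $html );
    $parser->eof;

    # Join the chunks and normalize whitespace.
    my $text = join '', @chunks;
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

print extract_text(
    '<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>'
), "\n";

print extract_text(
    '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>'
), "\n";
```

Because nothing here depends on the elements nesting correctly, the misplaced html and body tags do not cause any content to be dropped. The cost is that you lose the document structure and have to track any context you care about yourself.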

163

answered Oct 21 '22 21:10

Sinan Ünür

You can try Marpa::HTML, a high-level HTML parser that allows extremely liberal parsing. It can parse even invalid HTML using a technique its author calls "Ruby Slippers": Marpa::HTML adds the elements that should be there.

For an example of reformatting, prettifying, and validating invalid HTML, see the blog post "How to Parse HTML" by Jeffrey Kegler, the author of the Marpa parser and Marpa::HTML.

answered Oct 21 '22 22:10

Jakub Narębski
