 

How do you handle malformed HTML in Perl?

I'm interested in a parser that could take a malformed HTML page, and turn it into well formed HTML before performing some XPath queries on it. Do you know of any?

Geo asked Oct 27 '09 20:10



1 Answer

You should not use an XML parser to parse HTML. Use an HTML parser.

Note that the following is perfectly valid HTML 4.01, yet an XML parser would choke on it because <meta>, <p>, <tr> and <td> are never closed:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
    "http://www.w3.org/TR/html4/strict.dtd">

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Is this valid?</title>
</head>

<body>

<p>This is a paragraph

<table>

<tr>  <td>cell 1  <td>cell 2
<tr>  <td>cell 3  <td>cell 4

</table>

</body>

</html>
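
To make the difference concrete, here is a rough sketch that feeds a shortened copy of the markup above (DOCTYPE omitted) to both kinds of parser. It assumes XML::LibXML and HTML::TreeBuilder are installed from CPAN; XML::LibXML is my choice of XML parser for illustration and is not something the question mentioned.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;         # assumption: any strict XML parser would behave similarly
use HTML::TreeBuilder;   # general-purpose HTML parser built on HTML::Parser

# Shortened copy of the document above: <meta>, <p>, <tr> and <td> are unclosed.
my $html = <<'HTML';
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Is this valid?</title>
</head>
<body>
<p>This is a paragraph
<table>
<tr>  <td>cell 1  <td>cell 2
<tr>  <td>cell 3  <td>cell 4
</table>
</body>
</html>
HTML

# An XML parser requires every element to be closed, so load_xml dies here;
# eval turns that exception into a false value.
my $xml_ok = eval { XML::LibXML->load_xml(string => $html); 1 };
print "XML parser: ", ($xml_ok ? "parsed" : "failed"), "\n";

# An HTML parser knows which end tags are optional and supplies them itself.
my $tree  = HTML::TreeBuilder->new_from_content($html);
my @cells = $tree->look_down(_tag => 'td');
print "HTML parser: found ", scalar(@cells), " cells\n";

$tree->delete;
```

Run as is, this should report a failure for the XML parser and four cells for the HTML parser.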

CPAN has many task-specific HTML parsers in addition to the general-purpose ones. They have worked perfectly for me on an immense variety of extremely messy (and, most of the time, invalid) HTML.

It would be possible to give specific recommendations if you can specify the problem you are trying to solve.

There is also HTML::TreeBuilder::XPath, which uses HTML::Parser to parse the document into a tree and then lets you query that tree using XPath. I have never used it, but see Randal Schwartz's "HTML Scraping with XPath".

Given the HTML file above (saved as valid.html), the following short script:

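#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

# parse_file returns undef if the file cannot be opened
$tree->parse_file("valid.html")
    or die "Cannot parse valid.html: $!";

# Text content of every <td> element, in document order
my @td = $tree->findnodes_as_strings('//td');

print "$_\n" for @td;

# Release the tree explicitly once we are done with it
$tree->delete;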

outputs:

C:\Temp> z
cell 1
cell 2
cell 3
cell 4

The key point here is that the document was parsed by an HTML parser as an HTML document (despite the fact that we were able to query it using XPath).
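
Since the question also asks about turning the page into well-formed markup, here is a minimal sketch of that step, assuming the same HTML::TreeBuilder::XPath module. The inline `$html` string and the variable names are made up for illustration. The call `as_HTML(undef, '  ', {})` comes from HTML::Element, where an empty hash reference as the third argument asks for every end tag to be written out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical input: a fragment with the same unclosed tags as valid.html.
my $html = '<p>This is a paragraph'
         . '<table><tr><td>cell 1<td>cell 2<tr><td>cell 3<td>cell 4</table>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# XPath with a positional predicate: the second cell of every row.
my @second = $tree->findnodes_as_strings('//tr/td[2]');
print "$_\n" for @second;

# Serialize the repaired tree. The empty hashref means "no end tag is
# optional", so </p>, </tr> and </td> all appear in the output.
my $well_formed = $tree->as_HTML(undef, '  ', {});
print $well_formed, "\n";

$tree->delete;
```

The first loop should print "cell 2" and "cell 4". The serialized output can then be saved or passed to other tools, although for XPath queries alone the tree can be queried directly, as above.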

Sinan Ünür answered Sep 22 '22 20:09
