How can I prevent XML::XPath from fetching a DTD while processing an XML file?

Tags: xml, perl, dtd

My XML (a.xhtml) starts like this

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...

My code starts like this

use XML::XPath;

use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => "a.xhtml");

my $nodeset = $xp->find('/html/body//table'); 

It's very slow, and it turns out that it spends a lot of time getting the DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).

Is there a way to explicitly declare an HTTP proxy server in the Perl XML:: family? I would rather not modify the original a.xhtml document, for example by pointing it at a local copy of the DTD.

yogman asked Nov 19 '08 21:11



2 Answers

XML::XPath is based on XML::Parser. There is an option in XML::Parser to NOT use LWP to resolve external entities (such as DTDs). And XML::XPath lets you pass an XML::Parser object to use as the parser.

So you can write this:

my $p = XML::Parser->new( NoLWP => 1);
my $xp= XML::XPath->new( parser => $p, filename => "a.xhtml");

Note that in this case you will lose all entities except numerical ones and the default ones (&gt;, &lt;, &amp;, &apos; and &quot;). The parser will not complain, but they will disappear silently (try including &alpha; in the table and printing it, for example).

As a matter of fact you probably should not use XML::XPath, which is not actively maintained.

Try XML::LibXML if you have no problem with installing libxml2. Its interface is very similar to XML::XPath, as they both implement the DOM. XML::LibXML is also much more powerful than XML::XPath, and faster to boot. If you want an expat/XML::Parser based module, then you might want to have a look at XML::Twig (that's blatant self-promotion as I am the author of the module, sorry). Also, for HTML/dodgy XHTML, you can use HTML::TreeBuilder, which, with the addition of HTML::TreeBuilder::XPath (also by me), supports XPath.
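
For the XML::LibXML route, here is a minimal sketch of what the equivalent of the question's code might look like. It is not taken from the answer above: the inline document is a made-up stand-in for a.xhtml, the helper name find_tables and the prefix x are arbitrary, and it assumes the document declares the XHTML namespace (in which case the XPath expression needs a registered prefix). The load_ext_dtd and no_network parser options are what keep libxml2 from fetching the DTD.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical stand-in for a.xhtml; only the DOCTYPE comes from the question.
my $xhtml = <<'XHTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>t</title></head>
  <body><div><table><tr><td>cell</td></tr></table></div></body>
</html>
XHTML

# find_tables is an illustrative helper, not part of any module.
sub find_tables {
    my ($source) = @_;

    my $doc = XML::LibXML->load_xml(
        string       => $source,
        load_ext_dtd => 0,    # do not load the external DTD subset
        no_network   => 1,    # refuse any network access while parsing
    );

    # Assumption: elements are in the XHTML namespace, so bind a prefix.
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );

    return $xpc->findnodes('/x:html/x:body//x:table');
}

my @tables = find_tables($xhtml);
print scalar(@tables), " table(s) found\n";
```

To read the real file instead, replace string => $source with location => "a.xhtml". If the document has no xmlns declaration, the plain '/html/body//table' expression from the question should work without the prefix.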

mirod answered Oct 25 '22 08:10


porneL's response seems to be the Right Thing here. (www.w3.org has started taking 30 seconds to respond to each of my queries (when it doesn't just give up), and when XML::XPath ends up retrieving the full XHTML set…!) Further, mirod's idea works, too:

use XML::Parser;
use XML::XPath;
use XML::Catalog;

# Resolve external entities (the DTD) through a local catalog, not the network.
my $parser  = XML::Parser->new;
my $catalog = XML::Catalog->new("xhtml1-20020801/DTD/xhtml.soc");
$parser->setHandlers(ExternEnt => $catalog->get_handler($parser));

# $xml holds the XML document as a string.
my $xp = XML::XPath->new(xml => $xml, parser => $parser);

Add a copy of "The complete set of DTD files together with an XML declaration and SGML Open Catalog" from 〈URL:http://www.w3.org/TR/xhtml1/dtds.html〉 and enjoy!

Anonymous Coward answered Oct 25 '22 08:10