How can I prevent XML::XPath from fetching a DTD while processing an XML file?

Tags: xml, perl, dtd

My XML (a.xhtml) starts like this

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...

My code starts like this

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => "a.xhtml");
my $nodeset = $xp->find('/html/body//table');

It's very slow, and it turns out that it spends a lot of time fetching the DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).

Is there a way to explicitly declare an HTTP proxy server in the Perl XML:: family? I would rather not modify the original a.xhtml document, for example by pointing it at a local copy of the DTD.

asked Nov 19 '08 21:11

yogman

2 Answers

XML::XPath is based on XML::Parser. XML::Parser has an option, NoLWP, that tells it NOT to use LWP to resolve external entities (such as DTDs), and XML::XPath lets you pass in an XML::Parser object to use as the parser.

So you can write this:

my $p = XML::Parser->new( NoLWP => 1);
my $xp= XML::XPath->new( parser => $p, filename => "a.xhtml");

Note that in this case you will lose all entities except numeric ones and the default ones (>, <, &, ' and "). The parser will not complain, but they will disappear silently (try including &alpha; in the table and printing it, for example).
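To see the entity behaviour described above for yourself, here is a minimal sketch. The inline document, the bracketed labels and the variable names are made up for illustration; only the NoLWP option and the XML::XPath constructor arguments come from the answer. It deliberately claims no expected output, so run it and compare the three bracketed values.

```perl
use strict;
use warnings;
use XML::Parser;
use XML::XPath;

# Hypothetical stand-in for a.xhtml: one named, one default and one
# numeric entity inside the table.
my $xml = <<'XHTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><table><tr><td>named=[&alpha;] default=[&amp;] numeric=[&#945;]</td></tr></table></body></html>
XHTML

# Same setup as in the answer: a parser that does not use LWP.
my $p  = XML::Parser->new(NoLWP => 1);
my $xp = XML::XPath->new(parser => $p, xml => $xml);

# String value of the first matching table.
my $text = $xp->findvalue('/html/body//table');

binmode STDOUT, ':encoding(UTF-8)';   # the numeric entity is a non-ASCII character
print "$text\n";
```

If the named entity matters to you, this is a good reason to look at the catalog-based approach or at a different parser.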

As a matter of fact, you probably should not use XML::XPath, which is not actively maintained.

Try XML::LibXML if you have no problem with installing libxml2. Its interface is very similar to XML::XPath, as they both implement the DOM. XML::LibXML is also much more powerful than XML::XPath, and faster to boot. If you want an expat/XML::Parser-based module, then you might want to have a look at XML::Twig (that's blatant self-promotion, as I am the author of the module, sorry). Also, for HTML or dodgy XHTML you can use HTML::TreeBuilder, which supports XPath with the addition of HTML::TreeBuilder::XPath (also by me).
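For the XML::LibXML route, a rough sketch of the same query might look like this. It assumes the load_ext_dtd and no_network parser options are what you want for skipping the DTD fetch, and it uses a made-up inline document in place of a.xhtml (with a real file you would pass location => "a.xhtml" instead of string).

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for a.xhtml; the DOCTYPE still points at the W3C DTD.
my $xml = <<'XHTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>t</title></head>
<body><div><table><tr><td>cell</td></tr></table></div></body></html>
XHTML

# load_ext_dtd => 0 asks the parser not to load the external DTD;
# no_network => 1 additionally forbids network access while parsing.
my $doc = XML::LibXML->load_xml(
    string       => $xml,
    load_ext_dtd => 0,
    no_network   => 1,
);

# Same XPath expression as in the question.
my @tables = $doc->findnodes('/html/body//table');
printf "found %d table(s)\n", scalar @tables;
```

One caveat: if the real document declares the XHTML namespace on its html element, an unprefixed path like this will not match in XML::LibXML, and you would need an XML::LibXML::XPathContext with a registered prefix. Named entities such as &alpha; are also likely to need the DTD, so test with your real data.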

answered Oct 25 '22 08:10

mirod

porneL's response seems to be the Right Thing here. (www.w3.org has started taking 30 seconds to respond to each of my queries, when it doesn't just give up, and when XML::XPath ends up retrieving the full XHTML set…!) Further, mirod's idea works, too:

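use XML::Parser;
use XML::XPath;
use XML::Catalog;

# Resolve external entities (the XHTML DTDs) through a local catalog, not the network.
my $parser  = XML::Parser->new;
my $catalog = XML::Catalog->new("xhtml1-20020801/DTD/xhtml.soc");
$parser->setHandlers(ExternEnt => $catalog->get_handler($parser));

# $xml is assumed to already hold the document text.
my $xp = XML::XPath->new(xml => $xml, parser => $parser);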

Add a copy of "The complete set of DTD files together with an XML declaration and SGML Open Catalog" from 〈URL:http://www.w3.org/TR/xhtml1/dtds.html〉 and enjoy!

answered Oct 25 '22 08:10

Anonymous Coward
