
Using Python and lxml to validate XML against an external DTD

I'm trying to validate an XML file against the external DTD referenced in its DOCTYPE declaration. Specifically:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
...the rest of the document...

I'm using Python 3.3 and the lxml module. From reading http://lxml.de/validation.html#validation-at-parse-time, I've thrown this together:

import sys
from lxml import etree

enexFile = open(sys.argv[2], mode="rb")  # sys.argv[2] is the path to an XML file in local storage.
enexParser = etree.XMLParser(dtd_validation=True)
enexTree = etree.parse(enexFile, enexParser)

From what I understand of validation.html, the lxml library should now take care of retrieving the DTD and performing validation. But instead, I get this:

$ ./mapwrangler.py validate notes.enex
Traceback (most recent call last):
  File "./mapwrangler.py", line 27, in <module>
    enexTree = etree.parse(enexFile, enexParser)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: Validation failed: no DTD found !, line 3, column 43

This surprises me, because if I turn off validation, the document parses just fine and print(enexTree.docinfo.doctype) gives me

$ ./mapwrangler.py validate notes.enex
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">

So it looks to me like there shouldn't be any problem finding the DTD.

Thanks for your help.

DanielF asked Oct 20 '25 09:10


1 Answer

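You need to pass no_network=False when constructing the parser object. This option is set to True by default, so the parser refuses to fetch the DTD from its http:// URL and validation fails with "no DTD found".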

From the documentation of parser options at http://lxml.de/parsing.html#parsers:

no_network - prevent network access when looking up external documents (on by default)
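
As a rough sketch of how this could look in practice: the helper names make_validating_parser and validate below are hypothetical, as are the GOOD and BAD sample documents. The demonstration at the bottom uses an internal DTD subset so it runs offline; for a DOCTYPE pointing at an http:// URL, as in the question, the fetch additionally depends on the underlying libxml2 build supporting network access.

```python
import io
from lxml import etree


def make_validating_parser(allow_network=True):
    """Build a parser that validates against the DTD named in the DOCTYPE.

    lxml sets no_network=True by default, which blocks fetching a DTD
    over the network; no_network=False lifts that restriction.
    """
    return etree.XMLParser(dtd_validation=True, no_network=not allow_network)


def validate(source, allow_network=True):
    """Return (True, None) if the document is valid, else (False, message).

    `source` may be a file path or a binary file-like object.
    """
    try:
        etree.parse(source, make_validating_parser(allow_network))
    except etree.XMLSyntaxError as exc:
        # With dtd_validation=True, validity errors surface as XMLSyntaxError.
        return False, str(exc)
    return True, None


# Offline demonstration with an internal DTD subset (made-up sample data).
DTD = b"<!DOCTYPE root [<!ELEMENT root (child)><!ELEMENT child (#PCDATA)>]>"
GOOD = DTD + b"<root><child>x</child></root>"
BAD = DTD + b"<root/>"  # violates the content model: <child> is required

print(validate(io.BytesIO(GOOD)))
print(validate(io.BytesIO(BAD))[0])
```

For the file from the question, the call would be something like validate(sys.argv[2]), which lets lxml retrieve evernote-export3.dtd from the URL in the DOCTYPE.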

mzjn answered Oct 22 '25 00:10


