Is there a way, when I parse an XML document using lxml, to validate that document against its DTD using an external catalog file? I need to be able to work the fixed attributes defined in a document’s DTD.
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.
The Python standard library provides a minimal but useful set of interfaces to work with XML. The two most basic and broadly used APIs to XML data are the SAX and DOM interfaces. Simple API for XML (SAX) − Here, you register callbacks for events of interest and then let the parser proceed through the document.
You can add the catalog to the XML_CATALOG_FILES
environment variable:
os.environ['XML_CATALOG_FILES'] = 'file:///to/my/catalog.xml'
See this thread. Note that entries in XML_CATALOG_FILES
are space-separated URLs. You can use Python's pathname2url
and urljoin
(with file:
) to generate the URL from a pathname.
Can you give an example? According to the lxml validation docs, lxml can handle DTD validation (specified in the XML doc or externally in code) and system catalogs, which covers most cases I can think of.
f = StringIO("<!ELEMENT b EMPTY>")
dtd = etree.DTD(f)
dtd = etree.DTD(external_id = "-//OASIS//DTD DocBook XML V4.2//EN")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With