Following xs:include when parsing XSD as XML with lxml in Python

Tags: python, xml, xsd

I'm trying to do something a little unorthodox. I have a complicated set of XSD files, but I don't want to use them to validate an XML file; I want to parse the XSDs themselves as XML and interrogate them just as I would any other XML file. This is possible because XSDs are valid XML. I am using lxml with Python 3.

The problem I'm having is with the statement:

<xs:include schemaLocation="sdm-extension.xsd"/>

If I ask lxml to build a schema object for validation like this:

schema = etree.XMLSchema(schema_root)

this dependency is resolved (the file exists in the same directory as the one I've just loaded). However, when I parse the XSD as plain XML, lxml (correctly) treats xs:include as a normal element with an attribute and does not follow it.
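To illustrate, here is a minimal sketch using a made-up schema snippet (the element name Study is only an example). Parsed as plain XML, the xs:include is just another element whose schemaLocation can be read, and nothing is loaded from disk:

```python
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"

# Made-up schema fragment, for illustration only.
xsd_source = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="sdm-extension.xsd"/>
  <xs:element name="Study" type="xs:string"/>
</xs:schema>"""

root = etree.fromstring(xsd_source)

# xs:include appears as an ordinary child element of xs:schema.
includes = root.findall(f"{{{XS}}}include")
locations = [el.get("schemaLocation") for el in includes]

# Only the element declared in this document is visible.
declared = [el.get("name") for el in root.findall(f"{{{XS}}}element")]
print(locations, declared)
```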

Is there an easy or correct way to extend lxml so that xs:include gets the same or similar behaviour as, say, an XInclude element:

<xi:include href="metadata.xml" parse="xml" xpointer="title"/>

I could, of course, manually create a separate XML file that pulls in all the dependencies of the XSD schema. Is that perhaps a solution?

asked Nov 28 '25 11:11 by Oni


2 Answers

It seems that one option is to use xi:include and create a separate XML file that includes all the XSDs I want to parse. Something along the lines of:

<fullxsd xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="./xsd-cdisc-sdm-1.0.0/sdm1-0-0.xsd" parse="xml"/>
<xi:include href="./xsd-cdisc-sdm-1.0.0/sdm-ns-structure.xsd" parse="xml"/>
</fullxsd>

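Then expand the includes with lxml, along the lines of:

from lxml import etree

def combine(xsd_file):
    parser = etree.XMLParser(recover=True, encoding='utf-8',
                             remove_comments=True, remove_blank_text=True)
    # Parse from the file path so that relative hrefs are resolved
    # against the wrapper file rather than the current directory.
    tree = etree.parse(xsd_file, parser)
    # Replace every xi:include element with the content it points to.
    tree.xinclude()
    root = tree.getroot()
    print(etree.tostring(root, pretty_print=True).decode())
    return root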

It's not ideal, but it seems to be the proper way. I've looked at custom URI resolvers in lxml, but that would mean actually altering the XSDs, which seems messier.
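If maintaining the wrapper file by hand is a concern, another possibility is to follow the xs:include elements directly. The following is only a minimal sketch under some assumptions: load_with_includes is a hypothetical helper (not part of lxml), schemaLocation is taken to be a local path relative to the including file, and xs:import and xs:redefine are not handled. The file names main.xsd and ext.xsd are made up for the demonstration.

```python
import os
import tempfile
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"


def load_with_includes(xsd_path, seen=None):
    """Parse xsd_path and splice in the children of each xs:include target."""
    seen = set() if seen is None else seen
    xsd_path = os.path.abspath(xsd_path)
    seen.add(xsd_path)
    root = etree.parse(xsd_path).getroot()
    for inc in list(root.iter(f"{{{XS}}}include")):
        # Assumption: schemaLocation is a path relative to the including file.
        target = os.path.abspath(
            os.path.join(os.path.dirname(xsd_path), inc.get("schemaLocation")))
        if target not in seen:  # skip files already merged (guards against cycles)
            included_root = load_with_includes(target, seen)
            for child in list(included_root):
                inc.addprevious(child)  # move the declaration in front of xs:include
        inc.getparent().remove(inc)
    return root


# Demonstration with two throwaway schema files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "main.xsd"), "w") as f:
        f.write(f'<xs:schema xmlns:xs="{XS}">'
                '<xs:include schemaLocation="ext.xsd"/>'
                '<xs:element name="Study" type="xs:string"/></xs:schema>')
    with open(os.path.join(tmp, "ext.xsd"), "w") as f:
        f.write(f'<xs:schema xmlns:xs="{XS}">'
                '<xs:element name="Extension" type="xs:string"/></xs:schema>')
    merged = load_with_includes(os.path.join(tmp, "main.xsd"))

names = [el.get("name") for el in merged.iter(f"{{{XS}}}element")]
remaining = merged.findall(f"{{{XS}}}include")
print(names)
```

The result is a single in-memory tree that can be queried with the usual lxml calls, and the XSD files on disk are left untouched.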

answered Nov 30 '25 00:11 by Oni


Try this:

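from lxml import etree

def validate_xml(schema_file, xml_file):
    # Parsing from a file path lets xs:include resolve relative to the schema.
    xsd_doc = etree.parse(schema_file)
    xsd = etree.XMLSchema(xsd_doc)
    xml = etree.parse(xml_file)
    # True if the document is valid against the schema, otherwise False.
    return xsd.validate(xml)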
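As a rough check of how this behaves with xs:include, here is a small sketch that repeats the function so it runs on its own. All file names (main.xsd, ext.xsd, good.xml, bad.xml) and element names are made up for the example. Note that this validates instance documents; it does not expose the included declarations as XML.

```python
import os
import tempfile
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"


def validate_xml(schema_file, xml_file):
    xsd_doc = etree.parse(schema_file)
    xsd = etree.XMLSchema(xsd_doc)
    xml = etree.parse(xml_file)
    return xsd.validate(xml)


# Hypothetical files: main.xsd declares nothing itself and only includes ext.xsd.
files = {
    "main.xsd": f'<xs:schema xmlns:xs="{XS}">'
                '<xs:include schemaLocation="ext.xsd"/></xs:schema>',
    "ext.xsd": f'<xs:schema xmlns:xs="{XS}">'
               '<xs:element name="Extension" type="xs:string"/></xs:schema>',
    "good.xml": "<Extension>hello</Extension>",
    "bad.xml": "<Other>hello</Other>",
}

with tempfile.TemporaryDirectory() as tmp:
    for name, content in files.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(content)
    schema_path = os.path.join(tmp, "main.xsd")
    # The Extension element is only declared in the included file.
    good = validate_xml(schema_path, os.path.join(tmp, "good.xml"))
    bad = validate_xml(schema_path, os.path.join(tmp, "bad.xml"))

print(good, bad)
```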
answered Nov 30 '25 02:11 by Spithas