
Validate with three xml schemas as one combined schema in lxml?

Tags: python, xml, lxml, xsd

I am generating an XML document for which different XSDs have been provided for different parts (which is to say, definitions for some elements are in certain files, definitions for others are in others).

The XSD files do not refer to each other. The schemas are:

  1. http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd
  2. http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/FormSubmission-v1-1.xsd
  3. http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/CompanyIncorporation-v1-2.xsd

Is there a way to validate the document against all of the schemas using lxml?

The solution here is not simply to validate individually against each schema, because the problem I am having is that validation fails because of elements not specified in the XSD. For example, when validating against http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd, I get the error:

  File "lxml.etree.pyx", line 3006, in lxml.etree._Validator.assertValid (src/lxml/lxml.etree.c:125415)
DocumentInvalid: Element '{http://xmlgw.companieshouse.gov.uk}CompanyIncorporation': No matching global element declaration available, but demanded by the strict wildcard., line 9

This happens because the document in question contains a {http://xmlgw.companieshouse.gov.uk}CompanyIncorporation element, which is declared not in the XSD being validated against but in one of the other XSD files.

Asked by Marcin, Mar 01 '12 20:03


1 Answer

I believe you should only be validating against Egov_ch-v2-0.xsd, which appears to define an envelope document. (This is the document you are creating, right? You haven't shown your XML.)

This schema uses <xs:any namespace="##any" minOccurs="0"/> to define body contents of the envelope. However, xsd:any does not mean "ignore all contents." Rather it means "accept anything here." Whether to validate or ignore the contents is controlled by the processContents attribute, which defaults to strict. This means that any elements discovered here must validate against types available to the schema. However, Egov_ch-v2-0.xsd does not import CompanyIncorporation-v1-2.xsd, so it doesn't know about the CompanyIncorporation element, so the document does not validate.
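To make the effect of processContents concrete, here is a minimal sketch using a toy envelope schema. The urn:example:* namespaces, the Envelope and Body element names, and the is_valid helper are all made up for illustration; they are not taken from the Companies House schemas.

```python
from lxml import etree

# Toy envelope schema (hypothetical): one wildcard slot whose
# processContents mode is filled in per test run.
SCHEMA_TEMPLATE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:envelope"
           elementFormDefault="qualified">
  <xs:element name="Envelope">
    <xs:complexType>
      <xs:sequence>
        <xs:any namespace="##any" minOccurs="0" processContents="%s"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# The body element lives in a namespace the schema knows nothing about.
DOCUMENT = etree.fromstring(
    '<e:Envelope xmlns:e="urn:example:envelope">'
    '<b:Body xmlns:b="urn:example:body"/>'
    '</e:Envelope>'
)


def is_valid(process_contents):
    """Validate DOCUMENT against the toy schema using the given wildcard mode."""
    schema = etree.XMLSchema(etree.fromstring(SCHEMA_TEMPLATE % process_contents))
    return schema.validate(DOCUMENT)


results = {mode: is_valid(mode) for mode in ("strict", "lax", "skip")}
```

In this sketch only the strict mode should reject the document, since it is the only mode that demands a declaration for Body; lax validates what it can find and skip checks nothing beyond well-formedness.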

You need to add xsd:import elements to your main schema (Egov_ch-v2-0.xsd) to import all other schemas that may be used in the document. You can either do this in the xsd file itself, or you can add the elements programmatically after parsing:

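import lxml.etree

# Parse the envelope schema; the validator is built from this document only.
xsd = lxml.etree.parse('http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd')

# Build an xsd:import pointing at the schema that declares CompanyIncorporation.
newimport = lxml.etree.Element('{http://www.w3.org/2001/XMLSchema}import',
    namespace="http://xmlgw.companieshouse.gov.uk",
    schemaLocation="http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/CompanyIncorporation-v1-2.xsd")
# xsd:import must precede the schema's own declarations, so insert it first.
xsd.getroot().insert(0, newimport)

validator = lxml.etree.XMLSchema(xsd)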

You can even do this in a generic way with a function that takes a list of schema paths and returns a list of xsd:import elements, with schemaLocation set to each path and namespace set to the targetNamespace read from that schema.
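A minimal sketch of such a function follows. The names build_imports and combined_schema are hypothetical, and the sketch assumes each extra schema declares a targetNamespace of its own that differs from the main schema's (a schema sharing the main schema's namespace would need xsd:include instead).

```python
from lxml import etree

XS_NS = "http://www.w3.org/2001/XMLSchema"


def build_imports(schema_paths):
    """Return one xs:import element per schema path (hypothetical helper)."""
    imports = []
    for path in schema_paths:
        root = etree.parse(path).getroot()
        attrs = {"schemaLocation": path}
        # A schema without targetNamespace is imported without a namespace attribute.
        namespace = root.get("targetNamespace")
        if namespace is not None:
            attrs["namespace"] = namespace
        imports.append(etree.Element("{%s}import" % XS_NS, attrs))
    return imports


def combined_schema(main_path, extra_paths):
    """Parse the main schema, add imports for the extras, return a validator."""
    main = etree.parse(main_path)
    root = main.getroot()
    # Imports go at the top, ahead of the schema's own declarations.
    for position, node in enumerate(build_imports(extra_paths)):
        root.insert(position, node)
    return etree.XMLSchema(main)
```

With local copies of the three files, a call along the lines of combined_schema('Egov_ch-v2-0.xsd', ['FormSubmission-v1-1.xsd', 'CompanyIncorporation-v1-2.xsd']).assertValid(doc) would then validate the whole envelope in one pass.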

(As an aside, you should probably download these schema documents and reference them with filesystem paths rather than load them over the network.)
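One way to get local copies is to download each schema once and reuse the file afterwards. This is only a sketch: cache_schema is a hypothetical helper, and it assumes the last segment of the URL is a usable, unique file name.

```python
import os
import shutil
import urllib.request


def cache_schema(url, directory):
    """Download url into directory once and return the local file path."""
    os.makedirs(directory, exist_ok=True)
    # Assumption: the last URL segment is a suitable file name.
    local_path = os.path.join(directory, url.rsplit("/", 1)[-1])
    if not os.path.exists(local_path):
        with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return local_path
```

The returned path can then be passed to lxml.etree.parse, or used as a schemaLocation, in place of the original URL.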

Answered by Francis Avila, Oct 23 '22 03:10