 

Validating XML with XSDs ... but still allow extensibility

Maybe it's me, but it appears that if you have an XSD

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="User">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="GivenName" />
                <xs:element name="SurName" />
            </xs:sequence>
            <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
        </xs:complexType>
    </xs:element>
</xs:schema>

that defines the schema for this document

<?xml version="1.0" encoding="utf-8" ?>
<User ID="1">
    <GivenName></GivenName>
    <SurName></SurName>
</User>

It would fail to validate if you added another element, say EmailAddress, and mixed up the order:

<?xml version="1.0" encoding="utf-8" ?>
<User ID="1">
    <SurName></SurName>
    <EmailAddress></EmailAddress>
    <GivenName></GivenName>
</User>

I don't want to add EmailAddress to the document and have it be marked optional.

I just want an XSD that validates the bare minimum requirements that the document must meet.

Is there a way to do this?

EDIT:

marc_s pointed out below that you can use xs:any inside xs:sequence to allow more elements. Unfortunately, you then have to maintain the order of the elements.

Alternatively, I can use xs:all, which doesn't enforce the order of elements but, alas, doesn't allow me to place xs:any inside it.
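
A minimal sketch of the xs:all variant, reusing the element names from the schema above (only the User declaration is shown; it would sit inside the same xs:schema wrapper):

```xml
<xs:element name="User">
    <xs:complexType>
        <!-- xs:all: each child appears at most once, in any order -->
        <xs:all>
            <xs:element name="GivenName" />
            <xs:element name="SurName" />
        </xs:all>
        <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    </xs:complexType>
</xs:element>
```

This should accept GivenName and SurName in either order, but it still rejects EmailAddress, because XSD 1.0 does not permit an xs:any wildcard inside xs:all.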

CaffGeek asked Jul 27 '10 20:07



3 Answers

Your issue has a resolution, but it will not be pretty. Here's why:

Violation of deterministic content models

You've touched on the very soul of W3C XML Schema. What you are asking for (variable order and arbitrary unknown elements) violates the hardest, yet most basic, principle of XSD: the rule of non-ambiguity or, more formally, the Unique Particle Attribution constraint:

A content model must be formed such that during validation [..] each item in the sequence can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

In plain English: when an XML document is validated and the XSD processor encounters <SurName>, it must be able to validate it without first checking whether it is followed by <GivenName>, i.e., no looking ahead. In your scenario, this is not possible. This rule exists to allow implementations based on finite state machines, which should make implementations rather simple and fast.
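
To make the rule concrete, here is a minimal sketch of a content model that an XSD 1.0 processor should reject as ambiguous. It is an illustration written for this explanation, not a schema from the question:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="User">
        <xs:complexType>
            <xs:sequence>
                <!-- Optional wildcard that may match ANY element, including SurName -->
                <xs:any minOccurs="0" processContents="lax" namespace="##any" />
                <!-- A SurName in the input could match the wildcard above or this declaration -->
                <xs:element name="SurName" />
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

When the processor sees <SurName> as the first child, it cannot tell, without looking ahead, whether that element belongs to the wildcard or to the SurName declaration. That is exactly the situation the constraint forbids.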

This is one of the most debated issues. It is a heritage of SGML and DTDs (content models must be deterministic) and of XML itself, which defines by default that the order of elements is significant (so trying the opposite, making the order insignificant, is hard).

As marc_s already suggested, RELAX NG is an alternative that allows for non-deterministic content models. But what can you do if you're stuck with W3C XML Schema?

Non-working semi-valid solutions

You've already noticed that xs:all is very restrictive. The reason is simple: the same determinism rule applies, which is why xs:any, min/maxOccurs larger than one, and nested sequences are not allowed inside it.

Also, you may have tried all sorts of combinations of choice, sequence and any. The error that the Microsoft XSD processor throws when it encounters such an invalid content model is:

Error: Multiple definition of element 'http://example.com/Chad:SurName' causes the content model to become ambiguous. A content model must be formed such that during validation of an element information item sequence, the particle contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

In O'Reilly's XML Schema (yes, the book has its flaws) this is excellently explained. Fortunately, parts of the book are available online. I highly recommend reading section 7.4.1.3 on the Unique Particle Attribution rule; its explanations and examples are much clearer than I can ever make them.

One working solution

In most cases it is possible to go from a non-deterministic design to a deterministic one. The result usually isn't pretty, but it's a solution if you have to stick with W3C XML Schema and/or absolutely must allow loose rules for your XML. The nightmare in your situation is that you want to enforce one thing (two predefined elements) and at the same time keep everything else very loose (order doesn't matter, and anything can go before, between and after). If I skip the good advice and take you directly to a solution, it looks as follows:

<xs:element name="User">
    <xs:complexType>
        <xs:sequence>
            <xs:any minOccurs="0" processContents="lax" namespace="##other" />
            <xs:choice>
                <xs:sequence>                        
                    <xs:element name="GivenName" />
                    <xs:any minOccurs="0" processContents="lax" namespace="##other" />
                    <xs:element name="SurName" />
                </xs:sequence>
                <xs:sequence>
                    <xs:element name="SurName" />
                    <xs:any minOccurs="0" processContents="lax" namespace="##other" />
                    <xs:element name="GivenName" />
                </xs:sequence>
            </xs:choice>
            <xs:any minOccurs="0" processContents="lax" namespace="##any" />
        </xs:sequence>
        <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    </xs:complexType>
</xs:element>

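The code above actually just works, but there are a few caveats. The first is the use of xs:any with ##other as its namespace. You cannot use ##any, except for the last one, because that would allow elements like GivenName to match the wildcard, and then the definition of User becomes ambiguous. It also means that extension elements placed before or between the required elements must live in a namespace other than the schema's target namespace. Note as well that each xs:any, as written, admits at most one extra element, because maxOccurs defaults to 1; add maxOccurs="unbounded" to admit more.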
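
As an illustration, here is a sketch of a document that should validate against the User definition above. It assumes the surrounding xs:schema declares targetNamespace="http://example.com/Chad" (the namespace that appears in the error message) with elementFormDefault="qualified"; the ext namespace URI is made up for this example:

```xml
<?xml version="1.0" encoding="utf-8" ?>
<User ID="1"
      xmlns="http://example.com/Chad"
      xmlns:ext="http://example.com/ext">
    <SurName></SurName>
    <!-- Extension element: matched by the ##other wildcard, so it needs a foreign namespace -->
    <ext:EmailAddress></ext:EmailAddress>
    <GivenName></GivenName>
</User>
```

An unprefixed <EmailAddress> in the same position would fall into the target namespace and would not be matched by the ##other wildcard.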

The second caveat is that if you want to use this trick with more than two or three required elements, you'll have to write out all the orderings. A maintenance nightmare. That's why I came up with the following:

A suggested solution, a variant of a Variable Content Container

Change your definition. This has the advantage of being clearer to your readers or users, and it is easier to maintain. A whole string of solutions is explained on XFront (the same link, in less readable form, that you may have already seen in Oleg's post). It's an excellent read, but most of it does not take into account that you have a minimum requirement of two elements inside the variable content container.

The current best-practice approach for your situation (which happens more often than you may imagine) is to split your data between the required and non-required fields. You can add an element <Required>, or do the opposite, add an element <ExtendedInfo> (or call it Properties, or OptionalData). This looks as follows:

<xs:element name="User2">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="GivenName" />
            <xs:element name="SurName" />
            <xs:element name="ExtendedInfo" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax" namespace="##any" />
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

This may seem less than ideal at the moment, but let it grow a bit. Having an ordered set of fixed elements isn't that big a deal. You're not the only one who'll be complaining about this apparent deficiency of W3C XML Schema, but as I said earlier, if you have to use it, you'll have to live with its limitations, or accept the burden of developing around these limitations at a higher cost of ownership.
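
For illustration, a document along these lines should validate against the User2 definition above, assuming it sits in a schema without a targetNamespace like the one in the question. EmailAddress and Nickname are made-up extension elements:

```xml
<?xml version="1.0" encoding="utf-8" ?>
<User2>
    <!-- Required part: fixed order, fully validated -->
    <GivenName></GivenName>
    <SurName></SurName>
    <!-- Optional part: any elements, any order, laxly processed -->
    <ExtendedInfo>
        <EmailAddress></EmailAddress>
        <Nickname></Nickname>
    </ExtendedInfo>
</User2>
```

The required elements keep their fixed order, while everything inside ExtendedInfo is free-form.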

Alternative solution

I'm sure you know this already, but the order of attributes is by default undetermined. If all your content is of simple types, you can alternatively choose to make a more abundant use of attributes.
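
A rough sketch of that attribute-based variant follows. The element name User3 and the choice of xs:string are assumptions made for this example:

```xml
<xs:element name="User3">
    <xs:complexType>
        <!-- Required data as attributes; attribute order never matters -->
        <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
        <xs:attribute name="GivenName" type="xs:string" use="required" />
        <xs:attribute name="SurName" type="xs:string" use="required" />
        <!-- Any further attributes are accepted and only validated if a declaration is found -->
        <xs:anyAttribute processContents="lax" namespace="##any" />
    </xs:complexType>
</xs:element>
```

With this definition, an element such as <User3 ID="1" SurName="" EmailAddress="" GivenName="" /> should validate regardless of the order in which the attributes are written.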

A final word

Whatever approach you take, you will lose a lot of verifiability of your data. It's often better to allow content providers to add content types, but only when they can be verified. This you can do by switching from lax to strict processing and by making the types themselves stricter. But being too strict isn't good either; the right balance depends on your ability to judge the use cases you're up against and on weighing them against the trade-offs of each implementation strategy.
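
As a sketch of what switching to strict processing could look like, here is the ExtendedInfo element from the suggested solution with processContents changed. Restricting the wildcard to ##other is an additional assumption of this example, on the idea that each content provider publishes its extensions in its own namespace:

```xml
<xs:element name="ExtendedInfo" minOccurs="0">
    <xs:complexType>
        <xs:sequence>
            <!-- strict: every extension element must have a global declaration
                 in a schema that the validator has loaded -->
            <xs:any minOccurs="0" maxOccurs="unbounded" processContents="strict" namespace="##other" />
        </xs:sequence>
    </xs:complexType>
</xs:element>
```

With strict processing, an extension element for which the validator cannot find a declaration is a validation error rather than being silently accepted.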

Abel answered Oct 20 '22 10:10


After reading marc_s's answer and your discussion in the comments, I decided to add a little.

It seems to me there is no perfect solution to your problem, Chad. There are several approaches to implementing an extensible content model in XSD, but all the implementations I know of have some restrictions. Because you didn't describe the environment in which you plan to use the extensible XSD, I can only recommend some links that will probably help you choose an approach that can be implemented in your environment:

  1. http://www.xfront.com/ExtensibleContentModels.html (or http://www.xfront.com/ExtensibleContentModels.pdf) and http://www.xfront.com/VariableContentContainers.html
  2. http://www.xml.com/lpt/a/993 (or http://www.xml.com/pub/a/2002/07/03/schema_design.html)
  3. http://msdn.microsoft.com/en-us/library/ms950793.aspx

Oleg answered Oct 20 '22 10:10


You should be able to extend your schema with the <xs:any> element for extensibility - see W3Schools for details.

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="User">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="GivenName" />
                <xs:element name="SurName" />
                <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax" />
            </xs:sequence>
            <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
        </xs:complexType>
    </xs:element>
</xs:schema>

When you add processContents="lax", the .NET XML validation should succeed on it.

See MSDN docs on xs:any for more details.
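
For example, a document like the following should validate against the schema above, because the extra element comes after the two required ones:

```xml
<?xml version="1.0" encoding="utf-8" ?>
<User ID="1">
    <GivenName></GivenName>
    <SurName></SurName>
    <!-- Extra element: matched by the trailing xs:any wildcard -->
    <EmailAddress></EmailAddress>
</User>
```

The reordered document from the question (SurName, EmailAddress, GivenName) would still be rejected, since xs:sequence keeps the required elements in their declared order.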

Update: if you require more flexibility and less stringent validation, you might want to look at other methods of defining schemas for your XML - something like RelaxNG. XML Schema is - on purpose - rather strict about its rules, so maybe that's just the wrong tool for the job at hand.
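
As a rough sketch of how this might look in RELAX NG (XML syntax), the interleave pattern allows the required elements and any number of other elements in any order. This is an illustration only: it leaves ID untyped and lets extension elements contain text only, both simplifications chosen for this example:

```xml
<element name="User" xmlns="http://relaxng.org/ns/structure/1.0">
    <attribute name="ID"/>
    <interleave>
        <!-- Required elements, in any position -->
        <element name="GivenName"><text/></element>
        <element name="SurName"><text/></element>
        <!-- Any number of other elements, mixed freely with the required ones -->
        <zeroOrMore>
            <element>
                <anyName>
                    <except>
                        <name>GivenName</name>
                        <name>SurName</name>
                    </except>
                </anyName>
                <text/>
            </element>
        </zeroOrMore>
    </interleave>
</element>
```

The except clause keeps the wildcard from overlapping with GivenName and SurName, which RELAX NG requires for the branches of an interleave.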

marc_s answered Oct 20 '22 11:10