Inconsistent XSD validation of nested elements using `<xs:any>`

I'm working on a tool that helps users author XHTML-like documents, similar in nature to JSP files. The documents are XML and can contain any well-formed tags in the XHTML namespace, interleaved with elements from my product's namespace. Among other things, the tool validates the input using XSD.

Example input:

<?xml version="1.0"?>
<markup>
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
    <c:section>
      <c:paragraph>
        <span>This is a test!</span>
        <a href="http://www.google.com/">click here for more!</a>
      </c:paragraph>
    </c:section>
  </html>
</markup>

My problem is that the XSD validation behaves inconsistently depending on how deeply I nest elements. What I want is for all elements in the https://my_tag_lib.example.com/ namespace to be checked against the schema, while any element in the http://www.w3.org/1999/xhtml namespace is liberally tolerated. I would rather not list every permitted HTML element in my XSD, since users may want to use obscure elements that are only available in certain browsers. Instead, I'd like to whitelist any element belonging to that namespace using <xs:any>.

What I'm discovering is that, under some circumstances, elements which belong to the my_tag_lib namespace but don't appear in the schema pass validation, while elements which do appear in the schema can be made to fail by giving them invalid attributes.

So it looks as if elements declared in the schema are validated against it, while undeclared elements are simply skipped by the validator?

For example, this passes validation:

<?xml version="1.0"?>
<markup>
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
    <c:section>
      <div>
        <c:my-invalid-element>This is a test</c:my-invalid-element>
      </div>
    </c:section>
  </html>
</markup>

But then this fails validation:

<?xml version="1.0"?>
<markup>
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
    <c:section>
      <div>
        <c:paragraph my-invalid-attr="true">This is a test</c:paragraph>
      </div>
    </c:section>
  </html>
</markup>

Why are attributes validated against the schema for recognized elements, while unrecognized elements seemingly aren't validated at all? What's the logic here? I've been using xmllint to do the validation:

xmllint --schema markup.xsd example.xml

Here are my XSD files:

File: markup.xsd

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xs:import namespace="http://www.w3.org/1999/xhtml" schemaLocation="html.xsd" />
  <xs:element name="markup">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="xhtml:html" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

File: html.xsd

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.w3.org/1999/xhtml">
  <xs:import namespace="https://my_tag_lib.example.com/" schemaLocation="my_tag_lib.xsd" />
  <xs:element name="html">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:any processContents="lax" namespace="http://www.w3.org/1999/xhtml" />
        <xs:any processContents="strict" namespace="https://my_tag_lib.example.com/" />
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>

File: my_tag_lib.xsd

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="https://my_tag_lib.example.com/">
  <xs:element name="section">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:any processContents="lax" namespace="http://www.w3.org/1999/xhtml" />
        <xs:any processContents="strict" namespace="https://my_tag_lib.example.com/" />
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:element name="paragraph">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:any processContents="lax" namespace="http://www.w3.org/1999/xhtml" />
        <xs:any processContents="strict" namespace="https://my_tag_lib.example.com/" />
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>
Richard JP Le Guen asked Apr 01 '14 01:04




1 Answer

What you're missing is an understanding of the context-determined declaration.

First, have a look at this little experiment.

<?xml version="1.0"?>
<markup>
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
        <c:section>
            <div>
                <html>
                    <c:my-invalid-element>This is a test</c:my-invalid-element>
                </html>
            </div>
        </c:section>
    </html>
</markup>

This is the same as your valid example, except that I've changed the context in which c:my-invalid-element is assessed from "lax" to "strict". I did this by inserting an html element: html has a declaration, so its content is validated against that declaration, which applies the strict wildcard to elements in your tag namespace. As you can easily confirm, the above is invalid.
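The experiment works only because html has a global declaration in html.xsd. As a rough sketch of the same mechanism applied on purpose, html.xsd could declare selected XHTML containers with the same content model. The type name `mixedContainer`, the choice of `div`, and the `xs:anyAttribute` line are assumptions made for illustration; none of them appear in the original schemas.

```xml
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xhtml="http://www.w3.org/1999/xhtml"
           targetNamespace="http://www.w3.org/1999/xhtml">
  <xs:import namespace="https://my_tag_lib.example.com/" schemaLocation="my_tag_lib.xsd" />

  <!-- Hypothetical shared type: same wildcards as the original html element -->
  <xs:complexType name="mixedContainer" mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:any processContents="lax" namespace="http://www.w3.org/1999/xhtml" />
      <xs:any processContents="strict" namespace="https://my_tag_lib.example.com/" />
    </xs:choice>
    <!-- Assumption: tolerate arbitrary attributes such as class or id -->
    <xs:anyAttribute processContents="lax" />
  </xs:complexType>

  <xs:element name="html" type="xhtml:mixedContainer" />
  <!-- Because div now has a declaration, its tag-library children are matched by the strict wildcard -->
  <xs:element name="div" type="xhtml:mixedContainer" />
</xs:schema>
```

With a schema along these lines, the first example in the question should be rejected, because c:my-invalid-element inside div is matched by the strict wildcard. Only XHTML elements declared this way carry the strictness; any undeclared XHTML element still has its children assessed laxly.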

This tells you (without reading the documentation) that in your examples, the determined context must have been "lax" as opposed to your expectation, which is "strict".

Why is the context lax? div is processed laxly (it matches the wildcard, but no declaration exists for it), hence its children are assessed laxly as well. Now apply what lax means. In the first case, no declaration for c:my-invalid-element is found, so the instruction is "don't worry if you can't validate it", and all is good. In the failing sample, a declaration for c:paragraph is found, hence the element must be valid with respect to that declaration, and it is not, because of the unexpected attribute.
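To make this concrete, here is an annotated trace that combines both of the question's cases in one instance document. The comments describe how each element would be assessed against the original three schemas. The document is meant as an illustration and is expected to fail validation because of the c:paragraph attribute.

```xml
<?xml version="1.0"?>
<markup>
  <!-- html: declared in html.xsd, so it is validated against that declaration -->
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
    <!-- c:section: matches the strict wildcard of html; a declaration exists, so it is validated -->
    <c:section>
      <!-- div: matches the lax wildcard of c:section; no declaration exists, so its children are assessed laxly -->
      <div>
        <!-- lax assessment, no declaration found: accepted without further checks -->
        <c:my-invalid-element>This is a test</c:my-invalid-element>
        <!-- lax assessment, declaration found: validated, so the undeclared attribute is an error -->
        <c:paragraph my-invalid-attr="true">This is a test</c:paragraph>
      </div>
    </c:section>
  </html>
</markup>
```

The namespace of the element does not decide the outcome here. What matters is whether the nearest enclosing element had a declaration whose content model could supply the strict wildcard.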

Petru Gardea answered Sep 28 '22 00:09