Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to verify that one XSD schema is a subset of another XSD schema?

How can I verify that one XSD schema is a subset of another XSD schema?

We are creating a system-of-systems application using a collection of "blueprint" XSD schemas (which defines all possible inputs or outputs available to a subcomponent). Many subcomponents are being implemented, and these subcomponents pass data among themselves using XML files. Each subcomponent creates a subset of the relevant blueprint XSD schema (to indicate which of the possible inputs or output it has chosen to implement). Any XML datafile that validates against a subset XSD schema must also validate against the blueprint XSD schema, but the reverse is not true (as the subset XSD schema may not contain all "optional" or "choice" XML elements from the blueprint XSD schema, and it may choose to further restrict allowed data values on an existing XML tag). The system will validate all XML inputs to a subcomponent against that subcomponent's subset XSD schema (to flag any bad inputs and isolate the source of data-related problems).

During testing, we intend to verify that each subcomponent's subset XSD schema is truly a subset of the associated blueprint XSD schema, but we have no automated means of performing this verification. These XSD schemas are rather large and ugly to need to do this testing by hand. It would be nice to have a kind of "validate XSD file 1against XSD file 2" command, similar to how Java can perform a validation of an XML file against an XSD schema. We want to confirm that each subcomponent's subset XSD schema will not allow any combinations of XML input/output that would violate the blueprint XSD schema. With this schema-to-schema capability, it would also be very helpful to verify if the output XML from subcomponent A would be appropriate to be used as input to subcomponent B (we can easily validate a single output XML against a XSD schema, but we want to confirm that all possible XML outputs from subcomponent A will validate against subcomponent B's XSD schema).

Helpful information: This application is a collection of Java 6 applications implemented as OSGi bundles and compiled/executed using Maven 2.2.1. There are no requirements for using any specific development IDE. The system is being tested upon a Microsoft Windows XP environment, but there are plans to execute this system upon other environments as well (so a cross-platform solution would be preferred).

like image 313
navySV Avatar asked Jan 11 '13 20:01

navySV


People also ask

Which element is used to indicate that elements defined in the XSD?

The sequence element specifies that the child elements must appear in a sequence. Each child element can occur from 0 to any number of times.

What is minInclusive in XSD?

minExclusive. Specifies the lower bounds for numeric values (the value must be greater than this value) minInclusive. Specifies the lower bounds for numeric values (the value must be greater than or equal to this value) minLength.


1 Answers

The simplest way to ensure the relationship you want is to derive the types of the subset schemas by restriction from the types of the blueprint schema. It sounds as if that boat has already sailed, though.

Like others here, I am not aware of any tools that do this out of the box (although if Petru Gardea says QT Assistant can, it's worth following up).

One complication is that there are two different ways to view the subset/superset relation you want to verify: (1) every document (or element) accepted as valid by schema 1 is also accepted as valid by schema 2 (without reference to the type assignments made), or (2) the typed documents produced by validation (in what the spec calls the post-schema-validation infoset) against schemas 1 and 2 stand in an appropriate relation to each other: if an element or attribute is valid in tree 1, it's valid in tree 2; the type assigned to it in tree 1 is a restriction of the type assigned to it in tree 2; etc. If schemas 1 and 2 were developed independently, the chances that their types are related by derivation are poor, so I guess you have the first approach to the question in mind.

The problem, though, is definitely decidable, in either form. For any schema (I'm using the term carefully) there are by definition a finite number of types and a finite number of element names declared; it follows that there is a finite number (possibly large) of element name / type pairs.

The algorithm can go something like this.

  1. Start with the expected root element. (If there are multiple possible root elements, then in the general case you'll need to run this check for each of them.) If the expected root element is E, with type T1 in schema 1 and type T2 in schema 2, then place the task "Compare type T1 and T2" in a queue of open tasks. The list of tasks already completed will be empty.

  2. To compare two complex types T1 and T2:

    • Check the sets of attributes declared for T1 and T2 for a subset/superset relation between their names. Make sure no attribute required in the intended superset is absent or optional in the intended subset.

    • Each attribute A declared for both T1 and T2 will be assigned a type (call them ST1 and ST2). If ST1 = ST2, do nothing; otherwise, add the task "Compare simple types ST1 and ST2" to the queue of open tasks, unless it's on the list of comparisons already completed.

    • Now check the sequences of children that are possible in T1 and T2 -- as 13ren suggests in a comment, this is tractable since content models are essentially regular expressions which use the set of element names as their alphabet; the languages they define are therefore regular, and the subset/superset relation is decidable for regular languages.

    • Each possible child element C is assigned both an element declaration and a type definition by the parent types T1 and T2. Let us call them ED1, ED2, CT1, and CT2. Every child of the same name will have the same type, but different children may match different element declarations. So for any possible name, there will be just one pair of types CT1 and CT2, but there may be multiple pairs ED1 and ED2 (and the analysis will need to be careful to make sure they are matched up correctly; that might be hard to automate).

    • If CT1 = CT2, do nothing, otherwise put "Compare types CT1 and CT2" onto the open task queue, unless the comparison has already been performed.

    • If ED1 and ED2 are structurally identical, do nothing; otherwise put the task of comparing them into the task queue (unless it's already been done).

  3. To compare two simple types ST1 and ST2, compare either their lexical spaces (if you want the first definition of the subset/superset relation on schemas) or their value spaces (if you want the second). If ST1 and ST2 are both restrictions of the same primitive type, you may be able to compare the set of effective facet-based restrictions on them easily. The pattern facet may complicate matters, but because it defines a set of regular expressions, the subset/superset relation is decidable for it.

  4. To compare two element declarations, you need to compare each of the properties for the element declaration and check for the desired subset/superset relation.

As you can see, it's complex and tedious enough that you really want to automate this analysis, and it's also complex enough that it's easy to see why it's not widely offered as out-of-the-box function. But it would certainly be interesting to code.

like image 88
C. M. Sperberg-McQueen Avatar answered Sep 17 '22 14:09

C. M. Sperberg-McQueen