
Find elements based on xsd type with lxml

Tags: python, xml, lxml, xsd

I am trying to get a list of elements with a specific xsd type with lxml 2.x and I can't figure out how to traverse the xsd for specific types.

Example of schema:

<xsd:element name="ServerOwner" type="srvrs:string90" minOccurs="0"/>
<xsd:element name="HostName" type="srvrs:string35" minOccurs="0"/>

Example xml data:

<srvrs:ServerOwner>John Doe</srvrs:ServerOwner>
<srvrs:HostName>box01.example.com</srvrs:HostName>

The ideal function would look like:

    elements = getElems(xml_doc, 'string90')

    def getElems(xml_doc, xsd_type):
        # XPath or something similar to find the elements and build a dict
        elements = {}
        return elements

joet3ch asked Mar 30 '10 at 02:03


1 Answer

Really, the only special support lxml has for XML Schema is validation: it can tell you whether a document is valid according to a schema or not. Anything more sophisticated you'll have to do yourself.

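This should be a relatively simple two-phase process, I'd think. First, get all the xsd:element elements in the schema whose type matches the one you care about, and look at their names. Note that the comparison is against the literal attribute value, so with the schema above typeName has to include the prefix ('srvrs:string90', not 'string90'):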

def getElems(schemaDoc, xmlDoc, typeName):
    names = schemaDoc.xpath("//xsd:element[@type = $n]/@name",
                            namespaces={"xsd": 
                                        "http://www.w3.org/2001/XMLSchema"},
                            n=typeName)

Then, fetch all the elements with each name from the document.

    elements = []
    for name in names: 
        namedElements = xmlDoc.xpath("//*[local-name() = $name]", name=name)
        elements.extend(namedElements)

Now you have a list of elements with the names that matched the type in the schema.

    return elements
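
Put together, the two phases might look like the following self-contained sketch. It is only an illustration under some assumptions: the namespace URI `http://example.com/srvrs`, the wrapping `srvrs:Server` element, and the snake_case name `get_elems` are made up for the example and do not come from the question. The schema is parsed as an ordinary XML document here, not compiled for validation.

```python
from lxml import etree

XSD_NS = "http://www.w3.org/2001/XMLSchema"
SRVRS_NS = "http://example.com/srvrs"  # hypothetical namespace URI

# Schema fragment from the question, wrapped in a minimal xsd:schema root.
SCHEMA = f"""
<xsd:schema xmlns:xsd="{XSD_NS}" xmlns:srvrs="{SRVRS_NS}">
  <xsd:element name="ServerOwner" type="srvrs:string90" minOccurs="0"/>
  <xsd:element name="HostName" type="srvrs:string35" minOccurs="0"/>
</xsd:schema>
"""

# Data from the question; the srvrs:Server wrapper is an assumption.
DOCUMENT = f"""
<srvrs:Server xmlns:srvrs="{SRVRS_NS}">
  <srvrs:ServerOwner>John Doe</srvrs:ServerOwner>
  <srvrs:HostName>box01.example.com</srvrs:HostName>
</srvrs:Server>
"""


def get_elems(schema_doc, xml_doc, type_name):
    # Phase 1: names of all element declarations with the given type.
    names = schema_doc.xpath(
        "//xsd:element[@type = $n]/@name",
        namespaces={"xsd": XSD_NS},
        n=type_name,
    )
    # Phase 2: every element in the document with one of those local names.
    elements = []
    for name in names:
        elements.extend(xml_doc.xpath("//*[local-name() = $name]", name=name))
    return elements


schema_doc = etree.fromstring(SCHEMA)
xml_doc = etree.fromstring(DOCUMENT)

found = get_elems(schema_doc, xml_doc, "srvrs:string90")
# Build the dict the question asked for: local element name -> text.
result = {etree.QName(el).localname: el.text for el in found}
print(result)
```

Matching on `local-name()` ignores the namespace of the document elements, so two elements with the same local name in different namespaces would both be returned.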

Note that the xpath expression for searching the document has to look at every element, so if you can tighten that up to only look in the subsection of the document you care about it'll go faster.
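
One way to narrow the search, as a rough sketch: call xpath() on the subtree you care about and use a relative expression (starting with `.//`), since an expression starting with `//` searches from the document root even when called on a nested element. The `inventory`, `archive`, and `active` element names below are invented for the example.

```python
from lxml import etree

# Hypothetical document with two sections; only "active" is of interest.
doc = etree.fromstring("""
<inventory>
  <archive><HostName>old01.example.com</HostName></archive>
  <active><HostName>box01.example.com</HostName></active>
</inventory>
""")

# Absolute expression: visits every element in the whole document.
everywhere = doc.xpath("//*[local-name() = $name]", name="HostName")

# Relative expression evaluated on the subtree: only descendants of <active>.
active = doc.find("active")
hits = active.xpath(".//*[local-name() = $name]", name="HostName")

print(len(everywhere), [el.text for el in hits])
```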

Allen answered Oct 01 '22 at 09:10