lxml: Get all leaf nodes?

Tags: python, xml, lxml

Given an XML file, is there a way to use lxml to get all the leaf nodes, along with their names and attributes?

Here is the XML file of interest:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study>
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
    <nct_alias>NCT00222157</nct_alias>
  </id_info>
  <brief_title>Developmental Effects of Infant Formula Supplemented With LCPUFA</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>Mead Johnson Nutrition</agency>
      <agency_class>Industry</agency_class>
    </lead_sponsor>
  </sponsors>
  <source>Mead Johnson Nutrition</source>
  <oversight_info>
    <authority>United States: Institutional Review Board</authority>
  </oversight_info>
  <brief_summary>
    <textblock>
      The purpose of this study is to compare the effects on visual development, growth, cognitive
      development, tolerance, and blood chemistry parameters in term infants fed one of four study
      formulas containing various levels of DHA and ARA.
    </textblock>
  </brief_summary>
  <overall_status>Completed</overall_status>
  <phase>N/A</phase>
  <study_type>Interventional</study_type>
  <study_design>N/A</study_design>
  <primary_outcome>
    <measure>visual development</measure>
  </primary_outcome>
  <secondary_outcome>
    <measure>Cognitive development</measure>
  </secondary_outcome>
  <number_of_arms>4</number_of_arms>
  <condition>Cognitive Development</condition>
  <condition>Growth</condition>
  <arm_group>
    <arm_group_label>1</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>2</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>3</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>4</arm_group_label>
    <arm_group_type>Other</arm_group_type>
    <description>Control</description>
  </arm_group>
  <intervention>
    <intervention_type>Other</intervention_type>
    <intervention_name>DHA and ARA</intervention_name>
    <description>various levels of DHA and ARA</description>
    <arm_group_label>1</arm_group_label>
    <arm_group_label>2</arm_group_label>
    <arm_group_label>3</arm_group_label>
  </intervention>
  <intervention>
    <intervention_type>Other</intervention_type>
    <intervention_name>Control</intervention_name>
    <arm_group_label>4</arm_group_label>
  </intervention>
</clinical_study>

What I would like is a dictionary that looks like this:

{
   'id_info_org_study_id': '3370-2(-4)', 
   'id_info_nct_id': 'NCT00753818', 
   'id_info_nct_alias': 'NCT00222157', 
   'brief_title': 'Developmental Effects...'
}

Is this possible with lxml - or indeed any other Python library?

UPDATE:

I ended up doing it this way:

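# Both snippets live inside a class; they need `import requests` and `import lxml.etree`.
response = requests.get(url)
tree = lxml.etree.fromstring(response.content)
mydict = self._recurse_over_nodes(tree, None, {})

def _recurse_over_nodes(self, tree, parent_key, data):
    for branch in tree:
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(branch.tag, str):
            continue
        # Prefix the tag with its ancestors' tags, e.g. 'id_info_nct_id'.
        key = '%s_%s' % (parent_key, branch.tag) if parent_key else branch.tag
        if len(branch):
            # The element has children, so descend into it.
            data = self._recurse_over_nodes(branch, key, data)
        elif key in data:
            # Repeated leaf: append the value, separated by a comma.
            data[key] = data[key] + ', %s' % branch.text
        else:
            data[key] = branch.text
    return data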

asked Dec 05 '22 03:12 by Richard


2 Answers

Use the iter method.

http://lxml.de/api/lxml.etree._Element-class.html#iter

Here is a functioning example.

#!/usr/bin/python
from lxml import etree

xml='''
<book>
    <chapter id="113">

        <sentence id="1" drums='Neil'>
            <word id="128160" bass='Geddy'>
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPV"/>
                <Number type="S"/>
            </word>
            <word id="128161">
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPF"/>
            </word>
        </sentence>

        <sentence id="2">
            <word id="128162">
                <POS Tag="P"/>
                <grammar type="PREFIX"/>
                <Tag Tag="bi+"/>
            </word>
        </sentence>

    </chapter>
</book>
'''

if __name__ == '__main__':
    root = etree.fromstring(xml)

    # iter('*') visits every element in the document, in document order
    for node in root.iter('*'):

        # an element with no child elements (length zero) is a leaf
        if len(node) == 0:
            print(node)

Here is the output:

$ ./verifyXmlWithDirs.py
<Element POS at 0x176dcf8>
<Element grammar at 0x176da70>
<Element Aspect at 0x176dc20>
<Element Number at 0x176dcf8>
<Element POS at 0x176dc20>
<Element grammar at 0x176dcf8>
<Element Aspect at 0x176da70>
<Element POS at 0x176da70>
<Element grammar at 0x176dc20>
<Element Tag at 0x176dcf8>
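
The loop above prints element objects, while the question asks for names and attributes. A minimal sketch of that extra step, assuming each leaf's tag and attribute mapping is all that is needed, is shown here. The names leaf_summary and SAMPLE are made up for illustration, the sample is a shortened copy of the XML above, and the fallback to the standard library's ElementTree is an assumption added so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is an acceptable stand-in here.
    import xml.etree.ElementTree as etree

# Shortened copy of the example document (hypothetical sample data).
SAMPLE = """
<book>
  <chapter id="113">
    <sentence id="1">
      <word id="128160">
        <POS Tag="V"/>
        <grammar type="STEM"/>
      </word>
    </sentence>
  </chapter>
</book>
"""


def leaf_summary(root):
    """Return (tag, attributes) pairs for every element without child elements."""
    leaves = []
    for node in root.iter():
        # lxml also yields comments and processing instructions from iter();
        # their tag is not a string, so skip them.
        if not isinstance(node.tag, str):
            continue
        if len(node) == 0:
            # node.attrib behaves like a dict; copy it into a plain dict.
            leaves.append((node.tag, dict(node.attrib)))
    return leaves


root = etree.fromstring(SAMPLE)
leaves = leaf_summary(root)
for tag, attrs in leaves:
    print(tag, attrs)
```

For this sample the function returns the POS and grammar elements together with their Tag and type attributes; the leaf's text, if any, is available as node.text.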

answered Dec 14 '22 12:12 by shrewmouse


Assuming you have already called getroot(), something simple like the code below builds the dictionary you expect:

import lxml.etree

tree = lxml.etree.parse('sample_ctgov.xml')
root = tree.getroot()

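d = {}
for node in root:
    # skip comments and processing instructions, whose tag is not a string
    if not isinstance(node.tag, str):
        continue
    if len(node):
        # one level of nesting: prefix each child's tag with the parent's tag
        for child in node:
            d[node.tag + '_' + child.tag] = child.text
    else:
        d[node.tag] = node.text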

This should do the trick for the sample file. It is not optimised and it handles only one level of nesting rather than recursing through all descendants, but it shows where to start.
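
For deeper nesting, the same idea can be made recursive. The following is a rough sketch rather than a definitive implementation: flatten_leaves and SAMPLE are hypothetical names, the sample is a trimmed copy of the clinical study file, repeated leaves are joined with a comma (one possible choice among several), and the ElementTree fallback is an assumption so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is an acceptable stand-in here.
    import xml.etree.ElementTree as etree

# Trimmed copy of the clinical study document (hypothetical sample data).
SAMPLE = """
<clinical_study>
  <!-- comments must not end up in the result -->
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
  </id_info>
  <sponsors>
    <lead_sponsor>
      <agency>Mead Johnson Nutrition</agency>
    </lead_sponsor>
  </sponsors>
  <condition>Cognitive Development</condition>
  <condition>Growth</condition>
</clinical_study>
"""


def flatten_leaves(element, prefix="", result=None):
    """Map 'ancestor_..._leaf' keys to the text of each leaf below element."""
    if result is None:
        result = {}
    for child in element:
        # Skip comments and processing instructions (non-string tag in lxml).
        if not isinstance(child.tag, str):
            continue
        key = prefix + "_" + child.tag if prefix else child.tag
        if len(child):
            # The child has children of its own: recurse with the longer prefix.
            flatten_leaves(child, key, result)
        else:
            text = (child.text or "").strip()
            # Repeated leaves share a key; join their values with a comma.
            result[key] = result[key] + ", " + text if key in result else text
    return result


flat = flatten_leaves(etree.fromstring(SAMPLE))
for key, value in flat.items():
    print(key, "=", value)
```

The root tag is left out of the keys, matching the dictionary shown in the question. If repeated elements such as condition should stay separate, storing a list per key instead of a joined string would be a reasonable variation.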

answered Dec 14 '22 11:12 by Anzel