I have an XML file which looks like this: <pre class="prettyprint"><code><encspot> <file> <Name>some filename.mp3</Name> <Encoder>Gogo (after 3.0)</Encoder> <Bitrate>131</Bitrate> <Mode>joint stereo</Mode> <Length>00:02:43</Length> <Size>5,236,644</Size> <Frame>no</Frame> <Quality>good</Quality> <Freq.>44100</Freq.> <Frames>6255</Frames> ..... and so forth ...... </file> <file>....</file> </encspot> </code></pre> I want to read it into a python object, something like a list of dictionaries. Because the markup is absolutely fixed, I'm tempted to use regex (I'm quite good at using those). However, I thought I'll check if someone knows how to easily avoid regexes here. I don't have much experience with SAX or other parsing, though, but I'm willing to learn. I'm looking forward to be shown how this is done quickly without regexes in Python. Thanks for your help!

My beloved SD Chargers hat is off to you if you think a regex is easier than this: <pre class="prettyprint"><code>#!/usr/bin/env python import xml.etree.cElementTree as et sxml=""" <encspot> <file> <Name>some filename.mp3</Name> <Encoder>Gogo (after 3.0)</Encoder> <Bitrate>131</Bitrate> </file> <file> <Name>another filename.mp3</Name> <Encoder>iTunes</Encoder> <Bitrate>128</Bitrate> </file> </encspot> """ tree=et.fromstring(sxml) for el in tree.findall('file'): print '-------------------' for ch in el.getchildren(): print '{:>15}: {:<30}'.format(ch.tag, ch.text) print "\nan alternate way:" el=tree.find('file[2]/Name') # xpath print '{:>15}: {:<30}'.format(el.tag, el.text) </code></pre> Output: <pre class="prettyprint"><code>------------------- Name: some filename.mp3 Encoder: Gogo (after 3.0) Bitrate: 131 ------------------- Name: another filename.mp3 Encoder: iTunes Bitrate: 128 an alternate way: Name: another filename.mp3 </code></pre> If your attraction to a regex is being terse, here is an equally incomprehensible bit of list comprehension to create a data structure: <pre class="prettyprint"><code>[(ch.tag,ch.text) for e in tree.findall('file') for ch in e.getchildren()] </code></pre> Which creates a list of tuples of the XML children of <code><file></code> in document order: <pre class="prettyprint"><code>[('Name', 'some filename.mp3'), ('Encoder', 'Gogo (after 3.0)'), ('Bitrate', '131'), ('Name', 'another filename.mp3'), ('Encoder', 'iTunes'), ('Bitrate', '128')] </code></pre> With a few more lines and a little more thought, obviously, you can create any data structure that you want from XML with ElementTree. It is part of the Python distribution. Edit Code golf is on! <pre class="prettyprint"><code>[{item.tag: item.text for item in ch} for ch in tree.findall('file')] [ {'Bitrate': '131', 'Name': 'some filename.mp3', 'Encoder': 'Gogo (after 3.0)'}, {'Bitrate': '128', 'Name': 'another filename.mp3', 'Encoder': 'iTunes'}] </code></pre> If your XML only has the <code>file</code> section, you can choose your golf. If your XML has other tags, other sections, you need to account for the section the children are in and you will need to use <code>findall</code> There is a tutorial on ElementTree at Effbot.org

Use ElementTree. You don't need/want to muck about with a parse-only gadget like <code>pyexpat</code> ... you'd only end up re-inventing ElementTree partially and poorly. Another possibility is lxml which is a third-party package which implements the ElementTree interface plus more. Update Someone started playing code-golf; here's my entry, which actually creates the data structure you asked for: <pre class="prettyprint"><code># xs = """<encspot> etc etc </encspot""" >>> import xml.etree.cElementTree as et >>> from pprint import pprint as pp >>> pp([dict((attr.tag, attr.text) for attr in el) for el in et.fromstring(xs)]) [{'Bitrate': '131', 'Encoder': 'Gogo (after 3.0)', 'Frame': 'no', 'Frames': '6255', 'Freq.': '44100', 'Length': '00:02:43', 'Mode': 'joint stereo', 'Name': 'some filename.mp3', 'Quality': 'good', 'Size': '5,236,644'}, {'Bitrate': '0', 'Name': 'foo.mp3'}] >>> </code></pre> You'd probably want to have a dict mapping "attribute" names to conversion functions: <pre class="prettyprint"><code>converters = { 'Frames': int, 'Size': lambda x: int(x.replace(',', '')), # etc } </code></pre>

Parse XML file into Python object

Tags:

I have an XML file which looks like this:

<encspot>
  <file>
   <Name>some filename.mp3</Name>
   <Encoder>Gogo (after 3.0)</Encoder>
   <Bitrate>131</Bitrate>
   <Mode>joint stereo</Mode>
   <Length>00:02:43</Length>
   <Size>5,236,644</Size>
   <Frame>no</Frame>
   <Quality>good</Quality>
   <Freq.>44100</Freq.>
   <Frames>6255</Frames>
   ..... and so forth ......
  </file>
  <file>....</file>
</encspot>

I want to read it into a python object, something like a list of dictionaries. Because the markup is absolutely fixed, I'm tempted to use regex (I'm quite good at using those). However, I thought I'll check if someone knows how to easily avoid regexes here. I don't have much experience with SAX or other parsing, though, but I'm willing to learn.

I'm looking forward to be shown how this is done quickly without regexes in Python. Thanks for your help!

728

asked Apr 03 '11 16:04

Felix Dombek

2 Answers

My beloved SD Chargers hat is off to you if you think a regex is easier than this:

#!/usr/bin/env python
import xml.etree.cElementTree as et

sxml="""
<encspot>
  <file>
   <Name>some filename.mp3</Name>
   <Encoder>Gogo (after 3.0)</Encoder>
   <Bitrate>131</Bitrate>
  </file>
  <file>
   <Name>another filename.mp3</Name>
   <Encoder>iTunes</Encoder>
   <Bitrate>128</Bitrate>  
  </file>
</encspot>
"""
tree=et.fromstring(sxml)

for el in tree.findall('file'):
    print '-------------------'
    for ch in el.getchildren():
        print '{:>15}: {:<30}'.format(ch.tag, ch.text) 

print "\nan alternate way:"  
el=tree.find('file[2]/Name')  # xpath
print '{:>15}: {:<30}'.format(el.tag, el.text)

Output:

-------------------
           Name: some filename.mp3             
        Encoder: Gogo (after 3.0)              
        Bitrate: 131                           
-------------------
           Name: another filename.mp3          
        Encoder: iTunes                        
        Bitrate: 128                           

an alternate way:
           Name: another filename.mp3

If your attraction to a regex is being terse, here is an equally incomprehensible bit of list comprehension to create a data structure:

[(ch.tag,ch.text) for e in tree.findall('file') for ch in e.getchildren()]

Which creates a list of tuples of the XML children of <file> in document order:

[('Name', 'some filename.mp3'), 
 ('Encoder', 'Gogo (after 3.0)'), 
 ('Bitrate', '131'), 
 ('Name', 'another filename.mp3'), 
 ('Encoder', 'iTunes'), 
 ('Bitrate', '128')]

With a few more lines and a little more thought, obviously, you can create any data structure that you want from XML with ElementTree. It is part of the Python distribution.

Edit

Code golf is on!

[{item.tag: item.text for item in ch} for ch in tree.findall('file')] 
[ {'Bitrate': '131', 
   'Name': 'some filename.mp3', 
   'Encoder': 'Gogo (after 3.0)'}, 
  {'Bitrate': '128', 
   'Name': 'another filename.mp3', 
   'Encoder': 'iTunes'}]

If your XML only has the file section, you can choose your golf. If your XML has other tags, other sections, you need to account for the section the children are in and you will need to use findall

There is a tutorial on ElementTree at Effbot.org

172

answered Sep 28 '22 16:09

the wolf

Use ElementTree. You don't need/want to muck about with a parse-only gadget like pyexpat ... you'd only end up re-inventing ElementTree partially and poorly.

Another possibility is lxml which is a third-party package which implements the ElementTree interface plus more.

Update Someone started playing code-golf; here's my entry, which actually creates the data structure you asked for:

# xs = """<encspot> etc etc </encspot"""
>>> import xml.etree.cElementTree as et
>>> from pprint import pprint as pp
>>> pp([dict((attr.tag, attr.text) for attr in el) for el in et.fromstring(xs)])
[{'Bitrate': '131',
  'Encoder': 'Gogo (after 3.0)',
  'Frame': 'no',
  'Frames': '6255',
  'Freq.': '44100',
  'Length': '00:02:43',
  'Mode': 'joint stereo',
  'Name': 'some filename.mp3',
  'Quality': 'good',
  'Size': '5,236,644'},
 {'Bitrate': '0', 'Name': 'foo.mp3'}]
>>>

You'd probably want to have a dict mapping "attribute" names to conversion functions:

converters = {
    'Frames': int,
    'Size': lambda x: int(x.replace(',', '')),
    # etc
    }

answered Sep 28 '22 15:09

John Machin

Related questions
                            
                                Is it wise to access read-only data from multiple threads simultaneously?
                            
                                Java: Filling a BufferedImage with transparent pixels
                            
                                SQL Between clause with strings columns
                            
                                javascript: pass function as a parameter to another function, the code gets called in another order then i expect
                            
                                iOS unit test: How to set/update/examine firstResponder?
                            
                                Why string==string comparison failing?
                            
                                django-taggit: make the tags not required in the admin
                            
                                Assign Null value to the Integer Column in the DataTable
                            
                                Tornado login Examples/Tutorials [closed]
                            
                                How can I disable the ENTER key on my textarea?
                            
                                Adding arrays to multi-dimensional array within loop
                            
                                Why should I not check bin and obj folders in, in SVN

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With