
lxml parsing with python: how to with objectify

I am trying to read the XML behind an SPSS file, and I would like to move from etree to objectify.

How can I convert the function below to return an objectify object? I would like to do this because an objectify XML object would be easier for me (as a newbie) to work with, as it is more Pythonic.

def get_etree(path_file):

    from lxml import etree

    with open(path_file, 'r+') as f:
        xml_text = f.read()     
    recovering_parser = etree.XMLParser(recover=True)    
    xml = etree.parse(StringIO(xml_text), parser=recovering_parser)

    return xml

My failed attempt:

def get_etree(path_file):

    from lxml import etree, objectify

    with open(path_file, 'r+') as f:
        xml_text = objectify.fromstring(xml)   

    return xml

But I get this error:

lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI
Boosted_d16 asked Apr 07 '26 11:04



1 Answer

The first, biggest mistake is to read a file into a string and feed that string to an XML parser.

Python will decode the file using your platform's default encoding (unless you pass an encoding to open()), and that step will very likely break anything other than plain ASCII files.
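To see why decoding by guesswork goes wrong, here is a minimal sketch that uses only built-in string methods. The sample document and the helper name try_decode are made up for illustration; the point is only that bytes written in one encoding do not survive being decoded as another.

```python
# Hypothetical file content: an XML document saved as Windows-1252.
# "\u00eb" is the letter e with diaeresis, stored as the single byte 0xEB.
raw = '<?xml version="1.0" encoding="Windows-1252"?><name>Zo\u00eb</name>'.encode("cp1252")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")      # a guess that contradicts the declaration
as_cp1252 = try_decode(raw, "cp1252")   # the encoding the declaration names

print(as_utf8 is None)            # the UTF-8 guess fails on byte 0xEB
print("Zo\u00eb" in as_cp1252)    # the declared encoding round-trips
```

A text-mode open() performs the same kind of decoding behind the scenes, using whichever encoding it was given or defaulted to.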

XML files come in many encodings. You cannot predict them, and you really shouldn't make assumptions about them. XML solves that problem with the XML declaration:

<?xml version="1.0" encoding="Windows-1252"?>

An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
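As a rough illustration of the parser doing this work itself, the sketch below hands lxml raw bytes rather than a decoded string. The sample bytes are invented for the example; with a real file you would pass the path to etree.parse() instead.

```python
from lxml import etree

# Hypothetical document, stored as Windows-1252 bytes exactly as it would sit on disk.
raw = '<?xml version="1.0" encoding="Windows-1252"?><name>Zo\u00eb</name>'.encode("cp1252")

# The parser reads the declaration and decodes the rest of the bytes accordingly.
root = etree.fromstring(raw)
name = root.text

print(name == "Zo\u00eb")
```

No encoding is mentioned anywhere in the Python code; the declaration inside the document is the only source of that information.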

Luckily lxml makes it very easy:

from lxml import etree, objectify

def get_etree(path_file):
    return etree.parse(path_file, parser=etree.XMLParser(recover=True))

def get_objectify(path_file):
    return objectify.parse(path_file)

and

path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)

print(xml1)   # -> <lxml.etree._ElementTree object at 0x02A7B918>
print(xml2)   # -> <lxml.etree._ElementTree object at 0x02A7B878>
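If the goal is attribute-style access, note that objectify.parse() returns an ElementTree just like etree.parse() does; the objectified elements start at getroot(). A small sketch, assuming a made-up document (the element names survey, title and question are hypothetical, not taken from an SPSS file), parsed from an in-memory binary stream so the example is self-contained:

```python
from io import BytesIO

from lxml import objectify

# Hypothetical sample document; the element names are invented for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<survey>
  <title>Demo</title>
  <question id="q1">Age</question>
  <question id="q2">Income</question>
</survey>"""


def get_objectify_root(source):
    """Parse a path or binary file-like object and return the objectified root."""
    tree = objectify.parse(source)
    return tree.getroot()


root = get_objectify_root(BytesIO(SAMPLE))

# Child elements are reachable as attributes of their parent.
title = str(root.title)

# Iterating over a child yields all siblings that share its tag.
question_ids = [q.get("id") for q in root.question]

print(title, question_ids)
```

With a file on disk, get_objectify_root(path) would be called with the path string, so the parser still gets to read the XML declaration itself.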

P.S.: Think hard about whether you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway, or would you much rather reject it and display an error message?

I would do the latter. Using a recovering parser may cause nasty run-time errors later.
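A sketch of the rejecting approach, assuming a hypothetical helper named parse_strict and an invented broken document. It parses without recover=True and turns the parser's XMLSyntaxError into a message instead of continuing with a partial tree:

```python
from io import BytesIO

from lxml import etree

# Hypothetical broken input: <item> is never closed.
BROKEN = b"<root><item>unclosed</root>"


def parse_strict(data):
    """Return (tree, None) for well-formed XML, or (None, message) otherwise."""
    try:
        return etree.parse(BytesIO(data)), None
    except etree.XMLSyntaxError as exc:
        return None, str(exc)


tree, error = parse_strict(BROKEN)

if tree is None:
    print("Rejected:", error)
```

The caller decides what to do with the message (log it, show it, abort), but no code downstream ever sees a half-repaired document.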

Tomalak answered Apr 09 '26 02:04



