
Python: Unicode and ElementTree.parse

I'm moving to Python 2.7, and since Unicode is a Big Deal there, I tried handling XML files and text as unicode and parsing them with the xml.etree.cElementTree library. But I ran into this error:

>>> import xml.etree.cElementTree as ET
>>> from io import StringIO
>>> source = """\
... <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
... <root>
...   <Parent>
...     <Child>
...       <Element>Text</Element>
...     </Child>
...   </Parent>
... </root>
... """
>>> srcbuf = StringIO(source.decode('utf-8'))
>>> doc = ET.parse(srcbuf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

The same thing happens when I pass a file opened with io.open('filename.xml', encoding='utf-8') to ET.parse:

>>> import io
>>> with io.open('test.xml', mode='w', encoding='utf-8') as fp:
...     fp.write(source.decode('utf-8'))
...
150L
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     fp.read()
...
u'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>\n<root>\n  <Parent>\n    <Child>\n      <Element>Text</Element>\n    </Child>\n  </Parent>\n</root>\n'
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     ET.parse(fp)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

Is there something about unicode and ET parsing that I am missing here?

Edit: Apparently the ET parser does not play well with a unicode input stream. The following works:

>>> with io.open('test.xml', mode='rb') as fp:
...     ET.parse(fp)
...
<ElementTree object at 0x0180BC10>

But this also means I cannot use io.StringIO if I want to parse in-memory text, unless I first encode it into an in-memory byte buffer?
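As a minimal sketch of that workaround: encode the text back to bytes and wrap it in io.BytesIO, so the parser receives a byte stream just like the 'rb' file above. The sketch imports xml.etree.ElementTree instead of cElementTree (an assumption made so it also runs on newer Pythons, where cElementTree no longer exists), and the names srcbuf and element_text are only illustrative.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stands in for cElementTree

# Text (unicode) version of the document from the question.
source = u"""<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>
    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>
"""

# Encode to the encoding named in the XML declaration, then hand the
# parser an in-memory *byte* buffer instead of a text buffer.
srcbuf = io.BytesIO(source.encode("utf-8"))
doc = ET.parse(srcbuf)

element_text = doc.getroot().find("Parent/Child/Element").text
print(element_text)
```

The encoding passed to encode() should match the one in the XML declaration, since the parser decodes the bytes according to that declaration.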

Santa asked Aug 05 '10 19:08


2 Answers

Can't you use

doc = ET.fromstring(source)

in your first example?
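A rough sketch of what that could look like, assuming the source is kept as a byte string (as it is in the Python 2 session in the question). Note that fromstring returns the root Element rather than an ElementTree, so the sketch wraps it in ET.ElementTree for code that expects a tree. The import of xml.etree.ElementTree in place of cElementTree and the name child_tags are assumptions for illustration.

```python
import xml.etree.ElementTree as ET  # assumption: stands in for cElementTree

# Byte-string version of the document from the question.
source = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>
    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>
"""

# fromstring parses directly from a string; no file-like object is needed.
root = ET.fromstring(source)

# fromstring returns an Element; wrap it if an ElementTree is required.
doc = ET.ElementTree(root)

child_tags = [child.tag for child in root]
print(child_tags)
```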

Andre Holzner answered Sep 28 '22 09:09


I encountered the same problem as you in Python 2.6.

It seems that cElementTree.parse handles "utf-8" input differently in Python 2.x and 3.x. In Python 2.x, we can pass an XMLParser with an explicit encoding, which overrides the encoding declared in the file. For example:

import xml.etree.cElementTree as etree

# An explicit encoding on the parser overrides the one declared in the file.
parser = etree.XMLParser(encoding="utf-8")
targetTree = etree.parse("./targetPageID.xml", parser=parser)
pageIds = targetTree.find("categorymembers")
print "pageIds:", etree.tostring(pageIds)

You can refer to this page for the XMLParser method (Section "XMLParser"): http://effbot.org/zone/elementtree-13-intro.htm
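To make the effect of the encoding argument visible, here is a small sketch that parses bytes with no XML declaration at all. It assumes the data is ISO-8859-1 (one of the encodings the underlying expat parser supports natively); the sample document and the names raw and root are made up for illustration, and xml.etree.ElementTree is imported in place of cElementTree so the sketch runs on newer Pythons too.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stands in for cElementTree

# Hypothetical ISO-8859-1 bytes with no XML declaration: \xe9 is "e acute".
# Without an override the parser would assume UTF-8 and reject this byte.
raw = b"<root>caf\xe9</root>"

# The encoding given to XMLParser overrides whatever the document declares
# (or, as here, the UTF-8 default used when nothing is declared).
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.parse(io.BytesIO(raw), parser=parser).getroot()

print(repr(root.text))
```

The input is still a byte stream here; the parser argument only changes how those bytes are decoded.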

The following method works in Python 3.x:

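import xml.etree.ElementTree as etree  # cElementTree was deprecated in 3.3 and removed in 3.9
import codecs

# Open the file as decoded text; the Python 3 parser accepts a text stream.
with codecs.open("./targetPageID.xml", mode='r', encoding='utf-8') as target_file:
    targetTree = etree.parse(target_file)

pageIds = targetTree.find("categorymembers")
print("pageIds:", etree.tostring(pageIds))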

Hope this helps.

Xiangju answered Sep 28 '22 08:09