Python: parse an XML in Windows-1251 encoding

When I try to parse XML with lxml like this:

tree = etree.parse('xml.xml')

I get the following error:

lxml.etree.XMLSyntaxError: Unsupported encoding windows-1251

How can I read data from an XML with this encoding?

Thank you

Alex asked Apr 27 '11 16:04


1 Answer

I think you are using Python 2.x.

If so, I believe you need the open() function of the codecs module, like this:

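import codecs
import re
from lxml import etree
with codecs.open(filename, 'rb', 'cp1251') as f:
    content = f.read()  # decoded from cp1251 to Unicode
# lxml rejects Unicode input that still carries an encoding declaration
content = re.sub(r'^\s*<\?xml[^>]*\?>', '', content, count=1)
tree = etree.ElementTree(etree.fromstring(content))  # parse() expects a file, not text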

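The content read this way has been decoded from cp1251 to Unicode, although I am no expert in Unicode handling.

After that, etree should be able to parse the Unicode string, though I only know etree a little. One caveat: lxml raises a ValueError for a Unicode string that still carries an encoding declaration, so the <?xml ...?> line has to be stripped, or the text re-encoded to bytes, before parsing.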

Note that codecs.open() always opens the file in binary mode when an encoding is given, even if the mode is 'r'.
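
To make the decode-then-parse idea concrete, here is a minimal sketch. It assumes the file really is cp1251 and that dropping the XML declaration is acceptable; parse_cp1251 is a hypothetical helper name, not part of lxml, and the sample document is made up for illustration. The fallback import is only there so the sketch runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # preferred, matches the question
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree

# Matches a leading declaration such as
# <?xml version="1.0" encoding="windows-1251"?>
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>')


def parse_cp1251(raw_bytes):
    """Decode cp1251 bytes, drop the XML declaration, return the root element.

    Hypothetical helper for illustration; not an lxml function.
    """
    # Decode in Python, so the parser never has to know windows-1251.
    text = raw_bytes.decode('cp1251')
    # The declaration still names windows-1251, which no longer describes
    # the decoded text, so remove it before handing the string over.
    text = XML_DECLARATION.sub('', text, count=1)
    return etree.fromstring(text)


# Build a small cp1251-encoded document in memory instead of reading a file.
sample = (
    u'<?xml version="1.0" encoding="windows-1251"?>'
    u'<doc><name>Привет</name></doc>'
).encode('cp1251')

root = parse_cp1251(sample)
print(root.find('name').text)
```

For a real file, read it in binary mode and pass the bytes in, for example parse_cp1251(open('xml.xml', 'rb').read()). The result is the root element; wrap it in etree.ElementTree(root) if a tree object is needed.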

Hope that helps.

eyquem answered Oct 16 '22 14:10