I'm trying to parse a bunch of xml files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parsestring():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?
minidom is a minimal implementation of the Document Object Model interface, with an API similar to that in other languages. It is intended to be simpler than the full DOM and also significantly smaller. Users who are not already proficient with the DOM should consider using the xml.
toprettyxml. n.toprettyxml(indent='\t',newl='\n') Returns a string, plain or Unicode, with the XML source for the subtree rooted at n, using indent to indent nested tags and newl to end lines. toxml. n.toxml( )
The DOM is a standard tree representation for XML data. The Document Object Model is being defined by the W3C in stages, or “levels” in their terminology. The Python mapping of the API is substantially based on the DOM Level 2 recommendation. DOM applications typically start by parsing some XML into a DOM.
Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString()
, or just use parse()
which will read a file directly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With