How to parse unicode strings with minidom?

Tags: python, unicode, minidom

I'm trying to parse a bunch of XML files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the files parse fine, but for some of them I get the following error when calling minidom.parseString():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

It happens for some other non-ASCII characters too. My question is: what are my options here? Am I supposed to somehow strip or replace all those non-English characters before I can parse the XML files?

887

asked Mar 16 '11 18:03

dariopy

1 Answer

Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.

If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
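A minimal sketch of both options, written for Python 3. The sample document, the temporary file, and all variable names are made up for illustration; the only minidom calls used are parseString() and parse(). The last variant, encoding a Unicode string that is already in memory to UTF-8 bytes before parsing, is an assumption that follows from the advice above rather than something stated in it.

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample document containing U+2019, the character from the error.
xml_text = u'<?xml version="1.0" encoding="utf-8"?><note>It\u2019s fine</note>'

# Write it to a temporary file as UTF-8 bytes so there is something to read.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_text.encode("utf-8"))
    path = tmp.name

try:
    # Option 1: read the file as raw bytes and hand them to parseString().
    with open(path, "rb") as f:
        doc_from_bytes = minidom.parseString(f.read())

    # Option 2: let parse() open and read the file itself.
    doc_from_file = minidom.parse(path)
finally:
    os.remove(path)

# Assumed variant: if the XML is already a Unicode string, encode it first
# (the encoding must match the one named in the XML declaration).
doc_from_text = minidom.parseString(xml_text.encode("utf-8"))

text_from_bytes = doc_from_bytes.documentElement.firstChild.data
text_from_file = doc_from_file.documentElement.firstChild.data
text_from_text = doc_from_text.documentElement.firstChild.data
```

In all three cases the parser sees bytes and decodes them according to the XML declaration, so no characters need to be stripped or replaced beforehand.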

172

answered Oct 19 '22 09:10

bobince
