How do I handle whitespace with Python's elementtree?

Problem:

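Even when a schema declares whitespace insignificant, the way that whitespace is represented in the document (literal characters versus character references) can be very significant.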

Explanation:

In XML Schema Part 2: Datatypes Second Edition the constraining facet whiteSpace is defined for types derived from string (http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace). If this whiteSpace facet is replace or collapse, the value may be changed during normalization.

There is a note at the end of Section 4.3.6:

The notation #xA used here (and elsewhere in this specification) represents the Universal Character Set (UCS) code point hexadecimal A (line feed), which is denoted by U+000A. This notation is to be distinguished from &#xA;, which is the XML character reference to that same UCS code point.

Example:

If the datatype for an element elem has the whiteSpace facet set to collapse, "<elem> text </elem>" should become "text" (leading and trailing whitespace removed), but "<elem>&#x20;text&#x20;</elem>" should become " text " (whitespace encoded by character reference not removed).
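
A minimal sketch of why this is hard with xml.etree.ElementTree: the underlying parser resolves character references before the tree is built, so both forms of the example arrive as the same string and the distinction is already gone by the time elem.text can be inspected. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Literal spaces around the text.
literal = ET.fromstring("<elem> text </elem>")
# The same spaces written as character references.
referenced = ET.fromstring("<elem>&#x20;text&#x20;</elem>")

# After parsing, the two elements carry identical text,
# so nothing in the tree records which form the source used.
same_text = literal.text == referenced.text
print(repr(literal.text), repr(referenced.text), same_text)
```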

Questions:

So either the parser/tree builder handles this normalization or this is done afterwards.

Informed parsing:
- Where do I provide the parser or tree builder with the information on how to normalize some XML element?
- Is there something like set_whitespace_normalization('./country/neighbor', 'collapse')?
- Is there a hook like normalize(content) in the parser or tree builder?
Post-processing:
- How do I access the original content of some element?
- Is there an elem.original_text, that may return " text "?
- Is there a elem.unnormalized_text, that may return " text "?
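
For the post-processing route, the normalization itself is easy to sketch, assuming the facet for each element is known in advance. The helper names and the RULES mapping below are hypothetical, not part of xml.etree.ElementTree. This sketch cannot tell literal whitespace from whitespace written as character references, because that information is no longer in the tree.

```python
import re
import xml.etree.ElementTree as ET

def replace_ws(value):
    """whiteSpace=replace: each tab, line feed and carriage return becomes a space."""
    return re.sub(r"[\t\n\r]", " ", value)

def collapse_ws(value):
    """whiteSpace=collapse: replace, then squeeze runs of spaces and trim both ends."""
    return re.sub(r" +", " ", replace_ws(value)).strip(" ")

# Hypothetical mapping from ElementPath expressions to normalizers.
RULES = {"./country/neighbor": collapse_ws}

def normalize_tree(root, rules=RULES):
    """Apply the configured normalizer to the text of every matching element."""
    for path, normalizer in rules.items():
        for elem in root.findall(path):
            if elem.text is not None:
                elem.text = normalizer(elem.text)
    return root

root = ET.fromstring(
    "<data><country><neighbor>  Austria \n west </neighbor></country></data>"
)
normalize_tree(root)
print(repr(root.find("./country/neighbor").text))
```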

I would like to use Python's xml.etree.ElementTree but I will consider any other XML library that does the job.
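
One possible workaround that stays within xml.etree.ElementTree is to hide whitespace character references from the parser before parsing and restore them after normalization. The sketch below swaps them for private-use code points. It rests on loud assumptions: the input never contains those private-use characters, and the raw source has no CDATA sections or comments containing text that merely looks like a character reference. All function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

WS_CODE_POINTS = {0x9, 0xA, 0xD, 0x20}
PUA_BASE = 0xE000  # assumption: U+E009, U+E00A, U+E00D, U+E020 are unused in the input

CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def protect_refs(xml_source):
    """Replace whitespace character references with private-use placeholders."""
    def swap(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if cp in WS_CODE_POINTS:
            return chr(PUA_BASE + cp)
        return match.group(0)  # leave every other reference untouched
    return CHAR_REF.sub(swap, xml_source)

def restore_refs(text):
    """Turn the placeholders back into the whitespace characters they stand for."""
    for cp in WS_CODE_POINTS:
        text = text.replace(chr(PUA_BASE + cp), chr(cp))
    return text

def collapse_ws(value):
    """whiteSpace=collapse applied to literal whitespace only."""
    value = re.sub(r"[\t\n\r]", " ", value)
    return re.sub(r" +", " ", value).strip(" ")

def parse_collapsed(xml_source):
    """Parse a single element and return its collapsed text."""
    elem = ET.fromstring(protect_refs(xml_source))
    return restore_refs(collapse_ws(elem.text or ""))

print(repr(parse_collapsed("<elem> text </elem>")))
print(repr(parse_collapsed("<elem>&#x20;text&#x20;</elem>")))
```

Because the placeholders are not whitespace, the collapse step leaves them alone, and only the literal spaces are removed.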

Disclaimer:

Of course it is bad style to declare whitespace insignificant (replace or collapse) and then to cheat by using character references. In most cases either the data or the schema should be changed to prevent that, but sometimes you have to work with foreign XML schemata and foreign XML documents. And the mere existence of the note cited above indicates that the editors of the specification were aware of this dilemma and deliberately chose not to prevent it.

asked Jun 07 '13 01:06

Yurim

1 Answer

This appears to be a known bug in xml.etree.ElementTree: http://bugs.python.org/issue17582. According to that bug report, this is correctly handled in lxml.etree: https://pypi.python.org/pypi/lxml/.

answered Nov 01 '22 10:11

Mark Pundurs

Tags: python, xml, whitespace, xsd, elementtree