When whitespace is insignificant, representation may be very significant.
In XML Schema Part 2: Datatypes Second Edition the constraining facet whiteSpace is defined for types derived from string (http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace). If this whiteSpace facet is replace or collapse, the value may be changed during normalization.
There is a note at the end of Section 4.3.6:
The notation #xA used here (and elsewhere in this specification) represents the Universal Character Set (UCS) code point hexadecimal A (line feed), which is denoted by U+000A. This notation is to be distinguished from 
, which is the XML character reference to that same UCS code point.
If the datatype for an element elem has a whitespace constraint collapse, "<elem> text </elem>"
should become "text"
(leading and trailing whitespace removed), but "<elem> text </elem>"
should become " text "
(whitespace encoded by character reference not removed).
So either the parser/tree builder handles this normalization or this is done afterwards.
set_whitespace_normalization('./country/neighbor', 'collapse')
?normalize(content)
in the parser or tree builder?elem.original_text
, that may return " text 
"?elem.unnormalized_text
, that may return " text
"?I would like to use Python's xml.etree.ElementTree but I will consider any other XML library that does the job.
Of course it is bad style to declare whitespace insignificant (replace or collapse) and then to cheat by using character references. In most cases either the data or the schema should be changed to prevent that, but sometimes you have to work with foreign XML schemata and foreign XML documents. And the sheer existence of the note cited above indicates that the XML editors were aware of this dilemma and did deliberately not prevent it.
To read an XML file using ElementTree, firstly, we import the ElementTree class found inside xml library, under the name ET (common convension). Then passed the filename of the xml file to the ElementTree. parse() method, to enable parsing of our xml file. Then got the root (parent tag) of our xml file using getroot().
The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. Changed in version 3.3: This module will use a fast implementation whenever available.
This appears to be a known bug in xml.etree.ElementTree: http://bugs.python.org/issue17582. According to that bug report, this is correctly handled in lxml.etree: https://pypi.python.org/pypi/lxml/.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With