
How do I handle whitespace with Python's elementtree?

Problem:

When whitespace is insignificant, representation may be very significant.

Explanation:

In XML Schema Part 2: Datatypes Second Edition the constraining facet whiteSpace is defined for types derived from string (http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace). If this whiteSpace facet is replace or collapse, the value may be changed during normalization.
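To make the three facet values concrete, here is a minimal sketch of the normalization as I read it from the facet definition. The helper name normalize_whitespace is hypothetical and not part of any library. It works on a string that has already been parsed, so it cannot tell literal whitespace from whitespace written as a character reference.

```python
import re


def normalize_whitespace(value, facet):
    """Apply an XML Schema whiteSpace facet to an already-parsed string.

    Hypothetical helper for illustration; not part of xml.etree.ElementTree.
    """
    if facet == "preserve":
        # "preserve": the value is left untouched.
        return value
    # "replace": every tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # "collapse": after "replace", runs of spaces shrink to one space
        # and leading and trailing spaces are dropped.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet: {facet!r}")


print(repr(normalize_whitespace(" a\t\tb \n", "replace")))
print(repr(normalize_whitespace(" a\t\tb \n", "collapse")))
```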

There is a note at the end of Section 4.3.6:

The notation #xA used here (and elsewhere in this specification) represents the Universal Character Set (UCS) code point hexadecimal A (line feed), which is denoted by U+000A. This notation is to be distinguished from &#xA;, which is the XML character reference to that same UCS code point.

Example:

If the datatype for an element elem has a whitespace constraint collapse, "<elem> text </elem>" should become "text" (leading and trailing whitespace removed), but "<elem>&#x20;text&#x20;</elem>" should become " text " (whitespace encoded by character reference not removed).
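For reference, a small check of what xml.etree.ElementTree hands back for these two inputs. It uses only the standard library, and no schema is involved, so no facet is applied. The point is only that both spellings arrive as the same string, so the distinction is already gone by the time elem.text can be inspected.

```python
import xml.etree.ElementTree as ET

literal = ET.fromstring("<elem> text </elem>")
referenced = ET.fromstring("<elem>&#x20;text&#x20;</elem>")

# The parser resolves character references and does no schema-level
# whitespace normalization, so both elements carry the same text.
print(repr(literal.text))               # ' text '
print(repr(referenced.text))            # ' text '
print(literal.text == referenced.text)  # True
```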

Questions:

So either the parser or tree builder handles this normalization, or it is done afterwards.

  • Informed parsing:
    • Where do I provide the parser or tree builder with the information on how to normalize some XML element?
    • Is there something like set_whitespace_normalization('./country/neighbor', 'collapse')?
    • Is there a hook like normalize(content) in the parser or tree builder?
  • Post-processing:
    • How do I access the original content of some element?
    • Is there an elem.original_text that may return "&#x20;text&#x20;"?
    • Is there an elem.unnormalized_text that may return " text "?
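To illustrate what I mean by the original content, here is a rough sketch that goes below ElementTree and uses xml.parsers.expat directly. It assumes that, when no CharacterDataHandler is registered, expat passes text and character references to the DefaultHandler exactly as written in the source. The function name raw_element_content is hypothetical, and child element tags are not reproduced in the result, so this is a starting point and not a complete solution.

```python
import xml.parsers.expat


def raw_element_content(xml_text, tag):
    """Return the source text inside each <tag> element, references unresolved.

    Hypothetical helper for illustration only. Markup of child elements
    is not included in the collected text.
    """
    results = []
    pieces = []
    depth = 0  # greater than 0 while inside a <tag> element

    def start(name, attrs):
        nonlocal depth
        if name == tag:
            depth += 1
            if depth == 1:
                pieces.clear()

    def end(name):
        nonlocal depth
        if name == tag:
            depth -= 1
            if depth == 0:
                results.append("".join(pieces))

    def default(data):
        # Receives character data and references in their source form.
        if depth > 0:
            pieces.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    # Deliberately no CharacterDataHandler: text then falls through
    # to the default handler without being decoded.
    parser.DefaultHandler = default
    parser.Parse(xml_text, True)
    return results


doc = "<r><elem>&#x20;text&#x20;</elem><elem> text </elem></r>"
print(raw_element_content(doc, "elem"))
```

With the raw strings in hand, literal whitespace could be normalized before the character references are resolved, which is the order the example above requires.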

I would like to use Python's xml.etree.ElementTree but I will consider any other XML library that does the job.

Disclaimer:

Of course it is bad style to declare whitespace insignificant (replace or collapse) and then to cheat by using character references. In most cases either the data or the schema should be changed to prevent that, but sometimes you have to work with foreign XML schemata and foreign XML documents. And the mere existence of the note cited above indicates that the XML editors were aware of this dilemma and deliberately did not prevent it.

Yurim asked Jun 07 '13 01:06




1 Answer

This appears to be a known bug in xml.etree.ElementTree: http://bugs.python.org/issue17582. According to that bug report, this is correctly handled in lxml.etree: https://pypi.python.org/pypi/lxml/.

Mark Pundurs answered Nov 01 '22 10:11