
How to parse strings representing xml.dom.minidom nodes in python?

Tags: python, xml

I have a collection of xml.dom.Node objects created using xml.dom.minidom. I store them (individually) in a database by converting each one to a string using the Node object's toxml() method.

The problem is that I would sometimes like to convert them back into the appropriate Node object using a parser of some kind. As far as I can see, the various libraries shipped with Python use Expat, which won't parse a string like '' or indeed anything that is not a well-formed XML document.

So, does anyone have any ideas? I realise I could pickle the nodes in some way and then unpickle them, but that feels unpleasant and I'd much rather be storing in a form I can read for maintenance purposes. Surely there is something that will do this?

In response to the doubt expressed that this is possible, an example of what I mean:

>>> import xml.dom.minidom
>>> x=xml.dom.minidom.parseString('<a>foo<b>thing</b></a>')
>>> x.documentElement.childNodes[0]
<DOM Text node "u'foo'">
>>> x.documentElement.childNodes[0].toxml()
u'foo'
>>> xml.dom.minidom.parseString(x.documentElement.childNodes[0].toxml())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0

In other words, the .toxml() method does not always produce something that Expat (and hence parseString out of the box) will parse.

What I would like is something that will parse u'foo' into a text node, i.e. something that reverses the effect of .toxml().

Francis Davey asked Jan 22 '23 00:01


2 Answers

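from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

try:
    # parseString() raises ExpatError when the input is not a well-formed XML document
    node = parseString('')
except ExpatError:
    node = None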
Tomalak answered Jan 29 '23 15:01


What types of node do you need to store?

Obviously Element nodes should just work if serialised with .toxml('utf-8'); the results should be parseable as an XML document as-is and the element retrievable from documentElement, as long as there are no EntityReferences inside it that would need definition in the doctype.
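As a minimal sketch of the element case, assuming Python 3 (where toxml('utf-8') returns bytes) and the sample document from the question, the round trip might look like this. The variable names are illustrative only:

```python
from xml.dom import minidom

# Sample document taken from the question.
doc = minidom.parseString('<a>foo<b>thing</b></a>')
element = doc.documentElement.childNodes[1]   # the <b> element

# Serialise for storage; on Python 3 this is a bytes object.
stored = element.toxml('utf-8')

# Parse the stored bytes as a document of their own and take its root element.
restored = minidom.parseString(stored).documentElement

print(restored.tagName)
print(restored.toxml())
```

Note that the restored element belongs to a freshly created Document, not to the document the original node came from.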

Text nodes, on the other hand, would need either HTML-decoding or some wrapping to parse. If you only needed elements and text nodes you could guess whether it was an element from the first character, since that must always be < for an element:

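from xml.dom import minidom

xml = node.toxml('utf-8')  # UTF-8 encoded byte string

...

# escaped text can never start with '<', so a leading '<' means an element
if xml.startswith(b'<'):
    node = minidom.parseString(xml).documentElement
else:
    # wrap the text in a dummy element so that it parses as a document
    node = minidom.parseString(b'<x>' + xml + b'</x>').documentElement.firstChild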

Comment nodes could similarly be stored by checking for <!--.
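A rough sketch of how the comment check could be combined with the element and text cases is shown below. It assumes Python 3 and strings produced by toxml() without an encoding argument; node_from_xml is a hypothetical helper name, not part of minidom. The '<!--' test has to come before the generic '<' test, because a comment on its own is not a parseable document either:

```python
from xml.dom import minidom

def node_from_xml(xml):
    """Rebuild an Element, Comment or Text node from its toxml() string.

    Hypothetical helper: it guesses the node type from the leading characters.
    """
    if xml.startswith('<') and not xml.startswith('<!--'):
        # An element is a complete document in its own right.
        return minidom.parseString(xml).documentElement
    # Comments and text are wrapped in a dummy element, then unwrapped again.
    return minidom.parseString('<x>%s</x>' % xml).documentElement.firstChild

doc = minidom.parseString('<a>foo<!--note--><b>thing</b></a>')
for child in doc.documentElement.childNodes:
    restored = node_from_xml(child.toxml())
    print(restored.nodeType == child.nodeType, restored.toxml())
```

An empty text node serialises to an empty string, for which this sketch returns None.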

Other node types like Attr would be more work since their XML representation is not easily distinguishable from Text. You would probably need to store an out-of-band nodeType value to remember it. OTOH minidom doesn't implement toxml() on Attr anyway so maybe that's not an issue.
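If the node type is stored out-of-band, the guessing step can be dropped. The following is only a sketch under that assumption, for Python 3: dump_node and load_node are hypothetical names, the pair they exchange stands in for two database columns, and Attr nodes are not covered:

```python
from xml.dom import minidom, Node

def dump_node(node):
    """Return a (nodeType, xml) pair; both values would be stored together."""
    return node.nodeType, node.toxml()

def load_node(node_type, xml):
    """Rebuild a node from a pair produced by dump_node()."""
    if node_type == Node.ELEMENT_NODE:
        restored = minidom.parseString(xml).documentElement
    else:
        # Text, comments and similar nodes need a dummy wrapper element to parse.
        restored = minidom.parseString('<x>%s</x>' % xml).documentElement.firstChild
    # Guard against a stored type that does not match what was actually parsed.
    if restored is None or restored.nodeType != node_type:
        raise ValueError('stored XML does not match node type %r' % node_type)
    return restored

doc = minidom.parseString('<a>foo<b>thing</b></a>')
node_type, xml = dump_node(doc.documentElement.firstChild)
print(load_node(node_type, xml).data)
```

The final type check is a design choice: it turns a mismatch between the stored type and the stored XML into an explicit error instead of silently returning the wrong kind of node.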

bobince answered Jan 29 '23 16:01