How to parse XML containing prefixes but no namespace declarations with lxml?

I have a bunch of XML files that use prefixes without the corresponding namespace declarations.

Stuff like:

<tal:block tal:condition="foo">
...
</tal:block>

or:

<div i18n:domain="my-app">
...

I know where those prefixes come from, and I tried the following, without success:

from lxml import etree as ElementTree

ElementTree.register_namespace("i18n", "http://namespaces.zope.org")
ElementTree.register_namespace("tal", "http://xml.zope.org/namespaces/tal")

with open(path) as fp:
    tree = ElementTree.parse(fp)

but lxml still chokes with:

lxml.etree.XMLSyntaxError: Namespace prefix i18n for domain on div is not defined, line 4, column 20

I know I can use ElementTree.XMLParser(recover=True), but I would like to keep the prefixes, which this method doesn't.

Any idea?

Jonathan Ballet asked May 01 '12 04:05


1 Answer

XML that uses undefined prefixes is not namespace-well-formed, so a namespace-aware parser such as lxml is not going to accept it.

Your best bet (other than fixing the XML) is to programmatically modify the XML source to add the namespace declarations to the root element (plain string handling in your language is enough). Add xmlns:tal="http://xml.zope.org/namespaces/tal", etc. to the root element before you give the XML to the parser. The parser should then handle it without complaint, and without registering any namespaces (register_namespace only affects which prefixes are used on serialization, not parsing).
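A minimal sketch of that approach in Python follows. The helper name declare_namespaces, the NAMESPACES mapping and the ROOT_TAG regular expression are illustrative, not part of lxml. The i18n URI is simply the one used in the question's code, and the sketch assumes the source declares none of these prefixes itself and that no comment or processing instruction before the root element contains something that looks like a start tag.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs where lxml is not installed.
    import xml.etree.ElementTree as etree

# Prefix -> namespace URI. The tal URI is from the answer; the i18n URI
# is the one used in the question's code (adjust it if yours differs).
NAMESPACES = {
    "tal": "http://xml.zope.org/namespaces/tal",
    "i18n": "http://namespaces.zope.org",
}

# First start tag in the document. "<?xml ...?>", "<!-- ... -->" and
# "<!DOCTYPE ...>" are skipped because "?" and "!" cannot start a name.
ROOT_TAG = re.compile(r"<([A-Za-z_][\w.:-]*)")


def declare_namespaces(source, namespaces=NAMESPACES):
    """Return source with xmlns:prefix declarations added to the root element."""
    match = ROOT_TAG.search(source)
    if match is None:
        raise ValueError("no root element found")
    declarations = "".join(
        ' xmlns:%s="%s"' % (prefix, uri) for prefix, uri in namespaces.items()
    )
    # Insert the declarations right after the root element's name.
    return source[:match.end()] + declarations + source[match.end():]


if __name__ == "__main__":
    broken = '<div i18n:domain="my-app"><tal:block tal:condition="foo"/></div>'
    root = etree.fromstring(declare_namespaces(broken))
    # Namespaced names are reported in {uri}local form.
    print(root[0].tag)
```

Because the declarations are injected into the text before parsing, the parser sees ordinary namespaced XML and the prefixes are preserved in the tree's namespace mapping rather than being dropped.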

Francis Upton IV answered Oct 31 '22 01:10