Define default namespace (unprefixed) in lxml

Question

When rendering XHTML with lxml, everything is fine, unless you happen to use Firefox, which seems unable to deal with namespace-prefixed XHTML elements and javascript. While Opera is able to execute the javascript (this applies to both jQuery and MathJax) fine, no matter whether the XHTML namespace has a prefix (h: in my case) or not, in Firefox the scripts will abort with weird errors (this.head is undefined in the case of MathJax).

I know about the register_namespace function, but it does neither accept None nor "" as namespace prefix. I've heard about _namespace_map in the lxml.etree module, but my Python complains that this attribute doesn't exist (version issues?)

Is there any other way removing the namespace prefix for the XHTML namespace? Note that str.replace, as suggested in the answer to another, related question, is not a method I could accept, as it is not aware of XML semantics and might easily screw up the resulting document.

As per request, you'll find two examples ready to use. One with namespace prefixes and one without. The first one will display 0 in Firefox (wrong) and the second one will display 1 (correct). Opera will render both correct. This is obviously a Firefox bug, but this only serves as a rationale for wanting prefixless XHTML with lxml – there are other good reasons as to reduce traffic for mobile clients etc (even h: is quite a lot if you consider tens or hundret of html tags).

neirbowj · Accepted Answer

Use ElementMaker and give it an nsmap that maps None to your default namespace.

#!/usr/bin/env python
# dogeml.py

from lxml.builder import ElementMaker
from lxml import etree

E = ElementMaker(
    nsmap={
        None: "http://wow/"    # <--- This is the special sauce
    }
)

doge = E.doge(
    E.such('markup'),
    E.many('very namespaced', syntax="tricks")
)

options = {
    'pretty_print': True,
    'xml_declaration': True,
    'encoding': 'UTF-8',
}

serialized_bytes = etree.tostring(doge, **options)
print(serialized_bytes.decode(options['encoding']))

As you can see in the output from this script, the default namespace is defined, but the tags do not have a prefix.

<?xml version='1.0' encoding='UTF-8'?>
<doge xmlns="http://wow/">
   <such>markup</such>
   <many syntax="tricks">very namespaced</many>
</doge>

I have tested this code with Python 2.7.6, 3.3.5, and 3.4.0, combined with lxml 3.3.1.

unutbu · Answer

This XSL transformation removes all prefixes from content, while maintaining namespaces defined in the root node:

import lxml.etree as ET

content = '''\
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<h:html xmlns:h="http://www.w3.org/1999/xhtml" xmlns:ml="http://foo">
  <h:head>
    <h:title>MathJax Test Page</h:title>
    <h:script type="text/javascript"><![CDATA[
      function test() {
        alert(document.getElementsByTagName("p").length);
      };
    ]]></h:script>
  </h:head>
  <h:body onload="test();">
    <h:p>test</h:p>
    <ml:foo></ml:foo>
  </h:body>
</h:html>
'''
dom = ET.fromstring(content)

xslt = '''\
<xsl:stylesheet version="1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>

<!-- identity transform for everything else -->
<xsl:template match="/|comment()|processing-instruction()|*|@*">
    <xsl:copy>
      <xsl:apply-templates />
    </xsl:copy>
</xsl:template>

<!-- remove NS from XHTML elements -->
<xsl:template match="*[namespace-uri() = 'http://www.w3.org/1999/xhtml']">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()" />
    </xsl:element>
</xsl:template>

<!-- remove NS from XHTML attributes -->
<xsl:template match="@*[namespace-uri() = 'http://www.w3.org/1999/xhtml']">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="." />
    </xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''

xslt_doc = ET.fromstring(xslt)
transform = ET.XSLT(xslt_doc)
dom = transform(dom)

print(ET.tostring(dom, pretty_print = True, 
                  encoding = 'utf-8'))

yields

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>MathJax Test Page</title>
    <script type="text/javascript">
      function test() {
        alert(document.getElementsByTagName("p").length);
      };
    </script>
  </head>
  <body onload="test();">
    <p>test</p>
    <ml:foo xmlns:ml="http://foo"/>
  </body>
</html>

Define default namespace (unprefixed) in lxml

Tags:

python

namespaces

xhtml

xslt

lxml

Jonas Schäfer

2 Answers

neirbowj

unutbu

Recent Activity

Donate For Us

Define default namespace (unprefixed) in lxml

Tags:

python

namespaces

xhtml

xslt

lxml

Jonas Schäfer

2 Answers

neirbowj

unutbu

Related questions

Recent Activity

Donate For Us