Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove namespace and prefix from xml in python using lxml

I have an xml file I need to open and make some changes to, one of those changes is to remove the namespace and prefix and then save to another file. Here is the xml:

<?xml version='1.0' encoding='UTF-8'?> <package xmlns="http://apple.com/itunes/importer">   <provider>some data</provider>   <language>en-GB</language> </package> 

I can make the other changes I need, but can't find out how to remove the namespace and prefix. This is the reusklt xml I need:

<?xml version='1.0' encoding='UTF-8'?> <package>   <provider>some data</provider>   <language>en-GB</language> </package> 

And here is my script which will open and parse the xml and save it:

metadata = '/Users/user1/Desktop/Python/metadata.xml' from lxml import etree parser = etree.XMLParser(remove_blank_text=True) open(metadata) tree = etree.parse(metadata, parser) root = tree.getroot() tree.write('/Users/user1/Desktop/Python/done.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8') 

So how would I add code in my script which will remove the namespace and prefix?

like image 599
speedyrazor Avatar asked Aug 10 '13 06:08

speedyrazor


People also ask

What does LXML do in Python?

lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.

What is namespace in @xmlelement?

XML namespaces provide a method for qualifying the names of XML elements and XML attributes in XML documents. A qualified name consists of a prefix and a local name, separated by a colon. The prefix functions only as a placeholder; it is mapped to a URI that specifies a namespace.


2 Answers

We can get the desired output document in two steps:

  1. Remove namespace URIs from element names
  2. Remove unused namespace declarations from the XML tree

Example code

from lxml import etree  input_xml = """ <package xmlns="http://apple.com/itunes/importer">   <provider>some data</provider>   <language>en-GB</language>   <!-- some comment -->   <?xml-some-processing-instruction ?> </package> """ root = etree.fromstring(input_xml)  # Iterate through all XML elements for elem in root.getiterator():     # Skip comments and processing instructions,     # because they do not have names     if not (         isinstance(elem, etree._Comment)         or isinstance(elem, etree._ProcessingInstruction)     ):         # Remove a namespace URI in the element's name         elem.tag = etree.QName(elem).localname  # Remove unused namespace declarations etree.cleanup_namespaces(root)  print(etree.tostring(root).decode()) 

Output XML

<package>   <provider>some data</provider>   <language>en-GB</language>   <!-- some comment -->   <?xml-some-processing-instruction ?> </package> 

Details explaining the code

As described in the documentation, we use lxml.etree.QName.localname to get local names of elements, that is names without namespace URIs. Then we replace the fully qualified names of the elements by their local names.

Some XML elements, such as comments and processing instructions do not have names. So, we have to skip these elements while replacing element names, otherwise a ValueError will be raised.

Finally, we use lxml.etree.cleanup_namespaces() to remove unused namespace declarations from the XML tree.

like image 87
SergiyKolesnikov Avatar answered Sep 28 '22 04:09

SergiyKolesnikov


Replace tag as Uku Loskit suggests. In addition to that, use lxml.objectify.deannotate.

from lxml import etree, objectify  metadata = '/Users/user1/Desktop/Python/metadata.xml' parser = etree.XMLParser(remove_blank_text=True) tree = etree.parse(metadata, parser) root = tree.getroot()  ####     for elem in root.getiterator():     if not hasattr(elem.tag, 'find'): continue  # guard for Comment tags     i = elem.tag.find('}')     if i >= 0:         elem.tag = elem.tag[i+1:] objectify.deannotate(root, cleanup_namespaces=True) ####  tree.write('/Users/user1/Desktop/Python/done.xml',            pretty_print=True, xml_declaration=True, encoding='UTF-8') 

Note: Some tags like Comment return a function when accessing tag attribute. added a guard for that.

like image 29
falsetru Avatar answered Sep 28 '22 05:09

falsetru