I'm trying to sort my XML alphabetically while ensuring that a specific element stays at the top. I have managed to sort it alphabetically, but I cannot get that element to stay. Here is what I have so far:
from lxml import etree
data = """
<Example xmlns="http://www.example.org">
<E>
<A>A</A>
<B>B</B>
<C>C</C>
</E>
<B>B</B>
<D>D</D>
<A>A</A>
<C>C</C>
<F>F</F>
</Example>
"""
doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))
# Sort the children of every element that has child elements.
for parent in doc.xpath('//*[./*]'):
    parent[:] = sorted(parent, key=lambda x: x.tag)
print(etree.tostring(doc, pretty_print=True).decode('ascii'))
The result from this is:
<Example xmlns="http://www.example.org">
<A>A</A>
<B>B</B>
<C>C</C>
<D>D</D>
<E>
<A>A</A>
<B>B</B>
<C>C</C>
</E>
<F>F</F>
</Example>
Is there any way I can stop the <E></E> part and its contents from moving?
You can handle this in at least two ways. One is to sort everything and force <E> to the top with a custom sort key. The other is to separate out the elements to be sorted, sort them, and append them after the unsorted ones.
Strings are sorted by comparing code points, which you can get for a single character with ord(). The tab character has a lower code point (9) than any character that can start a tag, so we can tell Python to sort every element by its tag as usual, except <E>, which gets a tab as its key and therefore sorts first.
There is some extra code to handle the namespace.
doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))
ns = doc.nsmap  # maps the prefix None to the default namespace URI
for parent in doc.xpath('//*[./*]'):
    # <E> gets the key '\t', which sorts before any namespaced tag.
    parent[:] = sorted(parent, key=lambda x: x.tag if x.tag != '{' + ns[None] + '}E' else '\t')
print(etree.tostring(doc, pretty_print=True).decode('ascii'))
<Example xmlns="http://www.example.org">
<E>
<A>A</A>
<B>B</B>
<C>C</C>
</E>
<A>A</A>
<B>B</B>
<C>C</C>
<D>D</D>
<F>F</F>
</Example>
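As a quick check of why the tab key works, here is a minimal sketch that compares the code points involved. The tag strings are written out by hand in the "{uri}name" form that lxml reports for namespaced elements; nothing here touches lxml itself.

```python
# Strings compare code point by code point, so the first character decides here.
tab = "\t"
namespaced_tag = "{http://www.example.org}A"  # form of a namespaced tag

print(ord(tab))                # 9
print(ord(namespaced_tag[0]))  # 123, the code point of "{"
print(tab < namespaced_tag)    # True

# Sorting a mix of tags and the tab key puts the tab first.
tags = ["{http://www.example.org}B", "\t", "{http://www.example.org}A"]
print(sorted(tags)[0] == "\t")  # True
```

Note that if several children were <E>, they would all share the same key and keep their original relative order, because Python's sort is stable.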
In the second approach, we split each parent's children into two groups, sort only the second group, and then join them.
doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))
ns = doc.nsmap
for parent in doc.xpath('//*[./*]'):
    # Children that get sorted, and children (<E>) that stay in front.
    to_sort = (e for e in parent if e.tag != '{' + ns[None] + '}E')
    non_sort = (e for e in parent if e.tag == '{' + ns[None] + '}E')
    parent[:] = list(non_sort) + sorted(to_sort, key=lambda e: e.tag)
print(etree.tostring(doc, pretty_print=True).decode('ascii'))
<Example xmlns="http://www.example.org">
<E>
<A>A</A>
<B>B</B>
<C>C</C>
</E>
<A>A</A>
<B>B</B>
<C>C</C>
<D>D</D>
<F>F</F>
</Example>
This can also be done as follows. The short tag name is not directly available, so the comparison uses the full tag, including the namespace URI from xmlns. The key is a tuple whose first item is False only for <E>, and False sorts before True:
doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))
for parent in doc.xpath('//*[./*]'):
    parent[:] = sorted(parent,
                       key=lambda x: (x.tag != '{http://www.example.org}E', x.tag))
print(etree.tounicode(doc, pretty_print=True))
This code outputs:
<Example xmlns="http://www.example.org">
<E>
<A>A</A>
<B>B</B>
<C>C</C>
</E>
<A>A</A>
<B>B</B>
<C>C</C>
<D>D</D>
<F>F</F>
</Example>
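To make the tuple key easier to follow, here is a small sketch that applies the same idea to plain tag strings instead of elements. The names PINNED and pinned_first are hypothetical and only used for illustration.

```python
PINNED = "{http://www.example.org}E"  # tag that must stay first (illustrative)

def pinned_first(tag):
    # (False, tag) for the pinned tag, (True, tag) for every other tag.
    # False sorts before True, so the pinned tag leads; the rest sort by tag.
    return (tag != PINNED, tag)

tags = ["{http://www.example.org}" + name for name in "BDEACF"]
ordered = sorted(tags, key=pinned_first)

# Strip the "{uri}" prefix only for display.
local_names = [t.rpartition("}")[2] for t in ordered]
print(local_names)  # ['E', 'A', 'B', 'C', 'D', 'F']
```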
The following code prints these full tags to show what they look like:
doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))
for parent in doc.xpath('//*[./*]'):
    for item in parent:
        print(item.tag)
{http://www.example.org}E
{http://www.example.org}B
{http://www.example.org}D
{http://www.example.org}A
{http://www.example.org}C
{http://www.example.org}F
{http://www.example.org}A
{http://www.example.org}B
{http://www.example.org}C
Another way is to use a helper function that strips the namespace from the tag, which makes the key more readable:
def normalize(name):
    # "{uri}tag" -> "tag"; names without a namespace are returned unchanged.
    if name[0] == "{":
        uri, tag = name[1:].split("}")
        return tag
    else:
        return name
doc = etree.XML(data, etree.XMLParser(remove_blank_text=True))
for parent in doc.xpath('//*[./*]'):
    parent[:] = sorted(parent,
                       key=lambda x: (normalize(x.tag) != 'E', x.tag))
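For readers who want to try the idea without installing lxml, here is a rough end-to-end sketch using the standard library's xml.etree.ElementTree instead. This is a swapped-in library, not the code from the answers above: it has no xpath() method, so the sketch walks the tree with iter(). The names local_name and sort_children, and the shortened sample document, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample document (illustrative), with <E> children out of order.
DATA = """<Example xmlns="http://www.example.org">
<E><C>C</C><A>A</A><B>B</B></E>
<B>B</B><D>D</D><A>A</A><C>C</C><F>F</F>
</Example>"""

def local_name(tag):
    # "{uri}name" -> "name"; tags without a namespace pass through unchanged.
    return tag.rpartition("}")[2]

def sort_children(root, pinned="E"):
    # Snapshot the elements first, since children are reordered in place.
    for parent in list(root.iter()):
        if len(parent):  # only elements that have child elements
            parent[:] = sorted(
                parent, key=lambda e: (local_name(e.tag) != pinned, e.tag))

root = ET.fromstring(DATA)
sort_children(root)
top_level = [local_name(child.tag) for child in root]
nested = [local_name(child.tag) for child in root[0]]
print(top_level)  # ['E', 'A', 'B', 'C', 'D', 'F']
print(nested)     # ['A', 'B', 'C']
```

With lxml itself, etree.QName(element).localname should give the same local name without a hand-written helper.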