
How to format XML document in Linux

I have a large number of XML elements like the following.

<SERVICE>
<NAME>
sh_SEET15002GetReKeyDetails
</NAME>
<ID>642</ID>
</SERVICE>

I want them formatted as shown below, with the text of each element on the same line as its tags. I tried xmllint, but it did not produce this result. How can I do this?

<SERVICE>
<NAME>sh_SEET15002GetReKeyDetails</NAME>
<ID>642</ID>
</SERVICE>
Aditya asked Jan 22 '14 06:01


2 Answers

xmllint --format --recover nonformatted.xml > formatted.xml

For tab indentation:

export XMLLINT_INDENT=`echo -e '\t'`

For four space indentation:

export XMLLINT_INDENT='    '
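
Note that the line breaks inside <NAME> in the question are part of the element's text, so re-indenting alone will most likely leave them in place. A minimal sketch of one way to trim them, using only the Python standard library rather than xmllint: the function name tidy is made up for this example, ET.indent needs Python 3.9 or later, and the code assumes data-style XML with no mixed content (text interleaved with child elements would lose its surrounding whitespace).

```python
import xml.etree.ElementTree as ET


def tidy(xml_text, indent=""):
    """Trim whitespace around element text, then put each element on its own line.

    Hypothetical helper for illustration; assumes the document has no mixed content.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Strip leading/trailing whitespace (including newlines) from the text
        # inside the element and from the text that follows its closing tag.
        if elem.text is not None:
            elem.text = elem.text.strip() or None
        if elem.tail is not None:
            elem.tail = elem.tail.strip() or None
    # Re-insert line breaks between elements; indent="" keeps them flush left,
    # as in the question's desired output. Requires Python 3.9+.
    ET.indent(root, space=indent)
    return ET.tostring(root, encoding="unicode")


source = """<SERVICE>
<NAME>
sh_SEET15002GetReKeyDetails
</NAME>
<ID>642</ID>
</SERVICE>"""

result = tidy(source)
print(result)
```

Passing indent="    " or indent="\t" gives nested indentation instead of the flat layout.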
mturra answered Sep 23 '22 14:09


I do it from gedit. In gedit, you can add any script, in particular a Python script, as an External Tool. The script below reads data from stdin and writes its output to stdout, so it can also be used as a stand-alone program. It pretty-prints the XML and sorts child nodes.

#!/usr/bin/env python
# encoding: utf-8

"""
This is a gedit plug-in to sort and layout XML.

In gedit, to add this tool, open: menu -- Tools -- Manage External Tools...
Create a new tool: click [+] under the list of tools, type in "Sort XML" as tool name,
paste the whole text from this file in the "Edit:" box, then 
configure the tool:
Input: Current selection
Output: Replace current selection

In gedit, to run this tool,
FIRST SELECT THE XML,
then open: menu -- Tools -- External Tools > -- Sort XML

"""


from lxml import etree
import sys
import io

def headerFirst(node):
    """Return the sorting key prefix, so that 'header' will go before any other node
    """
    nodetag=('%s' % node.tag).lower()
    if nodetag.endswith('}header') or nodetag == 'header':
        return '0'
    else:
        return '1'

def get_node_key(node, attr=None):
    """Return the sorting key of an xml node
    using tag and attributes
    """
    if attr is None:
        return '%s' % node.tag + ':'.join([node.get(attr)
                                        for attr in sorted(node.attrib)])
    if attr in node.attrib:
        return '%s:%s' % (node.tag, node.get(attr))
    return '%s' % node.tag


def sort_children(node, attr=None):
    """ Sort children along tag and given attribute.
    if attr is None, sort along all attributes"""
    if not isinstance(node.tag, str):  # PYTHON 2: use basestring instead
        # not a TAG, it is comment or DATA
        # no need to sort
        return
    # sort child along attr
    node[:] = sorted(node, key=lambda child: (headerFirst(child) + get_node_key(child, attr)))
    # and recurse
    for child in node:
        sort_children(child, attr)


def sort(unsorted_stream, sorted_stream, attr=None):
    """Sort XML read from unsorted_stream and write it to sorted_stream (a binary stream)"""
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(unsorted_stream,parser=parser)
    root = tree.getroot()
    sort_children(root, attr)

    # with encoding="UTF-8", tostring() returns bytes (despite the variable name)
    sorted_unicode = etree.tostring(tree, pretty_print=True, xml_declaration=True, encoding="UTF-8")

    sorted_stream.write(sorted_unicode)


#we could do this, 
#sort(sys.stdin, sys.stdout)
#but we want to check selection:

# read raw bytes: PYTHON 3 exposes them as sys.stdin.buffer, PYTHON 2 has no .buffer
inputstr = getattr(sys.stdin, 'buffer', sys.stdin).read()
if not inputstr.strip():
    sys.stderr.write('no XML selected!\n')
    sys.exit(100)

# same for output: write bytes through sys.stdout.buffer when it exists
sort(io.BytesIO(inputstr), getattr(sys.stdout, 'buffer', sys.stdout))

There are two tricky things:

    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(unsorted_stream,parser=parser)

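By default, the parser keeps the whitespace-only text between elements, and the pretty-printer does not re-indent elements that already contain such text, which may produce a strange result.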

    sorted_unicode = etree.tostring(tree, pretty_print=True, xml_declaration=True, encoding="UTF-8")

Likewise, tostring() does not pretty-print unless pretty_print=True is passed.

I configure this tool to work on the current selection and to replace it, because the files I work with usually contain HTTP headers next to the XML. YMMV.

$ python --version
Python 2.7.6

$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:   trusty

If you do not need child node sorting, just comment out the sort_children(root, attr) call in sort().
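
To see what the sorting does before deciding whether to keep it, here is a small sketch of the same ordering rule (a "header" element first, then children ordered by tag and attribute values). It is a simplified re-implementation on top of the standard library's xml.etree.ElementTree, not the lxml script itself; the sample document and the name sort_key are made up for illustration.

```python
import xml.etree.ElementTree as ET


def sort_key(elem):
    """Sort key: 'header' elements first, then by tag plus attribute values."""
    tag = elem.tag.lower()
    is_header = tag == "header" or tag.endswith("}header")
    # Join attribute values in attribute-name order, as the script above does.
    attrs = ":".join(elem.get(name) for name in sorted(elem.attrib))
    return ("0" if is_header else "1") + elem.tag + attrs


def sort_children(elem):
    """Reorder the children of elem in place, then recurse into each child."""
    elem[:] = sorted(elem, key=sort_key)
    for child in elem:
        sort_children(child)


# Hypothetical sample input, chosen only to show the ordering.
root = ET.fromstring('<root><item id="2"/><item id="1"/><Header/><body/></root>')
sort_children(root)
order = [(child.tag, child.get("id")) for child in root]
print(order)
```

The header lands first regardless of case, and the two item elements are ordered by their id values. Sorting changes document order, so skip it for XML where the order of siblings carries meaning.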


UPDATE: v2 places the header in front of anything else; fixed spaces

UPDATE getting lxml on Ubuntu 18.04.3 LTS bionic:

sudo apt install python-pip
pip install --upgrade lxml
$ python --version
Python 2.7.15+
18446744073709551615 answered Sep 19 '22 14:09