
Bash, Remove empty XML tags

Tags: linux, bash, xml, sed

I need some help with a couple of questions, using bash tools.

  1. I want to remove empty XML tags from a file, e.g.:
 <CreateOfficeCode>
      <OperatorId>ve</OperatorId>
      <OfficeCode>1234</OfficeCode>
      <CountryCodeLength>0</CountryCodeLength>
      <AreaCodeLength>3</AreaCodeLength>
      <Attributes></Attributes>
      <ChargeArea></ChargeArea>
 </CreateOfficeCode>

to become:

 <CreateOfficeCode>
      <OperatorId>ve</OperatorId>
      <OfficeCode>1234</OfficeCode>
      <CountryCodeLength>0</CountryCodeLength>
      <AreaCodeLength>3</AreaCodeLength>
 </CreateOfficeCode>

I have done this with the following command:

sed -i '/><\//d' file

This is not very strict; it's more of a trick. Something more appropriate would be to find the <pattern></pattern> and remove it. Any suggestions?

  2. Second, how do I go from:
 <CreateOfficeGroup>
       <CreateOfficeName>John</CreateOfficeName>
       <CreateOfficeCode>
       </CreateOfficeCode>
 </CreateOfficeGroup>

to:

 <CreateOfficeGroup>
       <CreateOfficeName>John</CreateOfficeName>
 </CreateOfficeGroup>

  3. And how do I do both at once? From:
 <CreateOfficeGroup>
       <CreateOfficeName>John</CreateOfficeName>
       <CreateOfficeCode>
            <OperatorId>ve</OperatorId>
            <OfficeCode>1234</OfficeCode>
            <CountryCodeLength>0</CountryCodeLength>
            <AreaCodeLength>3</AreaCodeLength>
            <Attributes></Attributes>
            <ChargeArea></ChargeArea>
       </CreateOfficeCode>
       <CreateOfficeSize>
            <Chairs></Chairs>
            <Tables></Tables>
       </CreateOfficeSize>
 </CreateOfficeGroup>

to:

 <CreateOfficeGroup>
       <CreateOfficeName>John</CreateOfficeName>
       <CreateOfficeCode>
            <OperatorId>ve</OperatorId>
            <OfficeCode>1234</OfficeCode>
            <CountryCodeLength>0</CountryCodeLength>
            <AreaCodeLength>3</AreaCodeLength>
       </CreateOfficeCode>
 </CreateOfficeGroup>

Can you answer the questions individually? Thank you very much!

asked Nov 04 '14 12:11 by thahgr


2 Answers

XMLStarlet is a command-line XML processor. With it, the task is a one-line operation (until the desired recursive behavior is added), and it works for every syntactic variant of XML that describes the same input:

The simple version:

xmlstarlet ed \
  -d '//*[not(./*) and (not(./text()) or normalize-space(./text())="")]' \
  input.xml

The fancy version:

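strip_recursively() {
  local doc last_doc
  # read all of stdin into doc (read returns non-zero at EOF; that is expected here)
  IFS= read -r -d '' doc
  while :; do
    last_doc=$doc
    # delete every childless element that has no non-whitespace text
    doc=$(xmlstarlet ed \
           -d '//*[not(./*) and (not(./text()) or normalize-space(./text())="")]' \
           /dev/stdin <<<"$last_doc")
    # stop once a pass leaves the document unchanged
    if [[ $doc = "$last_doc" ]]; then
      printf '%s\n' "$doc"
      return
    fi
  done
}
strip_recursively <input.xml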

/dev/stdin is used rather than - (at some cost to platform portability) for better portability across releases of XMLStarlet; adjust to taste.
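
As a small illustration of the "all variants of XML syntax" point, here is a hedged Python sketch (not part of the original answer) showing that a parser reports the same empty element however it is spelled. The helper name `is_empty_leaf` and the sample strings are made up for this example; the test mirrors the idea of the XPath predicate above rather than reproducing XMLStarlet itself.

```python
import xml.etree.ElementTree as etree

# Four spellings of the same document: an <r> element holding one empty <a>.
variants = [
    "<r><a></a></r>",
    "<r><a/></r>",
    "<r><a />\n</r>",
    "<r><a\n></a></r>",
]


def is_empty_leaf(el):
    # No child elements and no non-whitespace text.
    return len(el) == 0 and (el.text is None or el.text.strip() == "")


results = []
for xml_text in variants:
    root = etree.fromstring(xml_text)
    results.append([is_empty_leaf(child) for child in root])

print(results)
```

A line-oriented pattern such as `><\/` would only recognize the first spelling, which is the main reason to prefer a real parser here.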


On a system with only older dependencies installed, the XML parser most likely to be available is the one bundled with Python.

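#!/usr/bin/env python3

import sys
import xml.etree.ElementTree as etree


def prune(parent):
    """Repeatedly remove empty leaf elements below parent; return True if anything was removed."""
    ever_changed = False
    while True:
        changed = False
        # list(parent) copies the children (getchildren() was removed in Python 3.9)
        for el in list(parent):
            if len(el) == 0:
                # remove a leaf only if it has no text and no non-blank text follows it
                if ((el.text is None or el.text.strip() == '') and
                        (el.tail is None or el.tail.strip() == '')):
                    parent.remove(el)
                    changed = True
            else:
                # call prune first so the subtree is visited even if changed is already True
                changed = prune(el) or changed
        ever_changed = changed or ever_changed
        if not changed:
            return ever_changed


doc = etree.parse(sys.stdin)
prune(doc.getroot())
print(etree.tostring(doc.getroot(), encoding='unicode'))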
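
The following is a hedged sketch, not part of the original answer, of why both approaches keep looping until nothing changes. It assumes Python 3 and a shortened version of the sample from the question; `SAMPLE`, `is_blank`, and `prune_once` are hypothetical names introduced for illustration. A single pass removes only elements that are already empty; a parent such as CreateOfficeSize becomes empty only after that pass, so a second pass is needed to remove it.

```python
import xml.etree.ElementTree as etree

SAMPLE = """<CreateOfficeGroup>
  <CreateOfficeName>John</CreateOfficeName>
  <CreateOfficeCode>
    <OperatorId>ve</OperatorId>
    <Attributes></Attributes>
  </CreateOfficeCode>
  <CreateOfficeSize>
    <Chairs></Chairs>
    <Tables></Tables>
  </CreateOfficeSize>
</CreateOfficeGroup>"""


def is_blank(text):
    return text is None or text.strip() == ""


def prune_once(parent):
    """One pass: remove childless, blank children; return how many were removed."""
    removed = 0
    for el in list(parent):  # iterate over a copy so removal is safe
        if len(el) == 0 and is_blank(el.text) and is_blank(el.tail):
            parent.remove(el)
            removed += 1
        else:
            removed += prune_once(el)
    return removed


root = etree.fromstring(SAMPLE)
removed_per_pass = []
while True:
    removed = prune_once(root)
    if removed == 0:
        break
    removed_per_pass.append(removed)

print(removed_per_pass)                 # removals in each pass that changed something
print([el.tag for el in root.iter()])   # elements that survive
```

With this sample the first pass removes the three empty leaves and the second removes the now-empty CreateOfficeSize, which is the case from the third question.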

answered Sep 20 '22 01:09 by Charles Duffy


sed '#n
1h;1!H
$ { x
:remtag
  s#\(\n* *\)*<\([^>]*>\)\( *\n*\)*</\2##g
  t remtag

  p
  }' YourFile

(This is the POSIX version, so use --posix with GNU sed.)

  • Recursively removes empty tags, from the lowest level up, until no empty tags remain.
  • This is not an XML parser, so with something like <tag1 prop="<tag2></tag2>"> ... it will also remove the prop content, along with any similar construct that XML allows.
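
To make the caveat concrete, here is a hedged Python sketch. The regular expression is an approximate translation of the sed substitution above (an assumption, not the original script), and `EMPTY_PAIR` and `strip_empty_pairs` are names invented for this example. It uses a CDATA section instead of the attribute example, because a raw `<` inside an attribute value is not well-formed XML and a parser would reject it.

```python
import re
import xml.etree.ElementTree as etree

# Approximation of the sed pattern: optional whitespace, <name>,
# optional whitespace, then the matching </name>.
EMPTY_PAIR = re.compile(r'\s*<([^>]*>)\s*</\1')


def strip_empty_pairs(text):
    """Apply the substitution repeatedly until the text stops changing."""
    while True:
        new_text = EMPTY_PAIR.sub('', text)
        if new_text == text:
            return text
        text = new_text


# Intended behavior: <c> goes first, then the now-empty <b>.
nested = "<a>\n  <b>\n    <c></c>\n  </b>\n  <d>x</d>\n</a>"
print(strip_empty_pairs(nested))

# Caveat: to a parser the CDATA content is plain text, but the regex edits it.
cdata = "<doc><![CDATA[<b></b>]]></doc>"
print(etree.fromstring(cdata).text)
print(strip_empty_pairs(cdata))
```

The regex version handles the nested case as intended but silently changes the text inside the CDATA section, which a parser-based approach would leave alone.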

answered Sep 19 '22 01:09 by NeronLeVelu