How to remove unwanted tags from XML

Question

I have a huge XML document and I want to remove unwanted tags from it. For example:

<orgs>
    <org name="Test1">
        <item>a</item>
        <item>b</item>
    </org>
    <org name="Test2">
        <item>c</item>
        <item>b</item>
        <item>e</item>
    </org>
</orgs>

I want to remove every <item>b</item> element from this XML. Which parser API should I use, given that the XML is very large, and how can I achieve this?

MadProgrammer · Accepted Answer

One approach is to use the Document Object Model (DOM). The drawback, as the name suggests, is that it needs to load the entire document into memory, and Java's DOM API is quite memory hungry. The benefit is that you can take advantage of XPath to find the offending nodes.

Take a closer look at the Java API for XML Processing (JAXP) for more details and for other APIs.
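Because the question stresses that the XML is very large, it may be worth looking at one of those other JAXP APIs. The following is a minimal sketch of a streaming alternative using StAX (javax.xml.stream), which is not part of the DOM approach described in this answer. The class name StreamingItemFilter and its methods are hypothetical, and the sketch assumes that <item> elements are never nested inside each other and that the element name has no namespace. Only one <item> is buffered at a time, so memory use should stay small regardless of document size.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies XML from a Reader to a Writer and drops
// every <item> element whose text content equals the unwanted value.
public class StreamingItemFilter {

    // True for the start or end tag of an <item> element.
    private static boolean isItemTag(XMLEvent event) {
        if (event.isStartElement()) {
            return "item".equals(event.asStartElement().getName().getLocalPart());
        }
        if (event.isEndElement()) {
            return "item".equals(event.asEndElement().getName().getLocalPart());
        }
        return false;
    }

    public static void filter(Reader in, Writer out, String unwanted) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Refuse DTDs; a reasonable default when the input is not fully trusted.
        inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLEventReader reader = inputFactory.createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        List<XMLEvent> buffer = null;            // non-null while inside an <item>
        StringBuilder text = new StringBuilder();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (buffer == null) {
                if (event.isStartElement() && isItemTag(event)) {
                    // Start buffering; the decision is made at the end tag.
                    buffer = new ArrayList<>();
                    buffer.add(event);
                    text.setLength(0);
                } else {
                    writer.add(event);           // everything else passes straight through
                }
            } else {
                buffer.add(event);
                if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                }
                if (event.isEndElement() && isItemTag(event)) {
                    if (!unwanted.equals(text.toString())) {
                        for (XMLEvent buffered : buffer) {
                            writer.add(buffered);
                        }
                    }
                    buffer = null;               // unwanted items are simply discarded
                }
            }
        }
        writer.close();
        reader.close();
    }

    // Convenience wrapper for small inputs and for trying the filter out.
    public static String filterString(String xml, String unwanted) throws XMLStreamException {
        StringWriter out = new StringWriter();
        filter(new StringReader(xml), out, unwanted);
        return out.toString();
    }
}
```

For a real file you would pass a file-backed Reader and Writer instead of strings. Note that this sketch leaves the whitespace that preceded each removed element in place, so the output may contain blank lines.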

Step 1: Load the document

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new File("..."));

Step 2: Find the offending nodes

XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xExpress = xPath.compile("/orgs/org/item[text()='b']");
NodeList nodeList = (NodeList) xExpress.evaluate(doc.getDocumentElement(), XPathConstants.NODESET);
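To try the first two steps without a file on disk, here is a small self-contained sketch that parses the sample from the question out of a string and runs the same XPath expression. The class name FindOffendingNodes and its find method are made up for illustration; the JAXP calls are the standard ones used above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: parses an XML string and returns the nodes
// selected by an XPath expression.
public class FindOffendingNodes {

    public static NodeList find(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // An absolute path such as /orgs/org/item is resolved from the
        // document root, whichever node is passed as the context.
        return (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orgs>"
                + "<org name=\"Test1\"><item>a</item><item>b</item></org>"
                + "<org name=\"Test2\"><item>c</item><item>b</item><item>e</item></org>"
                + "</orgs>";
        NodeList matches = find(xml, "/orgs/org/item[text()='b']");
        for (int index = 0; index < matches.getLength(); index++) {
            // Print which <org> each match belongs to.
            System.out.println(matches.item(index).getParentNode()
                    .getAttributes().getNamedItem("name").getNodeValue());
        }
    }
}
```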

Step 3: Remove offending nodes

Okay, this is not as simple as it should be. Removing a node can leave a blank space in the document, which would be "nice" to clean up. The following is a simple library method I adapted from some code I found on the internet. It removes the specified Node and also removes the whitespace-only text nodes left behind in its parent. Text nodes that contain real content are kept, so mixed content in the parent is not lost.

public static void removeNode(Node node) {
    if (node == null) {
        return;
    }
    // Detach the node's own children first
    while (node.hasChildNodes()) {
        removeNode(node.getFirstChild());
    }

    Node parent = node.getParentNode();
    if (parent == null) {
        return;
    }
    parent.removeChild(node);

    // getChildNodes() is a live list, so collect the whitespace-only
    // text nodes first and remove them in a second pass
    NodeList childNodes = parent.getChildNodes();
    List<Node> blankTextNodes = new ArrayList<Node>(childNodes.getLength());
    for (int index = 0; index < childNodes.getLength(); index++) {
        Node childNode = childNodes.item(index);
        if (childNode.getNodeType() == Node.TEXT_NODE
                && childNode.getTextContent().trim().isEmpty()) {
            blankTextNodes.add(childNode);
        }
    }
    for (Node blankTextNode : blankTextNodes) {
        parent.removeChild(blankTextNode);
    }
}

Loop over the offending nodes...

for (int index = 0; index < nodeList.getLength(); index++) {
    Node node = nodeList.item(index);
    removeNode(node);
}
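The removeNode helper above cleans up text nodes across the whole parent. If you would rather leave the rest of the parent's formatting alone, a narrower variant is to remove only the whitespace-only text node that sits immediately before the removed element (its indentation). The sketch below shows that idea; the class RemoveWithIndent and its method names are hypothetical and not part of the original answer.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: removes matched elements together with the
// whitespace-only text node directly in front of each one.
public class RemoveWithIndent {

    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void removeWithLeadingWhitespace(Node node) {
        Node parent = node.getParentNode();
        if (parent == null) {
            return; // already detached
        }
        Node previous = node.getPreviousSibling();
        if (previous != null
                && previous.getNodeType() == Node.TEXT_NODE
                && previous.getTextContent().trim().isEmpty()) {
            parent.removeChild(previous); // drop the indentation before the element
        }
        parent.removeChild(node);
    }

    // Removes every node selected by the expression and returns how many were removed.
    public static int removeMatching(Document doc, String expression) throws Exception {
        NodeList matches = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        for (int index = 0; index < matches.getLength(); index++) {
            removeWithLeadingWhitespace(matches.item(index));
        }
        return matches.getLength();
    }
}
```

A call such as RemoveWithIndent.removeMatching(doc, "/orgs/org/item[text()='b']") would then replace the loop above.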

Step 4: Save the result

Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.setOutputProperty(OutputKeys.METHOD, "xml");
tf.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

DOMSource domSource = new DOMSource(doc);
StreamResult sr = new StreamResult(System.out);
tf.transform(domSource, sr);

Which outputs something like...

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<orgs>
  <org name="Test1">
    <item>a</item>
  </org>
  <org name="Test2">
    <item>c</item>
    <item>e</item>
  </org>
</orgs>
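One caveat, offered as an observation rather than something the answer states: on some JDK versions the indenting Transformer keeps the whitespace-only text nodes already present in the DOM and adds its own indentation on top, which can produce stray blank lines in the output. A common workaround is to strip those nodes before serializing. The sketch below assumes that approach; the class name IndentedOutput and its methods are hypothetical, and the indent-amount property is the same implementation-specific key used in Step 4.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: strips whitespace-only text nodes, then
// serializes the document with indentation.
public class IndentedOutput {

    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Removes every text node that consists of whitespace only.
    public static void stripWhitespaceNodes(Document doc) throws Exception {
        NodeList blanks = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//text()[normalize-space(.)='']", doc, XPathConstants.NODESET);
        for (int index = 0; index < blanks.getLength(); index++) {
            Node blank = blanks.item(index);
            blank.getParentNode().removeChild(blank);
        }
    }

    public static String toIndentedString(Document doc) throws Exception {
        stripWhitespaceNodes(doc);
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.INDENT, "yes");
        tf.setOutputProperty(OutputKeys.METHOD, "xml");
        tf.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

To write to a file instead of a string, pass new StreamResult(new File("...")) to transform, as with any other Transformer output.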

Tags: java, xml, xml-parsing

Sai prateek
