
How to remove unwanted tags from XML

I have a huge XML file and I want to remove unwanted tags from it. For example:

<orgs>
    <org name="Test1">
        <item>a</item>
        <item>b</item>
    </org>
    <org name="Test2">
        <item>c</item>
        <item>b</item>
        <item>e</item>
    </org>
</orgs>

I want to remove every <item>b</item> element from this XML. Which parser API should I use, given that the XML is very large, and how can I achieve this?

Sai prateek asked Jan 10 '23 02:01


1 Answer

One approach is to use the Document Object Model (DOM). The drawback is that, as the name suggests, it needs to load the entire document into memory, and Java's DOM API is quite memory hungry. The benefit is that you can take advantage of XPath to find the offending nodes.

Take a closer look at the Java API for XML Processing (JAXP) for more details and other APIs.
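Since the question stresses that the file is very large, one of those other JAXP APIs, StAX, may be a better fit because it streams the document instead of building a tree. The following is only a minimal sketch, not a drop-in solution: the class name StaxItemFilter and the filter method are made up for illustration, and it assumes every <item> holds plain text with no nested elements, as in the sample.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, not part of the original answer
class StaxItemFilter {

    // Copies the XML from 'in' to 'out', skipping every <item> whose text equals 'unwanted'.
    // Assumption: <item> contains only text (no child elements).
    static void filter(Reader in, Writer out, String unwanted) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                boolean itemStart = event.isStartElement()
                        && "item".equals(event.asStartElement().getName().getLocalPart());
                if (!itemStart) {
                    writer.add(event); // everything else is copied through unchanged
                    continue;
                }
                // Buffer the events of this one <item> until its end tag
                List<XMLEvent> buffered = new ArrayList<>();
                StringBuilder text = new StringBuilder();
                buffered.add(event);
                while (reader.hasNext()) {
                    XMLEvent inner = reader.nextEvent();
                    buffered.add(inner);
                    if (inner.isCharacters()) {
                        text.append(inner.asCharacters().getData());
                    }
                    if (inner.isEndElement()) {
                        break;
                    }
                }
                // Write the buffered element only if its text is not the unwanted value
                if (!unwanted.equals(text.toString())) {
                    for (XMLEvent kept : buffered) {
                        writer.add(kept);
                    }
                }
            }
            writer.flush();
        } finally {
            writer.close();
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<orgs><org name=\"Test1\"><item>a</item><item>b</item></org>"
                + "<org name=\"Test2\"><item>c</item><item>b</item><item>e</item></org></orgs>";
        StringWriter out = new StringWriter();
        filter(new StringReader(xml), out, "b");
        System.out.println(out);
    }
}
```

Only one <item> is held in memory at a time, so memory use should stay roughly constant regardless of file size. For a real file you would pass a buffered file reader and writer instead of the strings used here. Note that this sketch leaves the whitespace that surrounded a removed element in place.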

Step 1: Load the document

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new File("..."));

Step 2: Find the offending nodes

XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xExpress = xPath.compile("/orgs/org/item[text()='b']");
NodeList nodeList = (NodeList) xExpress.evaluate(doc.getDocumentElement(), XPathConstants.NODESET);
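
If you want to check what the expression selects before deleting anything, a small self-contained sketch along these lines may help. The class name XPathSelectDemo, the SAMPLE constant and the select helper are hypothetical names used only for illustration; the XML is the sample from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original answer
class XPathSelectDemo {

    // The sample document from the question
    static final String SAMPLE =
            "<orgs>\n"
            + "  <org name=\"Test1\"><item>a</item><item>b</item></org>\n"
            + "  <org name=\"Test2\"><item>c</item><item>b</item><item>e</item></org>\n"
            + "</orgs>";

    // Parses the XML string and returns the nodes selected by the XPath expression
    static NodeList select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        return (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        NodeList hits = select(SAMPLE, "/orgs/org/item[text()='b']");
        System.out.println("Matches: " + hits.getLength());
        for (int index = 0; index < hits.getLength(); index++) {
            // Show which <org> each matching <item> belongs to
            Element org = (Element) hits.item(index).getParentNode();
            System.out.println("  in org " + org.getAttribute("name"));
        }
    }
}
```

For the sample this should report two matches, one in each <org>. The predicate text()='b' is an exact comparison, so an item containing " b " with surrounding spaces would not be selected.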

Step 3: Remove offending nodes

Okay, this is not as simple as it should be. Removing a node can leave blank space (a whitespace-only text node) in the document, which would be "nice" to clean up. The following is a simple library method I adapted from some code I found on the internet. It removes the specified Node and then removes any whitespace-only text nodes left in its parent, so real text content in the parent is kept.

// Needs java.util.ArrayList, java.util.List, org.w3c.dom.Node and org.w3c.dom.NodeList
public static void removeNode(Node node) {
    if (node == null) {
        return;
    }
    Node parent = node.getParentNode();
    if (parent == null) {
        return;
    }

    // Removing the node also detaches its whole subtree
    parent.removeChild(node);

    // getChildNodes() is live, so collect the blank text nodes first and remove them afterwards
    NodeList childNodes = parent.getChildNodes();
    List<Node> blankTextNodes = new ArrayList<>(childNodes.getLength());
    for (int index = 0; index < childNodes.getLength(); index++) {
        Node childNode = childNodes.item(index);
        if (childNode.getNodeType() == Node.TEXT_NODE
                && childNode.getNodeValue().trim().isEmpty()) {
            blankTextNodes.add(childNode);
        }
    }
    for (Node blankTextNode : blankTextNodes) {
        parent.removeChild(blankTextNode);
    }
}

Loop over the offending nodes...

for (int index = 0; index < nodeList.getLength(); index++) {
    Node node = nodeList.item(index);
    removeNode(node);
}
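
One thing to watch if you swap the XPath query for getElementsByTagName: that method returns a live NodeList, which shrinks as you remove nodes, so a forward loop like the one above can skip elements. Iterating backwards avoids this. A minimal sketch, where the class name LiveNodeListDemo and both helper methods are hypothetical names used only for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original answer
class LiveNodeListDemo {

    // Parses an XML string into a DOM document
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Removes every <item> whose text equals 'unwanted' and returns how many were removed
    static int removeItems(Document doc, String unwanted) {
        NodeList items = doc.getElementsByTagName("item"); // live list
        int removed = 0;
        // Walk backwards so removals never shift the indexes still to be visited
        for (int index = items.getLength() - 1; index >= 0; index--) {
            Node item = items.item(index);
            if (unwanted.equals(item.getTextContent())) {
                item.getParentNode().removeChild(item);
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<orgs><org name=\"Test1\"><item>a</item><item>b</item></org>"
                + "<org name=\"Test2\"><item>c</item><item>b</item><item>e</item></org></orgs>");
        System.out.println("Removed: " + removeItems(doc, "b"));
        System.out.println("Remaining: " + doc.getElementsByTagName("item").getLength());
    }
}
```

This sketch skips the whitespace clean-up that removeNode performs, so combine the two if the formatting of the output matters.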

Step 4: Save the result

Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.setOutputProperty(OutputKeys.METHOD, "xml");
tf.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

DOMSource domSource = new DOMSource(doc);
StreamResult sr = new StreamResult(System.out);
tf.transform(domSource, sr);

Which outputs something like...

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<orgs>
  <org name="Test1">
    <item>a</item>
  </org>
  <org name="Test2">
    <item>c</item>
    <item>e</item>
  </org>
</orgs>

MadProgrammer answered Jan 19 '23 01:01