I have a huge XML file and I want to remove unwanted tags from it. For example:
<orgs>
  <org name="Test1">
    <item>a</item>
    <item>b</item>
  </org>
  <org name="Test2">
    <item>c</item>
    <item>b</item>
    <item>e</item>
  </org>
</orgs>
I want to remove all the <item>b</item> elements from this XML. Which parser API should I use, given that the XML is very large, and how can I achieve this?
One approach is to use the Document Object Model (DOM). The drawback is that DOM needs to load the entire document into memory, and Java's DOM API is quite memory hungry. The benefit is that you can take advantage of XPath to find the offending nodes.
Take a closer look at the Java API for XML Processing (JAXP) for more details and other APIs.
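If the file really is too large to hold as a DOM, a streaming API such as StAX (also part of JAXP) may be worth a look. The following is only a minimal sketch under some assumptions: the class name StreamingItemFilter and the filter method are made up for illustration, and it assumes an <item> never contains another element named item. It buffers one <item> at a time and only writes it out if its text differs from the unwanted value.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library
public class StreamingItemFilter {

    // Copies the XML from in to out, dropping every <item> whose text equals unwanted.
    // Assumption: <item> elements are not nested inside other <item> elements.
    public static void filter(Reader in, Writer out, String unwanted) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        List<XMLEvent> pending = null;            // non-null while inside an <item>
        StringBuilder text = new StringBuilder(); // text seen inside the current <item>

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (pending == null) {
                if (event.isStartElement()
                        && "item".equals(event.asStartElement().getName().getLocalPart())) {
                    // Start buffering instead of writing straight through
                    pending = new ArrayList<>();
                    pending.add(event);
                    text.setLength(0);
                } else {
                    writer.add(event);
                }
            } else {
                pending.add(event);
                if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                } else if (event.isEndElement()
                        && "item".equals(event.asEndElement().getName().getLocalPart())) {
                    // Only keep the buffered element if it is not the unwanted one
                    if (!unwanted.contentEquals(text)) {
                        for (XMLEvent buffered : pending) {
                            writer.add(buffered);
                        }
                    }
                    pending = null;
                }
            }
        }
        writer.close();
        reader.close();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<orgs><org name=\"Test1\"><item>a</item><item>b</item></org></orgs>";
        StringWriter out = new StringWriter();
        filter(new StringReader(xml), out, "b");
        System.out.println(out);
    }
}
```

This sketch does not clean up the white space around a dropped element, so the output can contain blank lines. For real files you would probably use the InputStream/OutputStream overloads of the factory methods rather than a Reader/Writer, so the parser can detect the encoding itself.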
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new File("..."));
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xExpress = xPath.compile("/orgs/org/item[text()='b']");
NodeList nodeList = (NodeList) xExpress.evaluate(doc.getDocumentElement(), XPathConstants.NODESET);
Okay, this is not as simple as it should be. Removing a node can leave blank space in the document, which would be "nice" to clean up. The following is a simple library method, adapted from some code I found on the internet, which removes the specified Node and also removes the leftover white space text nodes around it.
public static void removeNode(Node node) {
    if (node != null) {
        // Remove the node's own children first
        while (node.hasChildNodes()) {
            removeNode(node.getFirstChild());
        }
        Node parent = node.getParentNode();
        if (parent != null) {
            parent.removeChild(node);
            // Collect the parent's whitespace-only text nodes before removing them,
            // because the child NodeList is live and changes as nodes are removed.
            // Text nodes with real content are left alone.
            NodeList childNodes = parent.getChildNodes();
            if (childNodes.getLength() > 0) {
                List<Node> lstTextNodes = new ArrayList<Node>(childNodes.getLength());
                for (int index = 0; index < childNodes.getLength(); index++) {
                    Node childNode = childNodes.item(index);
                    if (childNode.getNodeType() == Node.TEXT_NODE
                            && childNode.getNodeValue().trim().isEmpty()) {
                        lstTextNodes.add(childNode);
                    }
                }
                for (Node txtNode : lstTextNodes) {
                    removeNode(txtNode);
                }
            }
        }
    }
}
Loop over the offending nodes and remove each one:
for (int index = 0; index < nodeList.getLength(); index++) {
    Node node = nodeList.item(index);
    removeNode(node);
}
Then write the modified document back out, here to System.out:
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.setOutputProperty(OutputKeys.METHOD, "xml");
tf.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource domSource = new DOMSource(doc);
StreamResult sr = new StreamResult(System.out);
tf.transform(domSource, sr);
Which outputs something like...
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<orgs>
  <org name="Test1">
    <item>a</item>
  </org>
  <org name="Test2">
    <item>c</item>
    <item>e</item>
  </org>
</orgs>
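For reference, the DOM steps above can be pulled together into one self-contained program, roughly like the sketch below. The class name DomItemRemover and the removeMatching method are hypothetical, and it parses from a String only to keep the example short. Instead of the removeNode helper, it strips whitespace-only text nodes with a second XPath query before re-indenting; that is a swapped-in simplification which assumes the document has no meaningful whitespace-only text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library
public class DomItemRemover {

    // Removes every node selected by the XPath expression and returns the re-serialized XML.
    public static String removeMatching(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();

        // Remove the offending nodes
        removeAll((NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET));
        // Remove whitespace-only text nodes so the transformer can re-indent cleanly
        removeAll((NodeList) xPath.evaluate("//text()[normalize-space(.)='']", doc,
                XPathConstants.NODESET));

        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Detaches each node in the list from its parent, if it has one
    private static void removeAll(NodeList nodes) {
        for (int index = 0; index < nodes.getLength(); index++) {
            Node node = nodes.item(index);
            Node parent = node.getParentNode();
            if (parent != null) {
                parent.removeChild(node);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orgs><org name=\"Test1\"><item>a</item><item>b</item></org>"
                + "<org name=\"Test2\"><item>c</item><item>b</item><item>e</item></org></orgs>";
        System.out.println(removeMatching(xml, "/orgs/org/item[text()='b']"));
    }
}
```

The exact indentation of the result depends on the transformer implementation and the Java version, so treat the printed layout as approximate.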