
Parsing xml file contents without knowing xml file structure

Tags: java, xml

I've been learning to parse files with Java, and for the most part it's going well. However, I'm at a loss as to how to parse an XML file whose structure is not known in advance. There are lots of examples of how to do so if you know the structure (getElementsByTagName seems to be the way to go), but no dynamic options, at least not that I've found.

So, the tl;dr version of this question: how can I parse an XML file when I cannot rely on knowing its structure?

canadiancreed asked Feb 23 '14 01:02


1 Answer

The parsing part is easy; as helderdarocha stated in the comments, the parser only requires well-formed XML and does not care about the structure. You can use Java's standard DocumentBuilder to obtain a Document:

InputStream in = new FileInputStream(...);
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);

(If you're parsing multiple documents, you can keep reusing the same DocumentBuilder.)
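As a rough illustration of that reuse, here is a minimal sketch, assuming the documents arrive as in-memory strings. The class name ReusedBuilderDemo and the method rootNames are hypothetical and exist only for this example; only the javax.xml.parsers and org.w3c.dom calls are standard API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class; the names are illustrative only.
class ReusedBuilderDemo {

    // Parses every XML string with one shared DocumentBuilder
    // and collects the tag name of each document's root element.
    static List<String> rootNames(List<String> xmlDocs) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        List<String> names = new ArrayList<>();
        for (String xml : xmlDocs) {
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            names.add(doc.getDocumentElement().getTagName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Two unrelated documents, parsed one after the other by the same builder.
        System.out.println(rootNames(Arrays.asList("<objects/>", "<people><person/></people>")));
    }
}
```

Run as written, this should print [objects, people]. The factory and builder are created once, and only parse() is called per document.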

Then you can start with the root document element and use familiar DOM methods from there on out:

Element root = doc.getDocumentElement(); // perform DOM operations starting here.

As for processing it, that really depends on what you want to do with it, but you can use Node methods such as getFirstChild() and getNextSibling() to iterate through the children and process each one as you see fit based on structure, tags, and attributes.

Consider the following example:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;


public class XML {

    public static void main (String[] args) throws Exception {

        String xml = "<objects><circle color='red'/><circle color='green'/><rectangle>hello</rectangle><glumble/></objects>";

        // parse (the string stands in for a file; a FileInputStream works the same way)
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);

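        // process: walk the root's direct children; non-element nodes
        // (text, comments) are skipped by the instanceof check below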
        Node objects = doc.getDocumentElement();
        for (Node object = objects.getFirstChild(); object != null; object = object.getNextSibling()) {
            if (object instanceof Element) {
                Element e = (Element)object;
                if (e.getTagName().equalsIgnoreCase("circle")) {
                    String color = e.getAttribute("color");
                    System.out.println("It's a " + color + " circle!");
                } else if (e.getTagName().equalsIgnoreCase("rectangle")) {
                    String text = e.getTextContent();
                    System.out.println("It's a rectangle that says \"" + text + "\".");
                } else {
                    System.out.println("I don't know what a " + e.getTagName() + " is for.");
                }
            }
        }

    }

}

The input XML document (hard-coded in the example, shown here indented for readability) is:

<objects>
    <circle color='red'/>
    <circle color='green'/>
    <rectangle>hello</rectangle>
    <glumble/>
</objects>

The output is:

It's a red circle!
It's a green circle!
It's a rectangle that says "hello".
I don't know what a glumble is for.
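
When no tag names are known at all, the same getFirstChild()/getNextSibling() iteration can be applied recursively to discover the structure. The following is a minimal sketch of that idea, not a definitive implementation. The class name StructureDump, the methods describe and walk, and the output format are all assumptions made for this example. It only looks at plain text nodes, so CDATA sections, comments, and namespaces are ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical class; the names and the output format are illustrative only.
class StructureDump {

    // Parses the XML string and returns an indented outline of its elements.
    static String describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    // Writes one line for the element, then recurses into its child elements.
    private static void walk(Element e, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(e.getTagName());

        // Attributes are discovered at runtime; no names are assumed.
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.append(" @").append(a.getNodeName()).append('=').append(a.getNodeValue());
        }

        // Collect the text that belongs directly to this element (text nodes only).
        StringBuilder text = new StringBuilder();
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                text.append(c.getNodeValue());
            }
        }
        String trimmed = text.toString().trim();
        if (!trimmed.isEmpty()) {
            out.append(" \"").append(trimmed).append('"');
        }
        out.append('\n');

        // Recurse into child elements, one level deeper.
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                walk((Element) c, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.print(describe(
                "<objects><circle color='red'/><circle color='green'/>"
                + "<rectangle>hello</rectangle><glumble/></objects>"));
    }
}
```

For the example document above, this should print one line per element: objects at the top level, then the two circles with their color attribute, the rectangle with its text, and glumble, each indented by two spaces. From an outline like this you can decide which tags and attributes are worth handling explicitly.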
Jason C answered Nov 10 '22 00:11
