How to efficiently parse concatenated XML documents from a file

Question

I have a file that consists of concatenated valid XML documents. I'd like to separate individual XML documents efficiently.

The contents of the concatenated file look like this, so the file as a whole is not a valid XML document:

<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>

Each individual XML document is around 1-4 KB, but there are potentially a few hundred of them. All of the XML documents conform to the same XML Schema.

Any suggestions or tools? I am working in the Java environment.

Edit: I am not sure whether the XML declaration will be present in the documents.

Edit: Let's assume that the encoding for all the xml docs is UTF-8.

Wim ten Brink · Accepted Answer

Don't split! Add one big tag around it! Then it becomes one XML file again:

<BIGTAG>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
</BIGTAG>

Now the XPath /BIGTAG/someData gives you all the original XML roots.
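As a rough illustration of the wrapping idea, here is a minimal Java sketch using the JDK's built-in DOM and XPath APIs. The class name WrapAndQuery and the method rootTexts are made up for this example, and it assumes that the whole file fits in memory as a String and that any XML declarations have already been removed. Element names are case-sensitive, so the XPath uses someData exactly as it appears in the data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class WrapAndQuery {

    // Wraps the concatenated text in a synthetic BIGTAG root, parses it as one
    // document, and returns the text content of every /BIGTAG/someData element.
    // Assumption: the input contains no XML declarations.
    public static List<String> rootTexts(String concatenated) throws Exception {
        String wrapped = "<BIGTAG>" + concatenated + "</BIGTAG>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));

        NodeList roots = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/BIGTAG/someData", doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < roots.getLength(); i++) {
            texts.add(roots.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String input = "<someData>a</someData>\n"
                     + "<someData>b</someData>\n"
                     + "<someData>c</someData>";
        System.out.println(rootTexts(input)); // prints [a, b, c]
    }
}
```

With a few hundred documents of 1-4 KB each, holding everything in one DOM tree should be cheap; for much larger inputs a streaming parser would probably be a better fit.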

If processing instructions are in the way, you can always use a RegEx to remove them. Note that an XML declaration is only allowed at the very start of a document, so the ones that end up inside BIGTAG have to go before a parser will accept the file. It's easier to just remove all processing instructions than to use a RegEx to find all root nodes.

If the declared encoding differs between documents, remember this: the file as a whole was written in a single encoding, so all the XML documents it contains use that same encoding, no matter what each header is telling you. If the big file is encoded as UTF-16, it doesn't matter that the XML declarations say the XML itself is UTF-8. It won't be UTF-8, since the whole file is UTF-16. The encoding stated in those declarations is therefore invalid.

By merging them into one file, you've altered the encoding...

By RegEx, I mean regular expressions. You just have to remove all text between a <? and the next ?>, which should not be too difficult with a regular expression and is only slightly more complicated with other string manipulation techniques.
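A minimal sketch of that removal in Java, using java.util.regex. The class name StripDeclarations and the method strip are hypothetical names for this example. The pattern is reluctant and uses DOTALL so that each match ends at the nearest ?> even if a declaration spans several lines. It assumes the sequence <? never occurs inside comments or CDATA sections in the data; if it can, this naive pattern would remove too much.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class StripDeclarations {

    // "<?", then as few characters as possible (newlines included), then "?>".
    private static final Pattern PI = Pattern.compile("<\\?.*?\\?>", Pattern.DOTALL);

    // Removes every XML declaration and processing instruction from the text.
    public static String strip(String xml) {
        return PI.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String input =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>...</someData>\n"
              + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>...</someData>\n";
        // The declarations are gone; the someData elements are left untouched.
        System.out.print(strip(input));
    }
}
```

The result can then be wrapped in a single root element and handed to an ordinary XML parser.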

Tags: java, parsing, xml

Asked by Juha Syrjälä
