How to efficiently parse concatenated XML documents from a file

Tags:

java

parsing

xml

I have a file that consists of concatenated valid XML documents. I'd like to separate individual XML documents efficiently.

The contents of the concatenated file look like this, so the file as a whole is not itself a valid XML document.

<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>

Each individual XML document is around 1-4 KB, but there are potentially a few hundred of them. All the XML documents conform to the same XML Schema.

Any suggestions or tools? I am working in the Java environment.

Edit: I am not sure whether the XML declaration will be present in the documents or not.

Edit: Let's assume that the encoding of all the XML documents is UTF-8.

Juha Syrjälä asked Aug 24 '09 12:08


1 Answer

Don't split! Wrap the whole thing in one big tag, and it becomes a single XML document again:

<BIGTAG>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
</BIGTAG>

Now the XPath expression /BIGTAG/someData gives you all the original root elements. Element names are case-sensitive, so the name has to match someData exactly.


The XML declarations (the <?xml ...?> lines, which look like processing instructions) will be in the way, because a declaration is only allowed at the very start of a document. You can always use a RegEx to remove them. It's easier to just remove all the declarations than to use a RegEx to find all the root nodes.

If the declared encoding differs between documents, remember this: the file as a whole must have been written in some single encoding, so all the XML documents it contains use that same encoding, no matter what each header tells you. If the big file is encoded as UTF-16, it doesn't matter that the XML declarations say the XML is UTF-8. It won't be UTF-8, since the whole file is UTF-16. The encoding named in those declarations is therefore invalid.
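To illustrate the encoding point, here is a minimal sketch. It assumes the text has already been decoded into a Java String; the class and method names are made up for this example. When a JAXP parser is handed a character stream instead of bytes, it reads the characters directly and does not consult the encoding named in the declaration, so a "wrong" declared encoding does no harm.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original answer.
public class DeclaredEncodingDemo {

    /** Parses already-decoded characters and returns the root element's text. */
    public static String rootText(String xml) {
        try {
            // A character stream bypasses byte decoding, so the encoding
            // attribute in the XML declaration is not used to read the input.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // The declaration claims ISO-8859-1, but the content is a Java String.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<someData>\u00e4</someData>";
        System.out.println(rootText(xml).equals("\u00e4"));
    }
}
```

In practice this means decoding the big file once with its real encoding (UTF-8 in the question) and feeding the resulting characters to the parser.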

By merging them into one file, you've altered the encoding...


By RegEx, I mean regular expressions. You just have to remove all the text between a <? and the following ?>, which should not be too difficult with a regular expression and is only slightly more complicated if you use other string manipulation techniques.
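Putting the pieces together, a rough sketch of the whole approach in Java might look like the following. The class name, method name, and the exact regular expression are assumptions for this example. Note that the pattern here is narrower than "everything between <? and ?>": it only removes XML declarations, so real processing instructions such as <?xml-stylesheet ...?> are left alone. It also assumes that the file fits in memory (fine for a few hundred documents of 1-4 KB) and that no declaration-like text appears inside comments or CDATA sections.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original answer.
public class ConcatenatedXml {

    // Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>.
    private static final Pattern XML_DECLARATION =
            Pattern.compile("<\\?xml\\s[^?]*\\?>");

    /** Returns the root element of every document in the concatenated text. */
    public static List<Element> split(String concatenated) {
        // Step 1: drop the declarations, which are only legal at the very start.
        String body = XML_DECLARATION.matcher(concatenated).replaceAll("");
        // Step 2: wrap everything in a single artificial root element.
        String wrapped = "<BIGTAG>" + body + "</BIGTAG>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            // Step 3: select every element directly below the wrapper.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/BIGTAG/*", doc, XPathConstants.NODESET);
            List<Element> roots = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                roots.add((Element) nodes.item(i));
            }
            return roots;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Content is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String input =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>...</someData>\n"
              + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>...</someData>\n";
        for (Element root : split(input)) {
            System.out.println(root.getTagName());
        }
    }
}
```

Because the declarations are optional for the pattern, this works the same whether or not each document carries one. The XPath /BIGTAG/* is used instead of /BIGTAG/someData so the sketch does not depend on the root element's name.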
Wim ten Brink answered Sep 19 '22 22:09