I would like to use a language that I am familiar with - Java, C#, Ruby, PHP, C/C++, although examples in any language or pseudocode are more than welcome.
What is the best way of splitting a large XML document into smaller sections that are still valid XML? For my purposes, I need to split them into roughly thirds or fourths, but for the sake of providing examples, splitting them into n components would be good.
Split a large XML file in Windows (Method #1): First, click the “Add XML File(s)” button to provide the input path of the file to split, or drag and drop your files. Then select the tag by which the new files will be split. Next, choose how many occurrences of that tag to write before starting a new file.
An XML (eXtensible Markup Language) document contains declarations, elements, text, and attributes.
Java provides several ways to parse an XML file. Two of the parsers that ship with Java are the DOM parser and the SAX parser.
To parse XML documents in COBOL, use the XML PARSE statement, specifying the XML document to be parsed and the processing procedure that handles the XML events raised during parsing.
Parsing XML documents with DOM doesn't scale: the parser builds the entire document tree in memory, so memory use grows with the size of the file.
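To make the difference between the two parsers concrete, here is a minimal sketch that counts elements with each of them. The class and method names (ParserComparison, countWithDom, countWithSax) are made up for illustration, and the input is taken as a String only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
final class ParserComparison {

    // DOM: builds the whole tree in memory first, then queries it.
    static int countWithDom(String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName(tag).getLength();
    }

    // SAX: pushes events to a handler; nothing is retained unless the handler keeps it.
    static int countWithSax(String xml, String tag) throws Exception {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Document><Piece>Some text</Piece><Piece>Some other text</Piece></Document>";
        System.out.println(countWithDom(xml, "Piece"));
        System.out.println(countWithSax(xml, "Piece"));
    }
}
```

Both methods return the same count; the SAX version never holds more than the current event, which is why event-based parsing is usually preferred for very large files.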
This Groovy script uses StAX (Streaming API for XML) to split an XML document between its top-level elements (those that share the QName of the root element's first child). It is fast, handles arbitrarily large documents, and is useful when you want to split a large batch file into smaller pieces.
Requires Groovy on Java 6 (which bundles StAX), or a StAX API and an implementation such as Woodstox on the CLASSPATH.
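import javax.xml.stream.*

pieces = 5
input = "input.xml"
output = "output_%04d.xml"
eventFactory = XMLEventFactory.newInstance()
fileNumber = elementCount = 0

// Opens the input and advances past the start tag of the root's first child.
// Sets the script-level variables start, root and firstChild as a side effect.
def createEventReader() {
    reader = XMLInputFactory.newInstance().createXMLEventReader(new FileInputStream(input))
    start = reader.next()           // StartDocument event
    root = reader.nextTag()         // start tag of the root element
    firstChild = reader.nextTag()   // start tag of the first top-level element
    return reader
}

// Opens the next output file and writes the document start and the root start tag into it.
def createNextEventWriter() {
    println "Writing to '${filename = String.format(output, ++fileNumber)}'"
    writer = XMLOutputFactory.newInstance().createXMLEventWriter(new FileOutputStream(filename), start.characterEncodingScheme)
    writer.add(start)
    writer.add(root)
    return writer
}

// First pass: count the matching start tags. The reader is already past the
// first child, so the count excludes it; hence the + 1 in the message.
elements = createEventReader().findAll { it.startElement && it.name == firstChild.name }.size()
println "Splitting ${elements + 1} <${firstChild.name.localPart}> elements into ${pieces} pieces"
chunkSize = elements / pieces

// Second pass: copy events, switching to a new file whenever a chunk is full.
writer = createNextEventWriter()
writer.add(firstChild)
createEventReader().each {
    if (it.startElement && it.name == firstChild.name) {
        if (++elementCount > chunkSize) {
            // Finish the current file, then continue in the next one.
            writer.add(eventFactory.createEndDocument())
            writer.flush()
            writer = createNextEventWriter()
            elementCount = 0
        }
    }
    writer.add(it)
}
writer.flush()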
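Since the question mentions Java, the same two-pass idea can be sketched in plain Java with the StAX event API. This is an illustration rather than a port of the script above: the names XmlSplitter, countChildren, split and finish are hypothetical, the output file pattern is borrowed from the script, every direct child of the root is counted regardless of its name, whitespace and comments between those children are dropped, and the output is always written as UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class; a sketch, not a hardened tool.
final class XmlSplitter {

    // First pass: count the direct children of the root element.
    static int countChildren(Path input) throws IOException, XMLStreamException {
        int depth = 0;
        int count = 0;
        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement() && ++depth == 2) {
                    count++;
                } else if (event.isEndElement()) {
                    depth--;
                }
            }
        }
        return count;
    }

    // Second pass: copy the children into at most `pieces` files inside outDir.
    static List<Path> split(Path input, Path outDir, int pieces) throws IOException, XMLStreamException {
        if (pieces < 1) {
            throw new IllegalArgumentException("pieces must be at least 1");
        }
        int perFile = Math.max(1, (countChildren(input) + pieces - 1) / pieces); // ceiling division
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLOutputFactory outputs = XMLOutputFactory.newInstance();
        List<Path> written = new ArrayList<>();
        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            StartElement root = null;
            XMLEventWriter writer = null;
            OutputStream out = null;
            int depth = 0;
            int inCurrentFile = 0;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                boolean startsChild = false;
                if (event.isStartElement()) {
                    depth++;
                    if (depth == 1) root = event.asStartElement(); // re-emitted in every output file
                    startsChild = depth == 2;
                }
                if (startsChild && inCurrentFile == perFile) {     // current file is full
                    finish(writer, out, events, root);
                    writer = null;
                }
                if (startsChild && writer == null) {               // open the next output file
                    Path file = outDir.resolve(String.format("output_%04d.xml", written.size() + 1));
                    out = Files.newOutputStream(file);
                    writer = outputs.createXMLEventWriter(out, "UTF-8");
                    writer.add(events.createStartDocument("UTF-8", "1.0"));
                    writer.add(root);
                    written.add(file);
                    inCurrentFile = 0;
                }
                if (startsChild) inCurrentFile++;
                if (depth >= 2) writer.add(event);  // anything inside a child, including its own tags
                if (event.isEndElement()) depth--;
            }
            if (writer != null) finish(writer, out, events, root);
        }
        return written;
    }

    // Closes the root element and the document, then releases the file.
    private static void finish(XMLEventWriter writer, OutputStream out, XMLEventFactory events, StartElement root)
            throws IOException, XMLStreamException {
        writer.add(events.createEndElement(root.getName().getPrefix(),
                root.getName().getNamespaceURI(), root.getName().getLocalPart()));
        writer.add(events.createEndDocument());
        writer.close();
        out.close();
    }
}
```

A call such as XmlSplitter.split(Paths.get("input.xml"), Paths.get("."), 4) would write output_0001.xml and onward into the current directory. Each file repeats the original root start tag, including its attributes and namespace declarations, so every piece is a well-formed document on its own.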
You can always extract the top-level elements (whether this is the granularity you want is up to you). In C#, you could use the XmlDocument class, which loads the whole document into memory. For example, if your XML file looked something like this:
<Document>
  <Piece>
    Some text
  </Piece>
  <Piece>
    Some other text
  </Piece>
</Document>
then you'd use code like this to extract all of the Pieces:
using System.Xml;

XmlDocument doc = new XmlDocument();
doc.Load("<path to xml file>");   // reads the entire file into memory
XmlNodeList nl = doc.GetElementsByTagName("Piece");
foreach (XmlNode n in nl)
{
    // Do something with each Piece node
}
Once you've got the nodes, you can do something with them in your code, or you can transfer the entire text of the node to its own XML document and act on that as if it were an independent piece of XML (including saving it back to disk, etc).
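The step of moving each node into its own document can be sketched as follows, here in Java's DOM API rather than C#. The names PieceExtractor and extract are hypothetical, and the input is passed as a String only to keep the example self-contained; like the XmlDocument approach, this loads the whole document into memory.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of any library.
final class PieceExtractor {

    // Returns one serialized stand-alone document per element named `tag`.
    static List<String> extract(String xml, String tag) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // An identity transformer is used here purely as a serializer.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> pieces = new ArrayList<>();
        NodeList nodes = source.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            // A node belongs to the document that created it, so it is deep-copied
            // into a fresh document, where it becomes the root element.
            Document standalone = builder.newDocument();
            standalone.appendChild(standalone.importNode(nodes.item(i), true));

            StringWriter text = new StringWriter();
            serializer.transform(new DOMSource(standalone), new StreamResult(text));
            pieces.add(text.toString());
        }
        return pieces;
    }
}
```

Each returned string can be saved to disk or parsed again as an independent document. To write files directly, the StreamResult could wrap a File instead of a StringWriter.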