 

How to delete xml elements/nodes from xml file larger than available RAM?

Tags:

php

xml

I'm trying to figure out how to delete an element (and its children) from a very large XML file in PHP (latest version).

I know I can use DOM and SimpleXML, but both require the document to be loaded into memory.

I am looking at the XMLWriter/XMLReader/XML Parser functions and googling, but there seems to be nothing on the subject (all answers recommend using DOM or SimpleXML). That cannot be correct--am I missing something?

The closest thing I've found is this (C#):

You can use an XmlReader to sequentially read your xml (ReadOuterXml might be useful in your case to read a whole node at a time). Then use an XmlWriter to write out all the nodes you want to keep. ( Deleting nodes from large XML files )

Really? Is that the approach? I have to copy the entire huge file?

Is there really no other way?

One approach

As suggested,

I could read the data using PHP's XMLReader or XML Parser, possibly buffer it, and write/append it back out to a new file.

But is this approach really practical?

I have experience with splitting huge XML files into smaller pieces, basically using the suggested method, and it took a very long time for the process to finish.

My dataset isn't currently big enough to give me an idea of how this would work out. I can only assume that the results would be the same (a very slow process).

Does anybody have experience with applying this in practice?

asked Aug 11 '12 by user1267259

1 Answer

There are a couple of ways to process large documents incrementally, so that you do not need to load the entire structure into memory at once. In either case, yes, you will need to write back out the elements you wish to keep and omit the ones you want to remove.

  1. PHP has XMLReader, an implementation of a pull parser. An explanation:

    A pull parser creates an iterator that sequentially visits the various elements, attributes, and data in an XML document. Code which uses this iterator can test the current item (to tell, for example, whether it is a start or end element, or text), and inspect its attributes (local name, namespace, values of XML attributes, value of text, etc.), and can also move the iterator to the next item. The code can thus extract information from the document as it traverses it.

  2. Or you could use PHP's SAX-based XML Parser (the xml_parser_* functions). An explanation:

    Simple API for XML (SAX) is a lexical, event-driven interface in which a document is read serially and its contents are reported as callbacks to various methods on a handler object of the user's design. SAX is fast and efficient to implement, but difficult to use for extracting information at random from the XML, since it tends to burden the application author with keeping track of what part of the document is being processed.
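The pull approach from option 1 can be sketched with XMLReader plus XMLWriter: walk the input one node at a time, copy each subtree you want to keep with readOuterXml(), and skip the rest. This is only a sketch under assumptions: the file names, the flat <records>/<record> layout, and the id being dropped are all made up, and each individual <record> subtree is assumed small enough to fit in memory (the whole document is not).

```php
<?php
// Sketch: stream input.xml to output.xml, dropping the <record> with
// id="2". Structure, file names, and the dropped id are hypothetical.

file_put_contents('input.xml',
    '<records><record id="1"><name>a</name></record>' .
    '<record id="2"><name>b</name></record>' .
    '<record id="3"><name>c</name></record></records>');

$reader = new XMLReader();
$reader->open('input.xml');

$writer = new XMLWriter();
$writer->openUri('output.xml');
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('records');          // re-emit the root element

// Advance the cursor to the first <record>.
while ($reader->read() && $reader->name !== 'record') {
}

while ($reader->name === 'record') {
    if ($reader->getAttribute('id') !== '2') {
        // Keep this node: copy its raw XML (element + children) verbatim.
        $writer->writeRaw($reader->readOuterXml());
    }
    $reader->next('record');               // jump to the next sibling <record>
}

$writer->endElement();
$writer->endDocument();
$writer->flush();
$reader->close();
```

Only one `<record>` subtree is ever held in memory at a time, so peak memory is bounded by the largest record rather than the whole file.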

A lot of people prefer the pull method, but either one meets your requirement. Keep in mind that "large" is relative. If the document fits in memory, it will almost always be easier to use the DOM. But for really, really large documents, that simply might not be an option.

answered Sep 23 '22 by Wayne