
Serially process XML data in perl

I'm wondering which XML parser people think would be best in my situation for Perl. I've done a lot of reading and have tried XML::LibXML and XML::SAX. The first used too much memory and the second didn't seem very quick to me (even after switching off the pure Perl parser).

My needs are fairly specific. I am receiving a largish response, up to 50MB, via the Net::SSH library. I would like to pass this data to an XML library as I receive it, so as to keep the minimum amount of data in memory. I need to look for data in certain tags and do whatever with it: in some cases sum a bunch of values, in other cases just extract values and write them to files. So I need an XML parser that can work serially, is quick, and uses a minimum of memory. The data I get arrives in chunks of up to 1024 bytes, so I would like to be able to just do something like $myparser->sendData($mynewData) and then have functions called when a new tag is opened or closed, similar to what XML::SAX does.

I don't necessarily need XPath or XSLT.
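To illustrate the shape of interface I'm after, here is a rough sketch using XML::Parser's non-blocking (push) mode. Reading a local file in 1024-byte blocks stands in for the data arriving over SSH, and the handlers just print tag names:

    use strict;
    use warnings;
    use XML::Parser;

    # Placeholder for however the SSH connection hands back each block of data.
    open my $fh, '<', 'response.xml' or die "open: $!";
    my $read_chunk = sub {
        my $n = read $fh, my $buf, 1024;
        return $n ? $buf : undef;
    };

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { my ($p, $tag) = @_; print "open <$tag>\n"  },
            End   => sub { my ($p, $tag) = @_; print "close <$tag>\n" },
        },
    );

    my $nb = $parser->parse_start;        # returns an XML::Parser::ExpatNB object
    while (defined(my $chunk = $read_chunk->())) {
        $nb->parse_more($chunk);          # feed each chunk as it arrives
    }
    $nb->parse_done;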

MikeKulls asked Jan 03 '13 05:01

2 Answers

I would recommend using XML::Twig.

This module is very convenient to use, and it can read data serially without using much memory.

Probably one of the most distinctive features of XML::Twig is that it lets you parse XML in a so-called hybrid model: you can parse the whole document (which needs the whole document in memory), you can use callbacks to parse small chunks (which allows streaming and small memory consumption), or you can use any combination of the two.

This combined model turns out to be its most convenient feature: load a small subtree from the stream, and you can access all of its branches essentially for free.
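For example, a minimal streaming sketch might look like this. The element path 'record/value' and the summing are made-up placeholders for your own tags; purge throws away everything already handled so memory stays flat:

    use strict;
    use warnings;
    use XML::Twig;

    my $sum = 0;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # called each time a <value> inside a <record> has been fully parsed
            'record/value' => sub {
                my ($t, $elt) = @_;
                $sum += $elt->text;
                $t->purge;    # discard the part of the tree parsed so far
            },
        },
    );

    $twig->parsefile('response.xml');
    print "Total: $sum\n";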

mvp answered Oct 01 '22 20:10

You could also go with plain old XML::Parser, which does pretty much exactly what you're asking for:

"This module provides ways to parse XML documents. It is built on top of XML::Parser::Expat, which is a lower level interface to James Clark's expat library. Each call to one of the parsing methods creates a new instance of XML::Parser::Expat which is then used to parse the document. Expat options may be provided when the XML::Parser object is created. These options are then passed on to the Expat object on each parse call. They can also be given as extra arguments to the parse methods, in which case they override options given at XML::Parser creation time."

"Expat is an event based parser. As the parser recognizes parts of the document (say the start or end tag for an XML element), then any handlers registered for that type of an event are called with suitable parameters."

I've used it for parsing Wikipedia XML dumps, which are several GB in size even after compression, and found it to work very well for that. Compared to that, a 50 MB file should be a piece of cake.

Ilmari Karonen answered Oct 01 '22 18:10