
Practicable way of reading xml with huge text nodes in Perl

After encountering xml data files containing huge text nodes, I looked for some ways to read and evaluate them in my data processing scripts.

The xml files are 3D co-ordinate files for molecular modeling applications and have this structure (example):

<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
        -0.008000   0.001000  -40.934000
        -0.301000   0.033000  -41.157000
         0.213000  -0.023000  -41.348000
         ...
         ... 300,000 to 500,000 lines may follow  >>
         ...
        -0.140000   0.015000  -42.556000
      </position>

      <next_huge_section_of_the_same_pattern>
        ...
        ...
        ...
      </next_huge_section_of_the_same_pattern>

   </configuration>
</hoomd_xml>

Each xml file contains several huge text nodes and is between 60MB and 100MB in size, depending on its contents.

I tried the naive approach using XML::Simple first, but the loader would take forever to initially parse the file:

...
my $data = $xml->XMLin('structure_80mb.xml');
...

and stop with "internal error: huge input lookup", so this approach isn't very practicable.

The next try was to use XML::LibXML for reading - but here, the initial loader would bail out immediately with error message "parser error : xmlSAX2Characters: huge text node".

Before writing on this topic on stackoverflow, I wrote a q&d parser for myself and sent the file through it (after slurping the xx MB xml file into the scalar $xml):

...
# read the <position> data from in-memory xml file
my @Coord = xml_parser_hack('position', $xml);
...

which returns the data of each line as an array, completes within seconds and looks like this:

sub xml_parser_hack {
 my ($tagname, $xml) = @_;
 return () unless $xml =~ /^</;

 my @Data = ();
 my ($p0, $p1) = (undef,undef);
 # /g in scalar context: the end-tag search resumes where the start-tag match ended
 $p0 = $+[0] if $xml =~ /^[ \t]*<$tagname[^>]*>[^\r\n]*[\r\n]+/msg; # start tag (may be indented)
 $p1 = $-[0] if $xml =~ /^[ \t]*<\/$tagname[^>]*>/msg;             # end tag
 return () unless defined $p0 && defined $p1;
 my @Lines = split /[\r\n]+/, substr $xml, $p0, $p1-$p0;
 for my $line (@Lines) {
    push @Data, [ split ' ', $line ];  # split ' ' ignores leading whitespace
 }
 return @Data;
}
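
The same line-oriented idea can also be applied without slurping the whole file. The following is only a minimal sketch, assuming (as the hack above does) that the start and end tags each sit on a line of their own. The name read_section and the in-memory filehandle used for the demo are illustrative and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the rows between <$tagname ...> and </$tagname>,
# reading line by line from an already opened filehandle.
sub read_section {
    my ($tagname, $fh) = @_;
    my @rows;
    my $inside = 0;
    while (my $line = <$fh>) {
        if (!$inside) {
            # start tag on its own line, possibly indented or with attributes
            $inside = 1 if $line =~ /^\s*<\Q$tagname\E(?:\s[^>]*)?>\s*$/;
            next;
        }
        last if $line =~ m{^\s*</\Q$tagname\E\s*>};   # end tag reached
        my @fields = split ' ', $line;                 # ' ' skips leading whitespace
        push @rows, \@fields if @fields;               # ignore blank lines
    }
    return @rows;
}

# Demo on an in-memory filehandle; a real script would open the xml file instead.
my $sample = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
      </position>
   </configuration>
</hoomd_xml>
XML

open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my @coord = read_section('position', $fh);
close $fh;

printf "%d rows, first z = %s\n", scalar @coord, $coord[0][2];
```

Because only one line is held in memory at a time, the size of the text node no longer matters, but like the hack above this is not a real XML parser.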

This works fine so far but cannot be considered 'production ready', of course.

Q: How would I read the file using a Perl module? Which module would I choose?

Thanks in advance

rbo


Addendum: after reading choroba's comment, I looked deeper into XML::LibXML. Opening the file with my $reader = XML::LibXML::Reader->new(location =>'structure_80mb.xml'); works, contrary to what I thought before. The error occurs when I try to access the text node below the tag:

...
while ($reader->read) {
   # bails out in the loop iteration after accessing the <position> tag,
   # if the position's text node is accessed
   #   --  xmlSAX2Characters: huge text node ---
...
rubber boots asked Oct 22 '22 08:10


2 Answers

Try XML::LibXML with the huge parser option:

my $doc = XML::LibXML->load_xml(
    location => 'structure_80mb.xml',
    huge     => 1,
);

Or, if you want to use XML::LibXML::Reader:

my $reader = XML::LibXML::Reader->new(
    location => 'structure_80mb.xml',
    huge     => 1,
);
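
As a rough sketch of how the reader variant might then be used to pull out the coordinates: the snippet below parses a small inline sample via string => so that it is self-contained. For the real file, location => 'structure_80mb.xml' would be passed instead. The variable names and the splitting into rows are assumptions for illustration, not part of the answer above.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Small stand-in for the real 80 MB file.
my $sample = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
      </position>
   </configuration>
</hoomd_xml>
XML

my $reader = XML::LibXML::Reader->new(
    string => $sample,
    huge   => 1,    # relax libxml2's hard-coded parser limits
);

my @coord;
# nextElement() skips ahead to the next element with the given name
if ($reader->nextElement('position') == 1) {
    # copy the current element, including its children, as a DOM node
    # and take its text content
    my $text = $reader->copyCurrentNode(1)->textContent;
    for my $line (split /[\r\n]+/, $text) {
        my @fields = split ' ', $line;       # ' ' skips leading whitespace
        push @coord, \@fields if @fields;    # ignore blank lines
    }
}

printf "%d rows read\n", scalar @coord;
```

Only the subtree of the element the reader is positioned on gets expanded into a DOM node here, so the rest of the document does not have to be held in memory at once.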
nwellnhof answered Oct 27 '22 21:10


I was able to simulate an answer using XML::LibXML. Try this, and let me know if it doesn't work. I created an XML doc with more than 500k lines in the position element, and I was able to parse it and print the contents of it:

use strict;
use warnings;
use XML::LibXML;

my $xml = XML::LibXML->load_xml(location => '/perl/test.xml');
my $nodes = $xml->findnodes('/hoomd_xml/configuration/position');
print $nodes->[0]->textContent . "\n";
print scalar(@{$nodes}) . "\n";

I'm using findnodes with an XPath expression to pull out all the nodes that I want. In scalar context it returns an XML::LibXML::NodeList, which can be used like an array ref, so you can loop through it depending on how many nodes you actually have in your document.
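
Looping over the matches and turning each text node into rows of numbers could look roughly like this. It is a sketch on a small inline sample. The @sections structure is made up for illustration, and for the full-size file load_xml would be given the location instead of a string.

```perl
use strict;
use warnings;
use XML::LibXML;

# Small stand-in for the real document.
my $sample = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
      </position>
   </configuration>
</hoomd_xml>
XML

my $doc = XML::LibXML->load_xml(string => $sample);

my @sections;
# in list context findnodes returns the matching nodes as a plain list
for my $node ($doc->findnodes('/hoomd_xml/configuration/position')) {
    my @rows = map  { [ split ' ', $_ ] }    # one array ref per line
               grep { /\S/ }                  # skip blank lines
               split /[\r\n]+/, $node->textContent;
    push @sections, { name => $node->nodeName, rows => \@rows };
}

printf "%s: %d rows\n", $_->{name}, scalar @{ $_->{rows} } for @sections;
```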

Joel answered Oct 27 '22 20:10