After encountering xml data files containing huge text nodes, I looked for some ways to read and evaluate them in my data processing scripts.
The XML files are 3D coordinate files for molecular modeling applications and have this structure (example):
<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
<configuration>
<position>
-0.101000 0.011000 -40.000000
-0.077000 0.008000 -40.469000
-0.008000 0.001000 -40.934000
-0.301000 0.033000 -41.157000
0.213000 -0.023000 -41.348000
...
... 300,000 to 500,000 lines may follow >>
...
-0.140000 0.015000 -42.556000
</position>
<next_huge_section_of_the_same_pattern>
...
...
...
</next_huge_section_of_the_same_pattern>
</configuration>
</hoomd_xml>
Each XML file contains several huge text nodes and is between 60MB and 100MB in size, depending on the contents.
I tried the naive approach using XML::Simple first, but the loader would take forever to initially parse the file:
...
my $data = $xml->XMLin('structure_80mb.xml');
...
and stop with "internal error: huge input lookup", so this approach isn't very practicable.
The next try was to use XML::LibXML for reading, but here the initial loader would bail out immediately with the error message "parser error : xmlSAX2Characters: huge text node".
Before posting this question on Stack Overflow, I wrote a quick-and-dirty parser for myself and ran the file through it (after slurping the xx MB XML file into the scalar $xml):
...
# read the <position> data from in-memory xml file
my @Coord = xml_parser_hack('position', $xml);
...
which returns the data of each line as an array, completes within seconds and looks like this:
sub xml_parser_hack {
    my ($tagname, $xml) = @_;
    return () unless $xml =~ /^</;
    my @Data = ();
    my ($p0, $p1) = (undef, undef);
    # both matches use /g in scalar context: when the start tag matches,
    # the end-tag search resumes right after it
    $p0 = $+[0] if $xml =~ /^<$tagname[^>]*>[^\r\n]*[\r\n]+/msg; # start tag
    $p1 = $-[0] if $xml =~ /^<\/$tagname[^>]*>/msg;              # end tag
    return () unless defined $p0 && defined $p1;
    my @Lines = split /[\r\n]+/, substr $xml, $p0, $p1-$p0;
    for my $line (@Lines) {
        push @Data, [ split /\s+/, $line ];
    }
    return @Data;
}
This works fine so far but cannot be considered 'production ready', of course.
Q: How would I read the file using a Perl module? Which module would I choose?
Thanks in advance
rbo
Addendum: after reading choroba's comment, I looked deeper into XML::LibXML.
Opening the file with my $reader = XML::LibXML::Reader->new(location =>'structure_80mb.xml');
works, contrary to what I thought before. The error occurs when I try to access the text node below the tag:
...
while ($reader->read) {
# bails out in the loop iteration after accessing the <position> tag,
# if the position's text node is accessed
# -- xmlSAX2Characters: huge text node ---
...
Try XML::LibXML with the huge parser option:
my $doc = XML::LibXML->load_xml(
location => 'structure_80mb.xml',
huge => 1,
);
Or, if you want to use XML::LibXML::Reader:
my $reader = XML::LibXML::Reader->new(
location => 'structure_80mb.xml',
huge => 1,
);
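For the Reader variant, pulling the text out of one element could look roughly like the sketch below. The helper name next_element_text is made up for illustration, and the small inline document stands in for the real file; for the real data you would pass location => 'structure_80mb.xml' instead of string => $sample. It assumes the huge option lifts the text node limit for the Reader the same way it does for load_xml.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: advance the reader to the next <$tagname> element
# and return its text content (nothing if no such element follows).
sub next_element_text {
    my ($reader, $tagname) = @_;
    return unless $reader->nextElement($tagname) == 1;
    # copyCurrentNode(1) expands the current element into a DOM subtree
    return $reader->copyCurrentNode(1)->textContent;
}

# Small inline document standing in for the real 80MB file
my $sample = <<'XML';
<hoomd_xml version="1.4">
<configuration>
<position>
-0.101000 0.011000 -40.000000
-0.077000 0.008000 -40.469000
</position>
</configuration>
</hoomd_xml>
XML

my $reader = XML::LibXML::Reader->new(string => $sample, huge => 1);
my $text   = next_element_text($reader, 'position');
my @lines  = grep { /\S/ } split /\n/, $text;
print scalar(@lines), " non-empty lines in <position>\n";
```

Only the requested element is expanded into a DOM subtree here, so the rest of the document is streamed rather than held in memory as a tree.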
I was able to simulate an answer using XML::LibXML. Try this, and let me know if it doesn't work. I created an XML doc with more than 500k lines in the position element, and I was able to parse it and print the contents of it:
use strict;
use warnings;
use XML::LibXML;
my $xml = XML::LibXML->load_xml(location => '/perl/test.xml');
my $nodes = $xml->findnodes('/hoomd_xml/configuration/position');
print $nodes->[0]->textContent . "\n";
print scalar(@{$nodes}) . "\n";
I'm using findnodes to use an XPath expression to pull out all the nodes that I want. $nodes is just an array ref, so you can loop through it depending on how many nodes you actually have in your document.
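Since the question wants each line as an array of values, the string returned by textContent still has to be split up. A minimal sketch of that step in plain Perl follows; text_to_rows is a hypothetical helper name, and the sample string stands in for the real node content.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a block of whitespace-separated values,
# one record per line, into a list of array refs.
sub text_to_rows {
    my ($text) = @_;
    my @rows;
    for my $line (split /\R/, $text) {
        next unless $line =~ /\S/;          # skip blank lines
        push @rows, [ split ' ', $line ];   # split ' ' also drops leading whitespace
    }
    return @rows;
}

# Stand-in for $nodes->[0]->textContent
my $sample = "\n-0.101000 0.011000 -40.000000\n-0.077000 0.008000 -40.469000\n";
my @coords = text_to_rows($sample);
printf "%d rows, first z = %s\n", scalar @coords, $coords[0][2];
```

Using split ' ' rather than split /\s+/ avoids an empty first field when a line starts with whitespace.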