I need to parse a huge file with Perl, so I'll be using a streaming parser. The file contains multiple XML documents (objects) but no root node. This causes the XML parser to abort after the first object, as it should. The answer is probably to wrap the stream in a fake root node:
<FAKE_ROOT_TAG>Original Stream</FAKE_ROOT_TAG>
Since the file is huge (>1 GB), I don't want to copy or rewrite it. I would rather use a class or module that "merges" or "concatenates" multiple streams, transparently to the XML parser:
stream1 : <FAKE_ROOT_TAG>             \
stream2 : Original Stream from file    > merged stream
stream3 : </FAKE_ROOT_TAG>            /
Can you point me to such a module or sample code for this problem?
Here's a simple example of how you might do it by passing a fake filehandle to your XML parser. This object overloads the readline operator (<>) to return your fake root tags with the lines from the file in between.
package FakeFile;

use strict;
use warnings;

use overload '<>' => \&my_readline;

sub new {
    my $class    = shift;
    my $filename = shift;
    open my $fh, '<', $filename or die "open $filename: $!";
    return bless { fh => $fh }, $class;
}

sub my_readline {
    my $self = shift;
    return if $self->{done};
    if ( not $self->{started} ) {
        $self->{started} = 1;
        return '<fake_root_tag>';
    }
    if ( eof $self->{fh} ) {
        $self->{done} = 1;
        return '</fake_root_tag>';
    }
    return readline $self->{fh};
}

1;
This won't work if your parser expects a genuine filehandle (e.g. if it uses something like sysread), but perhaps you'll find it inspirational.
Example usage:
echo "one
two
three" > myfile
# -I. is needed on Perl 5.26 and later, where "." is no longer in @INC
perl -I. -MFakeFile -E 'my $f = FakeFile->new( "myfile" ); print while <$f>'
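If your parser has an incremental interface, you may not need a fake filehandle at all: you can push the opening tag, the file contents, and the closing tag into it chunk by chunk. XML::Parser has such an interface (parse_start returns an object with parse_more and parse_done methods). Below is a minimal sketch of the "merged stream" part only, using core Perl. The package name ChunkChain and the 64 KB chunk size are my own choices for illustration, and an in-memory string stands in for the huge file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ChunkChain (hypothetical name): yields chunks from several sources in order.
# A source is either a plain string or an open filehandle.
package ChunkChain;

sub new {
    my ( $class, @sources ) = @_;
    return bless { sources => [@sources], size => 64 * 1024 }, $class;
}

# Return the next chunk, or undef once every source is exhausted.
sub next_chunk {
    my $self = shift;
    while ( @{ $self->{sources} } ) {
        my $src = $self->{sources}[0];
        if ( !ref($src) ) {    # plain string: hand it over whole
            shift @{ $self->{sources} };
            return $src if length $src;
            next;
        }
        my $n = read $src, my $buf, $self->{size};
        die "read failed: $!" unless defined $n;
        return $buf if $n;
        shift @{ $self->{sources} };    # EOF on this handle, move to the next source
    }
    return;
}

package main;

# Demo: an in-memory handle stands in for the real file.
my $data = "<obj>1</obj>\n<obj>2</obj>\n";
open my $fh, '<', \$data or die "open: $!";

my $chain  = ChunkChain->new( '<fake_root_tag>', $fh, '</fake_root_tag>' );
my $merged = '';
while ( defined( my $chunk = $chain->next_chunk ) ) {
    $merged .= $chunk;    # an incremental parser would consume $chunk here
}
print $merged, "\n";
```

With XML::Parser, you would create the parser with your handlers, call `my $nb = $parser->parse_start;` before the loop, replace the loop body with `$nb->parse_more($chunk);`, and call `$nb->parse_done;` afterwards. The file is then read once, in fixed-size pieces, and never copied. Treat this as a starting point rather than a finished module.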
Here's a trick pulled from PerlMonks:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
use XML::LibXML;
my $doc_file= shift @ARGV;
my $xml=qq{
<!DOCTYPE doc
[<!ENTITY real_doc SYSTEM "$doc_file">]
>
<doc>
&real_doc;
</doc>
};
{
    print "XML::Parser:\n";
    my $t = XML::Parser->new( Style => 'Stream' )->parse($xml);
}

{
    print "XML::LibXML:\n";
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_string($xml);
    print $doc->toString;
}