
Perl: How to handle a stream of XML Objects without a root node

I need to parse a huge file with Perl, so I'll be using a streaming parser. The file contains multiple XML documents (objects) but no root node. This causes the XML parser to abort after the first object, as it should. The answer is probably to wrap the stream in a fake root node:

<FAKE_ROOT_TAG>Original Stream</FAKE_ROOT_TAG>

Since the file is huge (> 1 GB), I don't want to copy or rewrite it. I would rather use a class or module that transparently (for the XML parser) merges or concatenates multiple streams:

stream1 : <FAKE_ROOT_TAG>                 \
stream2 : Original Stream from file        >   merged stream
stream3 : </FAKE_ROOT_TAG>                / 

Can you point me to such a module or sample code for this problem?

lexu asked Jan 11 '23 23:01


2 Answers

Here's a simple example of how you might do it by passing a fake filehandle to your XML parser. This object overloads the readline operator (<>) to return your fake root tags with the lines from the file in between.

package FakeFile;

use strict;
use warnings;

use overload '<>' => \&my_readline;

sub new {
    my $class = shift;
    my $filename  = shift;

    open my $fh, '<', $filename or die "open $filename: $!";

    return bless { fh => $fh }, $class;
}

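# Called for every <$obj> read: returns the opening tag first, then the
# file's lines one at a time, then the closing tag, then undef.
sub my_readline {
    my $self = shift;
    return if $self->{done};

    # First call: emit the fake opening tag before touching the file.
    if ( not $self->{started} ) {
        $self->{started} = 1;
        return '<fake_root_tag>';
    }

    # File exhausted: emit the fake closing tag exactly once.
    if ( eof $self->{fh} ) {
        $self->{done} = 1;
        return '</fake_root_tag>';
    }

    return readline $self->{fh};
}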


1;

This won't work if your parser expects a genuine filehandle (for example, if it reads with something like sysread instead of the <> operator), but perhaps you'll find it inspirational.

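Example usage (with the package saved as FakeFile.pm in the current directory; -I. is needed because Perl 5.26 and later no longer search the current directory for modules):

echo "one
two
three" > myfile
perl -I. -MFakeFile -E 'my $f = FakeFile->new( "myfile" ); print while <$f>'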
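
If the parser does insist on a real filehandle, as cautioned above, one possible workaround is to skip the fake handle and produce the wrapped stream as a series of chunks that you feed to an incremental (push) parser yourself. The following is only a sketch under that assumption: wrapped_chunks, its default tag name and its chunk size are made up for illustration and are not part of any module.

```perl
use strict;
use warnings;

# wrapped_chunks is a hypothetical helper (not from any CPAN module).
# It returns an iterator that yields the opening tag, then the file in
# fixed-size chunks, then the closing tag, and undef once exhausted.
sub wrapped_chunks {
    my ( $filename, $root, $chunk_size ) = @_;
    $root       //= 'fake_root_tag';    # assumed tag name
    $chunk_size //= 64 * 1024;          # assumed chunk size in bytes

    open my $fh, '<', $filename or die "open $filename: $!";
    my $state = 'start';

    return sub {
        if ( $state eq 'start' ) {
            $state = 'body';
            return "<$root>";
        }
        if ( $state eq 'body' ) {
            my $n = read( $fh, my $buf, $chunk_size );
            die "read $filename: $!" unless defined $n;
            return $buf if $n > 0;

            # End of file: close the handle and emit the closing tag.
            close $fh;
            $state = 'done';
            return "</$root>";
        }
        return;    # nothing left to yield
    };
}

# Usage sketch:
#   my $next = wrapped_chunks('myfile');
#   while ( defined( my $chunk = $next->() ) ) {
#       ...    # hand $chunk to your parser
#   }
```

With XML::Parser, for instance, the chunks could be passed to the object returned by parse_start through its parse_more method, followed by a final parse_done. The file is never copied, and only one chunk is held in memory at a time.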
friedo answered Jan 13 '23 12:01


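Here's a trick pulled from PerlMonks: declare the real file as an external entity in a small wrapper document, so the parser reads it in place between the wrapper's root tags and the file itself is never rewritten.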

#!/usr/bin/perl

use strict;
use warnings;

use XML::Parser;
use XML::LibXML;

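# The file holding the root-less XML objects.
my $doc_file = shift @ARGV;

# Wrapper document: the real file is declared as an external entity
# and pulled in between the wrapper's root tags.
my $xml = qq{
     <!DOCTYPE doc
           [<!ENTITY real_doc SYSTEM "$doc_file">]
     >
     <doc>
         &real_doc;
     </doc>
};

# With the Stream style and no handler subs defined, XML::Parser
# prints a copy of the parsed document.
{
    print "XML::Parser:\n";
    XML::Parser->new( Style => 'Stream' )->parse($xml);
}

# XML::LibXML builds the whole document in memory before printing it.
{
    print "XML::LibXML:\n";
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_string($xml);
    print $doc->toString;
}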
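
The wrapper document in the script above interpolates the file name straight into the DOCTYPE, which can go wrong for relative paths or names containing a double quote. A small helper could build it more defensively. This is only a sketch: wrapper_doc is a hypothetical name, and it assumes the parser accepts a plain absolute path as the system identifier, as the script above does.

```perl
use strict;
use warnings;
use File::Spec;

# wrapper_doc is a hypothetical helper: it returns a small XML document
# that pulls $file in as an external entity below a <$root> element.
sub wrapper_doc {
    my ( $file, $root ) = @_;
    $root //= 'doc';

    # Resolve relative paths up front so the result does not depend on
    # what the parser considers its base directory.
    my $path = File::Spec->rel2abs($file);

    # The system identifier is written in double quotes, so a path
    # containing a double quote cannot be represented here.
    die "path must not contain a double quote: $path" if $path =~ /"/;

    return sprintf
        qq{<!DOCTYPE %s [<!ENTITY real_doc SYSTEM "%s">]>\n<%s>&real_doc;</%s>\n},
        $root, $path, $root, $root;
}
```

The returned string can be handed to the parse or parse_string calls shown above in place of $xml.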
runrig answered Jan 13 '23 13:01