Seems like I'm a little bit lost.
I need to parse a large (about 100 MB) and quite ugly XML file. If I use parsefile, it returns an error ("junk after document element"), but it will happily parse smaller elements of the file.
So I decided to break the file into elements and parse them individually. Since parsing XML with regular expressions is discouraged (I tried it anyway, but got duplicate results), I tried Text::Balanced.
Something like
use Text::Balanced qw/extract_tagged/;

while (<FILE>) {
    my $result = extract_tagged($_, "<tag>");
    print $result if defined $result;
}
works just fine, so I can extract tagged entries that fit on a single line. With something bigger, however,

use Text::Balanced qw/extract_tagged/;
use File::Slurp;

my $text = read_file("file");
my $result = extract_tagged($text, "<tag>");
print $result;

does not work. It reads the file, but it cannot find a tagged item there.
So the question is: how do I extract anything between given tags without XML::Parser? And I really, really need to avoid chomping it if possible.
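One possible fix, assuming the multi-line version fails because extract_tagged only matches at the start of the string (after an optional whitespace prefix): pass a custom prefix pattern that skips ahead to the next tag. In a scalar context, extract_tagged also removes the match from its first argument, so repeated calls walk through the file. Here "file" and "<tag>" are placeholders for your actual file and tag names:

```perl
use Text::Balanced qw/extract_tagged/;
use File::Slurp;

my $text = read_file("file");

# The fourth argument is a prefix pattern; (?s) lets . cross newlines,
# so the prefix skips everything up to the next occurrence of <tag>.
while (my $chunk = extract_tagged($text, "<tag>", "</tag>", '(?s).*?(?=<tag>)')) {
    print $chunk, "\n";
}
```

The loop ends naturally when no further tagged section is found, since extract_tagged then returns undef.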
P.S. Searching just returns regex guides, heredoc howtos, and anything but what I'm looking for.
P.P.S. I'm a moron: I've been trying to parse an invalid file. Still curious how to chop up a file if the parser fails, though.
bvr's answer was close; it really would retrieve some data, but not if the top-level tag is missing.
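One common workaround for the missing-top-level-tag case, sketched here with XML::LibXML (the file name, dummy root name, and tag name are all placeholders): wrap the whole content in a synthetic root element so it becomes a single well-formed document, then query for the elements you want.

```perl
use XML::LibXML;
use File::Slurp;

# The fragments have no single top-level element, so wrap them in a
# dummy root before handing the string to the parser.
my $xml = "<root>" . read_file("file") . "</root>";

my $dom = XML::LibXML->load_xml(string => $xml);
for my $node ($dom->findnodes('//tag')) {
    print $node->toString, "\n";
}
```

This only helps when each fragment is itself well formed; it will not rescue a file with genuinely broken markup inside the fragments.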
For broken XML, I would try setting the recover option on XML::LibXML. It makes the parser report errors but continue instead of aborting.
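A minimal sketch of that approach ("file.xml" is a placeholder): recover => 1 warns about each parse error and keeps going, while recover => 2 does the same silently.

```perl
use XML::LibXML;

# recover => 1 reports errors but continues parsing;
# recover => 2 suppresses the error output as well.
my $dom = XML::LibXML->load_xml(
    location => "file.xml",
    recover  => 1,
);
print $dom->toString;
```

The resulting document contains whatever the parser could salvage, which may be less than the full input if the damage is severe.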