
How to parse <rss> tag with XML::LibXML to find xmlns definitions

Tags: rss, perl

There seems to be no consistent way in which podcasts define their RSS feeds. I ran into one that declares a different set of XML namespaces on its <rss> element.

What's the best way to scan an RSS feed for its XML namespace declarations using XML::LibXML?

E.g.

One feed might be

<rss 
    xmlns:content="http://purl.org/rss/1.0/modules/content/" 
    xmlns:wfw="http://wellformedweb.org/CommentAPI/" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:atom="http://www.w3.org/2005/Atom" 
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" 
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

Another might be

<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom">

I want to include in my script an assessment of all the namespaces being used so that when parsing the rss, the appropriate field names can be tracked.

I'm not sure yet what that will look like, because I don't know whether this module can break the <rss> tag's attributes down the way I want.

Ken Ingram asked Mar 05 '23 13:03


2 Answers

I'm not sure I understand exactly what kind of output you're looking for, but XML::LibXML is indeed able to list the namespaces:

use warnings;
use strict;
use XML::LibXML;

my $dom = XML::LibXML->load_xml(string => <<'EOT');
<rss 
    xmlns:content="http://purl.org/rss/1.0/modules/content/" 
    xmlns:wfw="http://wellformedweb.org/CommentAPI/" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:atom="http://www.w3.org/2005/Atom" 
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" 
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
</rss>
EOT
for my $ns ($dom->documentElement->getNamespaces) {
    print $ns->getLocalName(), " / ", $ns->getData(), "\n";
}

Output:

content / http://purl.org/rss/1.0/modules/content/
wfw / http://wellformedweb.org/CommentAPI/
dc / http://purl.org/dc/elements/1.1/
atom / http://www.w3.org/2005/Atom
sy / http://purl.org/rss/1.0/modules/syndication/
slash / http://purl.org/rss/1.0/modules/slash/
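
Since the question asks for an assessment of every namespace a feed uses, the loop above can be extended into a lookup table. The following is only a minimal sketch: `collect_namespaces` is a hypothetical helper name, and it assumes that declarations may also appear on elements below `<rss>`, so it visits every element instead of only the root. The `(default)` key is an arbitrary label for an unprefixed `xmlns="..."` declaration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: maps each declared prefix to its namespace URI,
# looking at every element, not only the root.
# Caveat: if a prefix is redeclared with another URI deeper in the
# document, the later declaration overwrites the earlier one here.
sub collect_namespaces {
    my ($dom) = @_;
    my %uri_for;
    for my $element ($dom->findnodes('//*')) {
        for my $ns ($element->getNamespaces) {
            my $prefix = $ns->declaredPrefix;
            # An unprefixed xmlns="..." declaration has no usable prefix.
            $prefix = '(default)' unless defined $prefix && length $prefix;
            $uri_for{$prefix} = $ns->declaredURI;
        }
    }
    return \%uri_for;
}

my $feed = XML::LibXML->load_xml(string => <<'EOT');
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom">
  <channel/>
</rss>
EOT

my $uri_for = collect_namespaces($feed);
print "$_ => $uri_for->{$_}\n" for sort keys %$uri_for;
```

Keying the table by prefix is convenient for printing, but the URI is the stable identifier, so a script that compares feeds would probably want to compare the hash values instead of the keys.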
haukex answered Mar 11 '23 09:03


I know that the OP has already accepted an answer, but for completeness' sake it should be mentioned that the recommended way to make DOM searches resilient to differing namespace prefixes is to use XML::LibXML::XPathContext:

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

my @examples = (
    <<EOT
<rss xmlns:atom="http://www.w3.org/2005/Atom">
  <atom:test>One Ring to rule them all,</atom:test>
</rss>
EOT
    ,
    <<EOT
<rss xmlns:a="http://www.w3.org/2005/Atom">
  <a:test>One Ring to find them,</a:test>
</rss>
EOT
    ,
    <<EOT
<rss xmlns="http://www.w3.org/2005/Atom">
  <test>The end...</test>
</rss>
EOT
    ,
);

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs('atom', 'http://www.w3.org/2005/Atom');

for my $example (@examples) {
    # load_xml() throws an exception on parse errors, so no "or die" is needed
    my $dom = XML::LibXML->load_xml(string => $example);

    for my $node ($xpc->findnodes("//atom:test", $dom)) {
        printf("%-10s: %s\n", $node->nodeName, $node->textContent);
    }
}

exit 0;

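That is, you register your own prefix for each namespace URI you are interested in. The XPath expression is then matched by URI, regardless of which prefix (or default namespace) the feed itself uses.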

Output:

$ perl dummy.pl
atom:test : One Ring to rule them all,
a:test    : One Ring to find them,
test      : The end...
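
Applied to the podcast feeds from the question, the same idea could look roughly like the sketch below. The `%known` table, the prefixes in it, the `first_text` helper, and the sample feed content are all made up for illustration; only the namespace URIs come from the question.

```perl
use strict;
use warnings;
use XML::LibXML;

# Prefixes chosen by this script (an assumption for illustration);
# only the URIs have to match what the feeds declare.
my %known = (
    itunes  => 'http://www.itunes.com/dtds/podcast-1.0.dtd',
    atom    => 'http://www.w3.org/2005/Atom',
    content => 'http://purl.org/rss/1.0/modules/content/',
    dc      => 'http://purl.org/dc/elements/1.1/',
);

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs($_, $known{$_}) for keys %known;

# Hypothetical helper: text of the first node matching $xpath, or undef.
sub first_text {
    my ($xpath, $context) = @_;
    my ($node) = $xpc->findnodes($xpath, $context);
    return defined $node ? $node->textContent : undef;
}

# Made-up sample feed that binds the iTunes namespace to the prefix "it".
my $dom = XML::LibXML->load_xml(string => <<'EOT');
<rss xmlns:it="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
  <channel>
    <title>Example Show</title>
    <it:author>Example Author</it:author>
  </channel>
</rss>
EOT

# The query uses the script's own "itunes" prefix, not the feed's "it".
my $author = first_text('/rss/channel/itunes:author', $dom);
print "author: ", (defined $author ? $author : '(none)'), "\n";
```

With this approach the script never needs to know which prefixes a given feed picked; a field that is absent simply yields `undef`.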
Stefan Becker answered Mar 11 '23 09:03