Ever since I asked how to parse HTML with regex and got bashed a bit (rightfully so), I've been studying the HTML::TreeBuilder, HTML::Parser, HTML::TokeParser, and HTML::Element Perl modules.
I have HTML like this:
<div id="listSubtitlesFilm">
<dt id="a1">
<a href="/45/subtitles-67624.aspx">
.45 (2006)
</a>
</dt>
</div>
I want to parse out the /45/subtitles-67624.aspx, but more importantly I want to know how to parse out the contents of the div.
I was given this example on a previous question:
while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        # e.g. http://subscene.com/english/Sit-Down-Shut-Up-First-Season/subtitles-272112.aspx
        push @dnldLinks, $1 if $href =~ m!/subtitles-(\d{2,8})\.aspx!;
    }
}
This worked perfectly for that, but when I tried to edit it a bit and use it on a div, it didn't work. Here is the code I tried:
while (my $anchor = $p->get_tag("dt")) {
    if($stuff = $anchor->get_attr('a1')) {
        print $stuff."\n";
    }
}
To address your specific question, given the HTML:
<div id="listSubtitlesFilm">
<dt id="a1">
<a href="/45/subtitles-67624.aspx">
.45 (2006)
</a>
</dt>
</div>
I am assuming you are interested in the anchor text, i.e. ".45 (2006)" in this case, but only if the anchor occurs in a div with id listSubtitlesFilm.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);
my @dnldLinks;
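while ( my $div = $parser->get_tag('div') ) {
    # only look inside the div we care about
    my $id = $div->get_attr('id');
    next unless defined($id) and $id eq 'listSubtitlesFilm';

    # stop cleanly if the document ends before another anchor appears
    my $anchor = $parser->get_tag('a') or last;
    my $href = $anchor->get_attr('href');
    next unless defined($href)
        and $href =~ m!/subtitles-(\d{2,8})\.aspx\z!;

    # store [anchor text, numeric id captured from the href]
    push @dnldLinks, [$parser->get_trimmed_text('/a'), $1];
}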
use Data::Dumper;
print Dumper \@dnldLinks;
__DATA__
<div id="listSubtitlesFilm">
<dt id="a1">
<a href="/45/subtitles-67624.aspx">
.45 (2006)
</a>
</dt>
</div>
Output:
$VAR1 = [ [ '.45 (2006)', '67624' ] ];
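As for why the dt attempt in the question printed nothing: get_attr expects an attribute name such as 'id' or 'href', and 'a1' is the value of the id attribute, not a name. The href also lives on the a tag inside the dt, not on the dt itself. Here is a minimal sketch of that loop with both points changed. It assumes the HTML is held in a string (the $html variable and the @links array are just illustrative names) and uses the string option of HTML::TokeParser::Simple->new in place of a filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample markup copied from the question; $html is an illustrative name.
my $html = <<'HTML';
<div id="listSubtitlesFilm">
<dt id="a1">
<a href="/45/subtitles-67624.aspx">
.45 (2006)
</a>
</dt>
</div>
HTML

my $parser = HTML::TokeParser::Simple->new(string => $html);

my @links;
while ( my $dt = $parser->get_tag('dt') ) {
    # get_attr takes the attribute NAME ('id'), not its value ('a1')
    my $id = $dt->get_attr('id');
    next unless defined $id and $id eq 'a1';

    # the href belongs to the <a> inside the <dt>, so advance to that tag
    my $anchor = $parser->get_tag('a') or last;
    my $href   = $anchor->get_attr('href');
    push @links, $href if defined $href;
}

print "$_\n" for @links;
```

With the sample markup above, this should print the single path /45/subtitles-67624.aspx.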
You could use (yet another module!) HTML::TreeBuilder::XPath, which, as per its name, will let you use XPath on HTML::TreeBuilder objects.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $root = HTML::TreeBuilder::XPath->new_from_file( "my.html");
# print $root->as_HTML; # useful to see how HTML::TreeBuilder
# understands your HTML. For example it will wrap the implied
# dl element around dt, which you need to take into account
# when writing the XPath query below
my $id= "a1";
# you need the .//dt because of the extra dl
my @divs= $root->findnodes( qq{//div[.//dt[\@id="$id"]]});
print $divs[0]->as_HTML; # or as_text
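If what you want is the href and the link text rather than the whole div, the same approach can select the anchors directly. This is only a sketch under a few assumptions: the HTML is parsed from a string with new_from_content instead of from my.html, and the $html, $link and @links names are made up for the example. Because the query uses the descendant axis (//a), the implied dl that HTML::TreeBuilder adds does not need to be spelled out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup copied from the question; $html is an illustrative name.
my $html = <<'HTML';
<div id="listSubtitlesFilm">
<dt id="a1">
<a href="/45/subtitles-67624.aspx">
.45 (2006)
</a>
</dt>
</div>
HTML

my $root = HTML::TreeBuilder::XPath->new_from_content($html);

# every anchor with an href anywhere below the div with the wanted id
my @links;
for my $link ( $root->findnodes(q{//div[@id="listSubtitlesFilm"]//a[@href]}) ) {
    # as_trimmed_text strips the whitespace around the anchor text
    push @links, [ $link->as_trimmed_text, $link->attr('href') ];
}

printf "%s => %s\n", @$_ for @links;

# free the tree explicitly, as the HTML::TreeBuilder docs recommend
$root->delete;
```

Each entry in @links is a pair of [text, href], so further filtering (for example the subtitles-(\d+) regex from the other answer) can be applied to the second element.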