
How to parse between <div class ="foo"> and </div> easily in Perl

Tags: html, parsing, perl

I want to parse a website into a Perl data structure. First I load the page with

use LWP::Simple;
my $html = get("http://f.oo");

Now I know two ways to deal with it: first, regular expressions, and second, modules.

I started by reading about HTML::Parser and found some examples, but I'm not that sure about my Perl knowledge.

My code example continues:

my @links;

use HTML::Parser;

my $p = HTML::Parser->new();
$p->handler(start => \&start_handler, "tagname,attr,self");
$p->parse($html);
$p->eof;    # signal end of document so buffered text is flushed

foreach my $link(@links){
  print "Linktext: ",$link->[1],"\tURL: ",$link->[0],"\n";
}

sub start_handler{
  return if(shift ne 'a');
  my ($class) = shift->{href};
  my $self = shift;
  my $text;
  $self->handler(text => sub{$text = shift;},"dtext");
  $self->handler(end => sub{push(@links,[$class,$text]) if(shift eq 'a')},"tagname");
}

I don't understand why shift is called twice here. The second one should yield the self pointer. But the first makes me think that the self reference has already been shifted, used as a hash, and the value for href stored in $class. Could someone explain this line (my ($class) = shift->{href};)?

Apart from this gap in my understanding, I do not want to parse all the URLs. I want to put all the code between <div class ="foo"> and </div> into a string. There is a lot of code in between, especially other <div></div> tags, so either I or a module has to find the matching end tag. After that I plan to scan the string again to find specific elements and classes, like <h1>, <h2>, <p class ="foo2"></p>, etc.

I hope this information helps you give me some useful advice. Please keep in mind that, above all, I want an approach that is easy to understand; it does not need to perform well at this stage!

froehli asked Dec 19 '11 23:12


4 Answers

HTML::Parser is more of a tokenizer than a parser. It leaves a lot of hard work up to you. Have you considered using HTML::TreeBuilder (which uses HTML::Parser) or XML::LibXML (a great library which has support for HTML)?
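
A minimal sketch of the HTML::TreeBuilder route, in case it helps. The sample markup and the variable names are invented for illustration, and it assumes the class attribute is exactly "foo" (look_down compares attribute values as whole strings, so class="foo bar" would not match this way).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup invented for illustration; a nested <div> sits inside div.foo.
my $html = <<'HTML';
<html><body>
<div class="foo"><h1>Title</h1><div class="inner"><p class="foo2">Text</p></div></div>
<div class="bar">ignored</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element.
my $foo = $tree->look_down(_tag => 'div', class => 'foo')
    or die "no <div class=\"foo\"> found\n";

# Serialize the element, nested tags included, into one string.
# The empty hashref asks as_HTML to emit every end tag.
my $fragment = $foo->as_HTML(undef, undef, {});
print $fragment, "\n";

# Second pass: pick out specific descendants of that element.
my $h1 = $foo->look_down(_tag => 'h1');
my $p  = $foo->look_down(_tag => 'p', class => 'foo2');
print "h1: ", $h1->as_text, "\n" if $h1;
print "p.foo2: ", $p->as_text, "\n" if $p;

$tree->delete;    # free the tree explicitly
```

Because the tree already knows where each element ends, there is no need to count nested <div> tags by hand.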

ikegami answered Nov 07 '22 17:11


Use HTML::TokeParser::Simple.

Untested code based on your description:

#!/usr/bin/env perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(url => 'http://example.com/example.html');

my $level = 0;    # nesting depth inside the matching <div class="foo">

while (my $tag = $p->get_tag('div')) {
    my $class = $tag->get_attr('class');
    next unless defined($class) and $class eq 'foo';

    $level += 1;    # entered the outer <div class="foo">

    while (my $token = $p->get_token) {
        # Track nested <div>s so the loop stops at the matching </div>.
        $level += 1 if $token->is_start_tag('div');
        $level -= 1 if $token->is_end_tag('div');
        print $token->as_is;
        last unless $level;
    }
}

Sinan Ünür answered Nov 07 '22 17:11


No need to get so complicated. You can retrieve and find elements in the DOM using CSS selectors with Mojo::UserAgent:

use feature 'say';
use Mojo::UserAgent;

say Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo');

or, loop through the elements found:

say $_ for Mojo::UserAgent->new->get('http://f.oo')->res->dom
    ->find('div.foo')->each;

or, loop using a callback:

Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo')->each(sub {
  my ($el, $count) = @_;    # element first, then its position
  say "$count: $el";
});
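
For the "put it into a string, then scan it again" part of the question, here is a small offline sketch using Mojo::DOM directly on an in-memory string. The sample markup is made up, and the selector list 'h1, h2, p.foo2' is only a guess at what the second pass would look for.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample markup invented for illustration.
my $html = '<div class="foo"><h1>Title</h1><div><p class="foo2">Text</p></div></div>'
         . '<div>other</div>';

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching the CSS selector, or undef.
my $foo = $dom->at('div.foo') or die "no div.foo found\n";

# content() gives the markup inside the element as a string,
# nested <div> tags included.
my $inner = $foo->content;
say $inner;

# Second pass, restricted to descendants of div.foo.
my @hits = $foo->find('h1, h2, p.foo2')->each;
say $_->tag, ': ', $_->text for @hits;
```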

tempire answered Nov 07 '22 17:11


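According to the docs, the handler is called with the arguments named in its argspec string. Here that string is "tagname,attr,self", so @_ holds ($tagname, \%attr, $self). There are three shifts, one for each argument: the first (in the return line) removes the tag name, the second removes the attribute hash reference, and the third removes the parser object.

my ($class) = shift->{href};

is equivalent to:

my $class;
my %attr;
my $attr_ref;

$attr_ref = shift;
%attr = %$attr_ref;
$class = $attr{'href'};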
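
To see the order of the shifts without a parser involved, here is a core-Perl sketch. demo_handler is a hypothetical stand-in for the question's start_handler, called by hand with the arguments that the argspec "tagname,attr,self" would supply; the string 'PARSER' merely takes the place of the parser object.

```perl
use strict;
use warnings;

# Hypothetical stand-in for start_handler: it returns what it pulled
# off @_ instead of registering further handlers.
sub demo_handler {
    # On entry @_ is (tagname, \%attr, $self).
    return if (shift ne 'a');       # 1st shift removes the tag name
    my ($href) = shift->{href};     # 2nd shift removes \%attr, then looks up href
    my $self   = shift;             # 3rd shift removes the parser object
    return ($href, $self);
}

my ($href, $self) = demo_handler('a', { href => 'http://example.com/' }, 'PARSER');
print "href=$href self=$self\n";    # prints "href=http://example.com/ self=PARSER"

# For any other tag the handler returns early with an empty list.
my @skipped = demo_handler('p', {}, 'PARSER');
print "values returned for <p>: ", scalar(@skipped), "\n";
```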

Amadan answered Nov 07 '22 16:11