Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HTML parsing in perl

I'm trying to parse the following HTML structure with in perl. I need to select all of the dd elements that contain the class message and also an id. All I would like the script to do is loop through all of the dd elements and print out the id of the dd element but it needs to ignore the first dd element as that is static and will not change.

It can be with any perl module as long as it can be installed from cpan to make it easy for me. I don't have much experience with perl and parsing html so any pointers would be very helpful.

Thanks :)

HTML Structure:

<pre><code>
<html>
<head>
</head>
<body>
 .....other elements
    <div id="messages">
        <div class="header"></div>
        <dl>
            <dd class="message unread mc-friend mc-message">This is just a random message, do not parse</dd>
            <dd id="msg2" class="message unread mc-message">
                Hello
            </div>
            <dd id="msg3" class="message unread mc-message">
                Hello
            </dd>
        </dl>
    </div>
</body>
</html>
</pre></code>
like image 751
Jack Avatar asked Jan 04 '11 20:01

Jack


People also ask

What is a HTML parser?

The HTML parser is a structured markup processing tool. It defines a class called HTMLParser, ​which is used to parse HTML files. It comes in handy for web crawling​.

What is parse in Perl?

Parsing text files is one of the reasons Perl makes a great data mining and scripting tool. As you'll see below, Perl can be used to basically reformat a group of text.


1 Answers

Something like this, quick and easy:

#! /usr/bin/perl
use strict;
use warnings;

use Mojo::DOM;

my $html = "Your HTML goes here";

my $dom = Mojo::DOM->new;
$dom->parse($html);
my $skip;
for my $dd ($dom->find('dd[class*="message"]')->each) {
    print $dd->attrs->{id}, "\n" if $skip++;
}
like image 182
Grrrr Avatar answered Oct 02 '22 12:10

Grrrr