I'm trying to parse the following HTML structure in Perl. I need to select all of the dd elements that have the class message and also an id. All I'd like the script to do is loop through those dd elements and print each one's id, but it needs to ignore the first dd element, as that is static and will not change.
Any Perl module is fine as long as it can be installed from CPAN. I don't have much experience with Perl or with parsing HTML, so any pointers would be very helpful.
Thanks :)
HTML Structure:
<pre><code>
<html>
<head>
</head>
<body>
.....other elements
<div id="messages">
<div class="header"></div>
<dl>
<dd class="message unread mc-friend mc-message">This is just a random message, do not parse</dd>
<dd id="msg2" class="message unread mc-message">
Hello
</dd>
<dd id="msg3" class="message unread mc-message">
Hello
</dd>
</dl>
</div>
</body>
</html>
</code></pre>
Something like this, quick and easy:
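#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $html = "Your HTML goes here";
my $dom  = Mojo::DOM->new;
$dom->parse($html);

# Skip the first matching dd (the static one), then print the id of each remaining one.
# Note: current Mojolicious releases use attr('id'); older ones spelled it attrs->{id}.
my $skip;
for my $dd ($dom->find('dd[class*="message"]')->each) {
    print $dd->attr('id'), "\n" if $skip++;
}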
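As a possible variation, the requirement from the question (class message plus an id) can be written directly as a CSS selector, so no skip counter is needed: the static first dd has no id and simply does not match. This is only a sketch. The sample markup is a trimmed, well-formed version of the structure in the question, and it assumes a reasonably recent Mojolicious installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup modelled on the question (trimmed; assumed well-formed).
my $html = <<'HTML';
<div id="messages">
  <dl>
    <dd class="message unread mc-friend mc-message">Static message</dd>
    <dd id="msg2" class="message unread mc-message">Hello</dd>
    <dd id="msg3" class="message unread mc-message">Hello</dd>
  </dl>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# "dd.message[id]" matches dd elements that carry the class "message"
# and have an id attribute; the first dd has no id, so it is left out.
my @ids = $dom->find('dd.message[id]')->map(sub { $_->attr('id') })->each;

print "$_\n" for @ids;
```

Unlike `dd[class*="message"]`, which matches any class attribute containing the substring "message" (including mc-message on its own), `dd.message` only matches the exact class name.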