How can I extract data from HTML tables in Perl?

Question

I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:

<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>

From this line I want to extract "Time Played", "Artist", "Title", and "Label", and print them to an output file.

I've tried many regular expressions such as:

$lines =~ / (<td>) /
       OR
$lines =~ / <td>(.*)< /
       OR
$lines =~ / >(.*)< /

My current program looks like this:

#!perl -w

open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;

open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;

my $lines = join '', <INPUT_FILE>;

print "Hello 2\n";

if ($lines =~ / (\S.*\S) /) {
    print "this is 1: \n";
    print $1;
    if ($lines =~ / <td>(.*)< / ) {
        print "this is the 2nd 1: \n";
        print $1;
        print "the word was: $1.\n";
        $Time = $1;
        print $Time;
        print OUTPUT_FILE $Time;
    } else {
        print "2ND IF FAILED\n";
    }
} else {
    print "THIS FAILED\n";
}

close(INPUT_FILE);
close(OUTPUT_FILE);

Ether · Accepted Answer

Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.

Related questions:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?

Relevant CPAN modules:
HTML::Parser
HTML::TreeBuilder
HTML::TableExtract
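
To make the problem concrete, here is a small sketch of what the patterns from the question do when run against the posted header row. The `$linked` string is a made-up cell, added only to show how a link inside a cell defeats even a non-greedy pattern; it is an assumption and not part of the posted data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The header row from the question, verbatim.
my $line = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';

# Without the /x modifier, spaces inside a pattern are literal characters.
# The row never contains " <td>", so this pattern cannot match.
my $spaced_matches = ( $line =~ / <td>(.*)< / ) ? 1 : 0;

# With the spaces removed, the greedy .* runs on to the last "<" in the
# line, so the capture swallows every cell after the first <td>.
my ($greedy) = $line =~ /<td>(.*)</;

# A non-greedy pattern stops at the first "<", which is the nested tag
# as soon as a cell contains markup (hypothetical cell, for illustration).
my $linked = '<td><a href="#">Time Played</a></td>';
my ($lazy) = $linked =~ /<td>(.*?)</;

print "pattern with spaces matched: ", ( $spaced_matches ? "yes" : "no" ), "\n";
print "greedy capture: $greedy\n";
print "non-greedy capture on a linked cell: '$lazy'\n";
```

Each tweak trades one failure for another, which is the usual experience when matching markup with regular expressions.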

Sinan Ünür · Answer

Use HTML::TableExtract. Really.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

# Download the page once and keep a local copy.
my $file = 'Table3.htm';
unless ( -e $file ) {
    my $rc = getstore(
        'http://www.ntsb.gov/aviation/Table3.htm',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}

# Match the table by its column headers and its id attribute.
my @headers = qw( Year Fatalities );

my $te = HTML::TableExtract->new(
    headers => \@headers,
    attribs => { id => 'myTable' },
);

$te->parse_file($file);

# Take the first table that satisfied the constraints.
my ($table) = $te->tables;
die "No matching table found in $file\n" unless $table;

print join("\t", @headers), "\n";

for my $row ( $table->rows ) {
    print join("\t", @$row), "\n";
}

This is what I meant in another post by "task-specific" HTML parsers.

You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.
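
Applied to the header row from the question, a minimal sketch might look like the following. It assumes the row sits inside a `<table>` element in the real file (the wrapper is added by hand here, since HTML::TableExtract only looks inside tables), and `header_cells` is a hypothetical helper name, not part of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TableExtract;

# The header row from the question, wrapped in <table> tags
# (the wrapper is an assumption about the surrounding page).
my $html = <<'HTML';
<table>
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
</table>
HTML

# Hypothetical helper: returns the non-empty cells of the first row
# of the first table found in the given markup.
sub header_cells {
    my ($markup) = @_;

    my $te = HTML::TableExtract->new;    # no constraints: match every table
    $te->parse($markup);
    $te->eof;

    my ($table) = $te->tables;
    return unless $table;

    my ($first_row) = $table->rows;
    return unless $first_row;

    # Skip cells that came back undefined or blank (the empty <td></td> pairs).
    return grep { defined($_) && /\S/ } @$first_row;
}

print join( "\t", header_cells($html) ), "\n";
```

To append the result to PLAYLIST_TABLE.txt as in the question, print to a filehandle opened in '>>' mode instead of standard output.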

Tags: html, parsing, perl

nick
