I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.
I've tried many regular expressions such as:
$lines =~ / (<td>) /
OR
$lines =~ / <td>(.*)< /
OR
$lines =~ / >(.*)< /
My current program looks like so:
#!perl -w
open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;
open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;
my $lines = join '', <INPUT_FILE>;
print "Hello 2\n";
if ($lines =~ / (\S.*\S) /) {
print "this is 1: \n";
print $1;
if ($lines =~ / <td>(.*)< / ) {
print "this is the 2nd 1: \n";
print $1;
print "the word was: $1.\n";
$Time = $1;
print $Time;
print OUTPUT_FILE $Time;
} else {
print "2ND IF FAILED\n";
}
} else {
print "THIS FAILED\n";
}
close(INPUT_FILE);
close(OUTPUT_FILE);
Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.
Use HTML::TableExtract. Really.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
my $file = 'Table3.htm';
unless ( -e $file ) {
my $rc = getstore(
'http://www.ntsb.gov/aviation/Table3.htm',
$file);
die "Failed to download document\n" unless $rc == 200;
}
my @headers = qw( Year Fatalities );
my $te = HTML::TableExtract->new(
headers => \@headers,
attribs => { id => 'myTable' },
);
$te->parse_file($file);
my ($table) = $te->tables;
print join("\t", @headers), "\n";
for my $row ($te->rows ) {
print join("\t", @$row), "\n";
}
This is what I meant in another post by "task-specific" HTML parsers.
You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With