
Printing table contents using HTML::TreeBuilder::XPath

I want to extract all the tables from an HTML file and print their contents as follows: cells separated by \t, rows separated by \n, and tables separated by \n\n. My script is below. When I changed it to call findvalues on tr, each whole row came back as a single element. I also tried other methods such as findnodes_as_strings($path). How can I modify the script to produce the structure described above?

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree= HTML::TreeBuilder::XPath->new;
$tree->parse_file( "html.html");

my @values=$tree->findvalues(q{//table//tr//td});

print $_, "\n" foreach(@values);
Nishanth Lawrence Reginold asked May 09 '26 09:05



1 Answer

You need to process each table separately, and each row within it separately as well:

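# One pass per table, so the tables can be separated by a blank line
foreach my $table ( $tree->findnodes('//table') ) {

    # Rows are looked up relative to the current table
    foreach my $row ( $table->findnodes('.//tr') ) {

        # Text of each td in this row, joined with tabs
        my @cells = $row->findvalues('.//td');
        print join("\t", @cells), "\n";
    }
    # The extra newline gives \n\n between tables
    print "\n";
}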

Of course, this solution only works for simple tables. It does not account for colspan or rowspan attributes, th header cells, tables nested inside other tables, and so on.
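
As a rough sketch of how two of those cases (th cells and nested tables) might be handled, the variant below takes only the direct td and th children of each row and skips rows whose nearest enclosing table is not the one being processed. The helper name tables_as_text and the whitespace normalisation are assumptions added for illustration, not part of the original answer, and colspan/rowspan are still ignored.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use Scalar::Util qw(refaddr);

# Hypothetical helper: returns the text of every table in $tree,
# cells joined by \t, each row ended by \n, each table ended by an extra \n.
sub tables_as_text {
    my ($tree) = @_;
    my $out = '';

    foreach my $table ( $tree->findnodes('//table') ) {
        foreach my $row ( $table->findnodes('.//tr') ) {

            # Skip rows that belong to a table nested inside this one:
            # the nearest table ancestor must be the current table.
            my $owner = $row->look_up( _tag => 'table' );
            next unless $owner && refaddr($owner) == refaddr($table);

            # Direct td and th children only (text children are plain strings).
            my @cells = grep { ref $_ && $_->tag =~ /^t[dh]$/ } $row->content_list;

            # Collapse whitespace so stray tabs or newlines inside a cell
            # cannot break the output format (an assumption, adjust as needed).
            my @text = map {
                my $t = $_->as_text;
                $t =~ s/\s+/ /g;
                $t =~ s/^ | $//g;
                $t;
            } @cells;

            $out .= join( "\t", @text ) . "\n";
        }
        $out .= "\n";
    }
    return $out;
}

# Usage: perl script.pl html.html
if (@ARGV) {
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file( $ARGV[0] ) or die "Cannot parse $ARGV[0]: $!";
    print tables_as_text($tree);
    $tree->delete;
}
```

Note that a cell containing a nested table still has that table's text folded into its own as_text value, and the nested table is also printed separately because //table matches it. Whether that is desirable depends on the input.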

gangabass answered May 12 '26 01:05



