Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Spreadsheet::ParseExcel in Perl, but need help

I have a Perl program using Spreadsheet::ParseExcel. However, there are two difficulties that have arisen that I have been unable to figure out how to solve. The script for the program is as follows:

#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use WordNet::Similarity::lesk;
use WordNet::QueryData;

my $wn = WordNet::QueryData->new();
my $lesk = WordNet::Similarity::lesk->new($wn);
my $parser = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse ( 'input.xls' );

if ( !defined $workbook ) {
   die $parser->error(), ".\n";
}

WORKSHEET:
for my $worksheet ( $workbook->worksheets() ) {

    my $sheetname = $worksheet->get_name();
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    my $target_col;
    my $response_col;

# Skip worksheet if it doesn't contain data
    if ( $row_min > $row_max ) {
       warn "\tWorksheet $sheetname doesn't contain data. \n";
       next WORKSHEET;
    }

# Check for column headers
    COLUMN:
    for my $col ( $col_min .. $col_max ) {

        my $cell = $worksheet->get_cell( $row_min, $col );
        next COLUMN unless $cell;

        $target_col   = $col if $cell->value() eq 'Target';
        $response_col = $col if $cell->value() eq 'Response';
    }

    if ( defined $target_col && defined $response_col ) {

        ROW:
        for my $row ( $row_min + 1 .. $row_max ) {
            my $target_cell   = $worksheet->get_cell( $row, $target_col);
            my $response_cell = $worksheet->get_cell( $row, $response_col);
            if ( defined $target_cell && defined $response_cell ) {
                my $target   = $target_cell->value();
                my $response = $response_cell->value();

                my $value    = $lesk->getRelatedness( $target, $response );

                print "Worksheet   = $sheetname\n";
                print "Row         = $row\n";
                print "Target      = $target\n";
                print "Response    = $response\n";
                print "Relatedness = $value\n";                

            }
            else {

                warn "\tWroksheet $sheetname, Row = $row doesn't contain target and response data.\n";
                next ROW;
            }
        }    
    }
    else {

        warn "\tWorksheet $sheetname: Didn't find Target and Response headings.\n";
        next WORKSHEET;
    }  
}

So, my two problems:

First of all, sometimes the program returns the error "No Excel data found in file," even though the data is there. Each Excel file is formatted the same way. There is only one sheet, with the A and B columns labelled 'Target' and 'Response,' respectively, with a list of words beneath them. However, it does not ALWAYS return this error. It works for one Excel file, but it does not work for a different one, even though both are formatted the exact same way (and yes, they are both the same file type, as well). I cannot find any reason for it to not read the second file, because it is identical to the first. The only difference is that the second file was created using an Excel macro; however, why would that matter? The file types and format are exactly the same.

Second, the variables '$target' and '$response' need to be formatted as strings in order for the 'my $value' expression to work. How do I convert them into string format? The value assigned to each variable is a word from the appropriate cell of the Excel spreadsheet. I don't know what format that is (and there is no apparent way in Perl for me to check).

Any suggestions?

like image 998
Billy Avatar asked Jul 27 '11 15:07

Billy


2 Answers

In relation to your first question, the "no data found" error indicates some problem with the file format. I've seen this error with pseudo-Excel files such as Html or CSV files that have an xls extension. I've also seen this error with mal-formed files generated by third party apps.

You could do an initial verification of the files by doing a hexdump/xxd dump of a working and non working file and seeing if the overall structure is approximately the same (for example if it has similar magic numbers at the start and isn't Html).

It could also be an issue with Spreadsheet::ParseExcel. I am the maintainer of that module. If you like you could send me on a "good" and "bad" file, at the email address in the docs, and I will have a look at them.

like image 163
jmcnamara Avatar answered Oct 12 '22 09:10

jmcnamara


First of all, if you are getting "no data found" you can thank proprietary Excel data file formats and the inability of even a good Perl library to extract information from them.

I strongly suggest that you export the Excel data in something easily parsed like CSV especially given the simple nature of the data layout you described. There may be a way to get Excel to process a batch but I have no idea. A quick search yielded a tool to use OpenOffice to do batch conversion.

The rest of your question is rather moot once you accept that Excel data files will not play nicely.

like image 26
msw Avatar answered Oct 12 '22 09:10

msw