I know that regular (binary) Excel files can be processed via Spreadsheet::ParseExcel.
However, I have a file that is HTML formatted:
<html xmlns:x="urn:schemas-microsoft-com:office:excel">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=windows-1252">
<!--[if gte mso 9]>
<xml>
<x:ExcelWorkbook>
<x:ExcelWorksheets>
<x:ExcelWorksheet>
<x:Name>Holdings</x:Name>
<x:WorksheetOptions>
Short of manually parsing it as a generic HTML file (e.g. with TreeBuilder), is there a CPAN module that would parse such a file and let me access it as a spreadsheet, similar to Spreadsheet::ParseExcel?
Here's where the module doesn't work:
#!/usr/local/bin/perl
use strict; use warnings;
use Spreadsheet::ParseExcel;
my $parser = Spreadsheet::ParseExcel->new();
my $file = 'file1.xls';
my $workbook;
eval { $workbook = $parser->Parse($file); };
# ($workbook returned here is undef)
I use an XPath parser to extract what I need from files like this, iterating on ./Cell/Data nodes inside of the //Row nodes, but that's not using the same interface as Spreadsheet::ParseExcel.
I also find that you need to do some source filtering before you can use the XML parser. At a minimum you have to run
s/<xml version>/<!-- xml version -->/;
s/&/&amp;/g;
on the input.
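One caveat with escaping every ampersand: input that already contains entity references such as `&amp;` or `&#38;` would be escaped a second time. A minimal sketch of a more careful substitution follows, using a negative lookahead; the helper name `escape_bare_ampersands` is hypothetical and not part of any module mentioned here.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original answer):
# escape only ampersands that do not already begin an entity reference
# such as &amp;, &#38; or &#x26;.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

# Bare ampersands are escaped; existing entities are left untouched.
print escape_bare_ampersands('AT&T &amp; Co &#38; R&D'), "\n";
# prints: AT&amp;T &amp; Co &#38; R&amp;D
```

Whether this matters depends on the file: if the export never emits entities, the plain substitution is enough.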
Here's a concise but complete solution, extracting a file like this to a 2-D array:
use XML::XPath;
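# Write a cleaned copy of the input that the XML parser will accept.
open my $in,  '<', $dirty_file_name or die "Cannot read $dirty_file_name: $!";
open my $out, '>', $clean_file_name or die "Cannot write $clean_file_name: $!";
while (<$in>) {
    s/<xml version>/<!-- xml version -->/;
    s/&/&amp;/g;    # escape ampersands so the XML parser does not choke on them
    print {$out} $_;
}
close $out or die "Cannot close $clean_file_name: $!";
close $in;
# Each //Row becomes an array ref holding the text of its ./Cell/Data nodes.
my @table = map {
    [ map { $_->string_value } $_->find('./Cell/Data')->get_nodelist ]
} XML::XPath->new( filename => $clean_file_name )->find('//Row')->get_nodelist;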
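To try the extraction without any files on disk, here is a small self-contained sketch that feeds XML::XPath an in-memory string through its `xml => ...` constructor option. The sample markup is made up for illustration (no namespaces, two rows), and the `cell` accessor is a hypothetical convenience, not something Spreadsheet::ParseExcel or XML::XPath provides.

```perl
use strict;
use warnings;
use XML::XPath;

# Made-up sample data standing in for a cleaned file (an assumption).
my $xml = <<'XML';
<Workbook>
  <Worksheet>
    <Table>
      <Row><Cell><Data>Ticker</Data></Cell><Cell><Data>Shares</Data></Cell></Row>
      <Row><Cell><Data>ABC</Data></Cell><Cell><Data>100</Data></Cell></Row>
    </Table>
  </Worksheet>
</Workbook>
XML

# Same row/cell extraction as above, but parsing from a string.
my $xp    = XML::XPath->new( xml => $xml );
my @table = map {
    [ map { $_->string_value } $_->find('./Cell/Data')->get_nodelist ]
} $xp->find('//Row')->get_nodelist;

# Hypothetical accessor: zero-based row and column lookup,
# returning undef when the row or cell does not exist.
sub cell {
    my ( $table, $row, $col ) = @_;
    my $cells = $table->[$row] or return;
    return $cells->[$col];
}

print cell( \@table, 1, 0 ), "\n";
```

This gives row/column access over a plain array of arrays; it does not reproduce the Spreadsheet::ParseExcel worksheet and cell objects.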