
What is a good CPAN parser for HTML MS Excel files?

I know that regular (binary) Excel files can be processed via Spreadsheet::ParseExcel.

However, I have a file that is HTML formatted:

<html xmlns:x="urn:schemas-microsoft-com:office:excel">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=windows-1252">
<!--[if gte mso 9]>
<xml>
<x:ExcelWorkbook>
<x:ExcelWorksheets>
<x:ExcelWorksheet>
<x:Name>Holdings</x:Name>
<x:WorksheetOptions>

Short of manually parsing it as generic HTML (e.g. with HTML::TreeBuilder), is there a CPAN module that would parse such a file and let me access it as a spreadsheet, similar to Spreadsheet::ParseExcel?

Here's how Spreadsheet::ParseExcel fails on this file:

#!/usr/local/bin/perl
use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $parser = Spreadsheet::ParseExcel->new();
my $file   = 'file1.xls';
my $workbook;
eval { $workbook = $parser->Parse($file); };
# $workbook is undef here
DVK asked Dec 18 '25 18:12


1 Answer

I use an XPath parser to extract what I need from files like this, iterating over the ./Cell/Data nodes inside each //Row node. That doesn't give you the same interface as Spreadsheet::ParseExcel, though.

I also find that you need to do some source filtering before you can use the XML parser. At a minimum you have to run

s/<xml version>/<!-- xml version -->/;
s/&/&amp;/g;

on the input.
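One caveat with a blanket s/&/&amp;/g is that it also rewrites ampersands that already begin an entity reference, so an existing &amp; becomes &amp;amp;. Below is a minimal sketch of a more conservative filter, assuming you only want to escape bare ampersands. The helper name clean_line is hypothetical and not part of any module, and the sample input is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the two substitutions to one line of input,
# but leave ampersands that already begin an entity reference untouched.
sub clean_line {
    my ($line) = @_;
    $line =~ s/<xml version>/<!-- xml version -->/;
    # Escape "&" unless it starts a named (&amp;), decimal (&#38;)
    # or hexadecimal (&#x26;) reference.
    $line =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $line;
}

# Made-up sample line: one bare ampersand, one already escaped.
print clean_line("<Data>AT&T &amp; Co</Data>\n");
# prints: <Data>AT&amp;T &amp; Co</Data>
```

The negative lookahead is the only change from the plain substitution. Whether it is worth the extra complexity depends on whether your files already contain escaped entities.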


Here's a concise but complete solution, extracting a file like this to a 2-D array:

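use strict;
use warnings;
use XML::XPath;

# Assumption: input and output paths are passed on the command line.
my ($dirty_file_name, $clean_file_name) = @ARGV;

open my $in,  '<', $dirty_file_name or die "Can't read $dirty_file_name: $!";
open my $out, '>', $clean_file_name or die "Can't write $clean_file_name: $!";
while (<$in>) {
    s/<xml version>/<!-- xml version -->/;   # hide this tag from the XML parser
    s/&/&amp;/g;                             # escape ampersands
    print {$out} $_;
}
close $out or die "Can't close $clean_file_name: $!";
close $in;

# One array ref per Row, holding the string value of each Cell/Data node.
my @table = map {
    [ map { $_->string_value } $_->find('./Cell/Data')->get_nodelist ]
} XML::XPath->new( filename => $clean_file_name )->find('//Row')->get_nodelist;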
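Since the result is a plain array of array refs, a lookup by row and column can be layered on top of it to get a little closer to a spreadsheet-style interface. This is a rough sketch only: the sample data stands in for a parsed file and is made up, and cell() is a hypothetical helper, not something XML::XPath or Spreadsheet::ParseExcel provides.

```perl
use strict;
use warnings;

# Made-up sample data standing in for a @table built from ./Cell/Data nodes.
my @table = (
    [ 'Ticker', 'Shares' ],
    [ 'ABC',    '100'    ],
    [ 'XYZ'              ],   # rows can be ragged if a Row has fewer Cells
);

# Hypothetical helper: zero-based lookup that returns undef for a missing
# cell instead of autovivifying a row or walking off the end of one.
sub cell {
    my ($table, $row, $col) = @_;
    return undef if $row < 0 || $row > $#{$table};
    my $cells = $table->[$row];
    return undef if $col < 0 || $col > $#{$cells};
    return $cells->[$col];
}

# Print the first two columns of every row, tab-separated,
# with an empty string where a cell is missing.
for my $r (0 .. $#table) {
    print join("\t", map { cell(\@table, $r, $_) // '' } 0 .. 1), "\n";
}
```

The bounds checks matter because a direct $table[$r][$c] on a missing row would silently create that row.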
mob answered Dec 20 '25 12:12


