Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Perl to read French characters from an Excel spreadsheet

Tags:

excel

perl

I am using Spreadsheet::ParseExcel to parse an Excel spreadsheet file as follows

my $FileName = "../excel.xls";
my $parser   = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse($FileName);

And reading values from the cells like this

$product = $worksheeto->get_cell( $row, 0 )->value();

The problem is that, when there is a French character, for instance à, it shows ò

To be sure that the there is no error in the parsing I used

print unpack('H*', $product) . "\n";

So when I use any online hex to string converter I do get the à.

I also tried

use utf8;
binmode(STDOUT, ":utf8");

but instead of à I get

Is there a way to get the correct characters?

like image 360
Mugiwara Avatar asked Oct 01 '13 12:10

Mugiwara


2 Answers

Try parsing the file with a formatter, for example the Spreadsheet::ParseExcel::FmtUnicode:

use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::FmtUnicode;
#use Spreadsheet::ParseExcel::FmtJapan;

my $FileName = '../excel.xls';
my $parser   = Spreadsheet::ParseExcel->new();             
my $formatter = Spreadsheet::ParseExcel::FmtUnicode->new();
my $workbook = $parser->parse($FileName,$formatter);

Try also the FmtJapan, since the documentation says: The Spreadsheet::ParseExcel::FmtJapan formatter also supports Unicode. If you encounter any encoding problems with the default formatter try that instead.

*UPDATE: I tried it by myself in a xls file with Greek characters but it didn't worked neither with FmtUnicode or FmtJapan . I then found this perlmonks post, used the provided My::Excel::FmtUTF8 module and worked successfully when printing the values of a cell with $cell->value().

like image 81
psxls Avatar answered Oct 21 '22 10:10

psxls


I have tried what you describe and this works correctly here, once I enable the utf-8 output. I would guess you either have a weird excel file (you should post an example somewhere), or that your terminal is badly configured.

Dealing with character set issues is hard, because your terminal can me confusing you. So it is always a good idea to pipe the output into 'od -c' to see what you are getting. In my script I get this text from a spreadsheet I had lying around:

Value       = Descripción

And when I pipe it through od:

0000000   V   a   l   u   e                               =       D   e
0000020   s   c   r   i   p   c   i 303 263   n  \n

I can see that the ó is two bytes long, which suggests is UTF-8. To make sure, you can ask iconv to convert from the expected output charset to whatever you are using in your terminal:

iconv -f utf-8

If the input is not proper utf-8 it will bark at you and/or output even weirder garbage.

like image 29
Martina Ferrari Avatar answered Oct 21 '22 11:10

Martina Ferrari