
How to make the output from Text::CSV utf8?

I have a CSV file, say win.csv, whose text is encoded in windows-1252. First I use iconv to convert it to UTF-8.

$ iconv -o test.csv -f windows-1252 -t utf-8 win.csv

Then I read the converted CSV file with the following Perl script (utfcsv.pl).

#!/usr/bin/perl 
use utf8;
use Text::CSV;
use Encode::Detect::Detector;

my $csv = Text::CSV->new({ binary => 1, sep_char => ';',});
open my $fh, "<encoding(utf8)", "test.csv";

while (my $row = $csv->getline($fh)) { 
  my $line = join " ", @$row;
  my $enc = Encode::Detect::Detector::detect($line);
  print "($enc) $line\n";
}

$csv->eof || $csv->error_diag();
close $fh;
$csv->eol("\r\n");
exit;

Then the output is like the following.

(UTF-8) .........
() .....

That is, the encoding of every line is detected as UTF-8 (or ASCII). But the actual output does not seem to be UTF-8. In fact, if I save the output to a file

$ ./utfcsv.pl > output.txt

then the encoding of output.txt is detected as windows-1252.

Question: How can I get the output text in UTF-8?

Notes:

  1. Environment: openSUSE 13.2 x86_64, perl 5.20.1
  2. I do not use Text::CSV::Encoded because its installation fails. (Besides, test.csv has already been converted to UTF-8, so it would be strange to need Text::CSV::Encoded.)
  3. I use the following script to check the encoding. (I also use it to find out the encoding of the initial CSV file win.csv.)

#!/usr/bin/perl 
use Encode::Detect::Detector;
open my $in, "<", $ARGV[0] or die "open failed: $!";
while (my $line = <$in>) {
  my $enc = Encode::Detect::Detector::detect($line);
  chomp $enc;
  if ($enc) {
    print "$enc\n";
  }
}
H. Shindoh asked Dec 11 '22 23:12


1 Answer

You have set the encoding of the input file handle (which, by the way, should be <:encoding(utf8), note the colon), but you haven't specified the encoding of the output channel, so Perl will send unencoded character values to the output.

The Unicode values for characters that fit in a single byte, Basic Latin (ASCII) between 0 and 0x7F and Latin-1 Supplement between 0x80 and 0xFF, are very similar to Windows code page 1252 (the two differ only in the range 0x80 to 0x9F). In particular, a small letter u with a diaeresis is 0xFC in both Unicode and CP1252, so the text will look like CP1252 if it is output unencoded, instead of as the two-byte sequence 0xC3 0xBC, which is the same code point encoded in UTF-8.
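To make the byte values concrete, here is a minimal sketch that encodes U+00FC three ways with the core Encode module and prints the resulting bytes in hex. The variable names are only for illustration and are not part of the asker's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# U+00FC, LATIN SMALL LETTER U WITH DIAERESIS
my $char = "\x{FC}";

# Encodings to compare (names as understood by Encode)
my @names = ('iso-8859-1', 'cp1252', 'UTF-8');

# Encode the same character in each encoding and record its bytes as hex
my %hex;
for my $name (@names) {
    my $bytes = encode($name, $char);
    $hex{$name} = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

printf "%-10s %s\n", $_, $hex{$_} for @names;
```

The single-byte encodings both give FC, while UTF-8 gives C3 BC, which is why unencoded output is mistaken for windows-1252.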

If you use binmode on STDOUT to set the encoding then the data will be output correctly, but it is simplest to use the open pragma, like this:

use open qw/ :std :encoding(utf-8) /;

which will set the encoding for STDIN, STDOUT and STDERR, as well as any newly-opened file handles. That means you don't have to specify it when you open the CSV file.

Note that I have also added use strict and use warnings, which are essential in any Perl program. I have also used autodie to remove the need for checks on the status of all IO operations, and I have taken advantage of the way Perl interpolates arrays inside double quotes, putting a space between the elements, which avoids the need for a join call. The complete program looks like this:

#!/usr/bin/perl

use utf8;
use strict;
use warnings 'all';
use open qw/ :std :encoding(utf-8) /;
use autodie;

use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, sep_char => ';' });

open my $fh, '<', 'test.csv';

while ( my $row = $csv->getline($fh) ) {
    print "@$row\n";
}

close $fh;
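For comparison, the binmode alternative mentioned above can be sketched as follows. To make the effect visible, the sketch writes to an in-memory file handle with and without the layer and reports the bytes produced; bytes_written is a hypothetical helper for illustration only. In the asker's script, the single line binmode STDOUT, ':encoding(UTF-8)'; before the loop should be enough.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "\x{FC}\n";    # U+00FC (u with diaeresis) plus a newline

# Hypothetical helper: print $text to an in-memory handle, optionally
# applying an I/O layer with binmode, and return the bytes written as hex
sub bytes_written {
    my ($layer) = @_;
    my $buffer = '';
    open my $out, '>', \$buffer or die "open failed: $!";
    binmode $out, $layer if $layer;    # the same call works on STDOUT
    print {$out} $text;
    close $out;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $buffer;
}

my $without = bytes_written('');
my $with    = bytes_written(':encoding(UTF-8)');

# Apply the same layer to STDOUT, as the real script would
binmode STDOUT, ':encoding(UTF-8)';
print "no layer:    $without\n";
print "UTF-8 layer: $with\n";
```

Unlike the open pragma, binmode affects only the one handle it is called on, so any other output handles would need their own call.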
Borodin answered Jan 04 '23 09:01
