I have a CSV file, say win.csv, whose text is encoded in windows-1252. First I use iconv to convert it to UTF-8.
$ iconv -o test.csv -f windows-1252 -t utf-8 win.csv
Then I read the converted CSV file with the following Perl script (utfcsv.pl).
#!/usr/bin/perl
use utf8;
use Text::CSV;
use Encode::Detect::Detector;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';',});
open my $fh, "<encoding(utf8)", "test.csv";
while (my $row = $csv->getline($fh)) {
my $line = join " ", @$row;
my $enc = Encode::Detect::Detector::detect($line);
print "($enc) $line\n";
}
$csv->eof || $csv->error_diag();
close $fh;
$csv->eol("\r\n");
exit;
The output then looks like the following.
(UTF-8) .........
() .....
That is, the encoding of every line is detected as UTF-8 (or ASCII). But the actual output does not seem to be UTF-8. In fact, if I save the output to a file
$ ./utfcsv.pl > output.txt
then the encoding of output.txt is detected as windows-1252.
Question: How can I get the output text in UTF-8?
Notes: the encoding of a file can be checked line by line with the following script.
#!/usr/bin/perl
use Encode::Detect::Detector;
open my $in, "<", $ARGV[0] or die "open failed: $!";
while (my $line = <$in>) {
my $enc = Encode::Detect::Detector::detect($line);
chomp $enc;
if ($enc) {
print "$enc\n";
}
}
You have set the encoding of the input file handle (which, by the way, should be <:encoding(utf8); note the colon), but you haven't specified the encoding of the output channel, so Perl will send unencoded character values to the output.
The Unicode code points that fit in a single byte, Basic Latin (ASCII) from 0 to 0x7F and Latin-1 Supplement from 0x80 to 0xFF, largely coincide with Windows code page 1252 (the two differ only in the range 0x80 to 0x9F). In particular, a small letter u with a diaeresis (ü) is 0xFC in both Unicode and CP1252, so text that is output unencoded will look like CP1252, rather than containing the two-byte sequence 0xC3 0xBC, which is the same code point encoded in UTF-8.
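To make the byte values concrete, here is a minimal sketch that encodes the same character both ways with the core Encode module. The helper name hex_bytes is hypothetical and only there for display; it is not part of the original scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, as a Perl character string
my $char = "\x{FC}";

# hex_bytes is a hypothetical helper: render a byte string as hex pairs
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Encode the same character as CP1252 and as UTF-8
my $as_cp1252 = hex_bytes( encode( 'cp1252', $char ) );
my $as_utf8   = hex_bytes( encode( 'UTF-8',  $char ) );

print "CP1252: $as_cp1252\n";    # one byte
print "UTF-8:  $as_utf8\n";      # two bytes
```

The CP1252 result is the single byte FC, which is also what Perl writes for this character when no output encoding is set. The UTF-8 result is the pair C3 BC.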
If you use binmode on STDOUT to set the encoding then the data will be output correctly, but it is simplest to use the open pragma like this
use open qw/ :std :encoding(utf-8) /;
which will set the encoding for STDIN, STDOUT and STDERR, as well as any newly-opened file handles. That means you don't have to specify it when you open the CSV file, and your code will look like this
Note that I have also added use strict and use warnings, which are essential in any Perl program. I have also used autodie to remove the need for checks on the status of all IO operations, and I have taken advantage of the way Perl interpolates arrays inside double quotes, putting a space between the elements, which avoids the need for a join call.
#!/usr/bin/perl
use utf8;
use strict;
use warnings 'all';
use open qw/ :std :encoding(utf-8) /;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';' });
open my $fh, '<', 'test.csv';
while ( my $row = $csv->getline($fh) ) {
print "@$row\n";
}
close $fh;
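For comparison, the binmode alternative mentioned earlier can be sketched as follows. This is an illustration rather than part of the script above: it writes to an in-memory file handle instead of STDOUT so that the resulting bytes can be inspected, and written_bytes is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# written_bytes is a hypothetical helper: print one string to an in-memory
# file handle, optionally applying a layer with binmode first, and return
# the bytes that were actually written, as hex pairs.
sub written_bytes {
    my ($text, $layer) = @_;
    my $buffer = '';
    open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";
    binmode $out, $layer if defined $layer;
    print {$out} $text;
    close $out or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

# The same character, written without and with an encoding layer
my $raw     = written_bytes("\x{FC}");
my $encoded = written_bytes("\x{FC}", ':encoding(UTF-8)');

print "no layer:    $raw\n";
print "UTF-8 layer: $encoded\n";
```

For the real script, the corresponding change would be a single binmode STDOUT, ':encoding(UTF-8)'; line before the loop, in place of the open pragma.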