How do I read UTF-8 with diamond operator (<>)?

Question

I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}.

So my script should be callable in these two ways, as usual, giving the same output:

./script.pl utf8.txt
cat utf8.txt | ./script.pl

But the outputs differ! Only the second call (using cat) seems to work as designed, reading UTF-8 properly. Here is the script:

#!/usr/bin/perl -w

binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';

while(<>){
    my @chars = split //, $_;
    print "$_\n" foreach(@chars);
}

How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <> for reading, if possible.

EDIT:

I realized I should probably describe the different outputs. My input file contains this sequence: a\xCA\xA7b. The method with cat correctly outputs:

a
\xCA\xA7
b

But the other method gives me this:

a
\xC3\x8A
\xC2\xA7
b
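
The second output is what double encoding looks like: the two bytes \xCA\xA7 are read as two separate characters (U+00CA and U+00A7), and the :utf8 layer on STDOUT then encodes each of them as two bytes. The following is a minimal sketch of that effect, assuming the core Encode module; hex_escape is a hypothetical helper introduced only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two bytes that follow "a" in the input file.
my $bytes = "\xCA\xA7";

# Hypothetical helper: render a byte string as \xNN escapes.
sub hex_escape {
    my ($octets) = @_;
    return join '', map { sprintf('\\x%02X', ord($_)) } split //, $octets;
}

# No input layer: each byte is treated as its own character, and
# encoding each character as UTF-8 on output produces two bytes apiece.
my $double = join ' ', map { hex_escape(encode('UTF-8', $_)) } split //, $bytes;

# Decoded input: the two bytes form one character (U+02A7), which
# encodes back to the original two bytes.
my $char   = decode('UTF-8', $bytes);
my $single = hex_escape(encode('UTF-8', $char));

print "undecoded input: $double\n";
print "decoded input:   $single\n";
```

The first line printed shows \xC3\x8A \xC2\xA7, matching the output of the file-argument call, and the second shows \xCA\xA7, matching the cat call.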

Emmanuel Rodriguez · Accepted Answer

Try using the open pragma instead:

use strict;
use warnings;
use open qw(:std :utf8);

while(<>){
    my @chars = split //, $_;
    print "$_\n" foreach(@chars);
}

You need to do this because the <> operator is magical. As you know, it reads from STDIN or from the files named in @ARGV. Reading from STDIN causes no problem: STDIN is already open when the script starts, so binmode works on it. The problem arises when reading from the files in @ARGV. When your script starts and calls binmode, those files are not open yet. The call sets STDIN to UTF-8, but that handle is not used when @ARGV contains files. Instead, the <> operator opens a new file handle for each file in @ARGV, and each newly opened handle starts with the default layers, so it does not carry the UTF-8 setting. The open pragma changes those defaults: :utf8 makes every handle opened in its scope decode UTF-8, and :std applies the same layer to STDIN, STDOUT and STDERR.
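
To check this behaviour without a separate input file, here is a self-contained sketch. It assumes the core File::Temp module and uses a temporary file as a stand-in for utf8.txt; setting @ARGV by hand simulates the call ./script.pl utf8.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :utf8);   # default layers for new handles, plus STDIN/STDOUT/STDERR
use File::Temp qw(tempfile);

# Write the sample bytes a \xCA\xA7 b to a temporary file (stand-in for utf8.txt).
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':raw';       # write the bytes untouched
print {$fh} "a\xCA\xA7b\n";
close $fh or die "close failed: $!";

# Simulate "./script.pl utf8.txt": <> opens each name in @ARGV itself.
@ARGV = ($path);

my @chars;
while (my $line = <>) {
    chomp $line;
    push @chars, split //, $line;
}

# With the pragma in effect, \xCA\xA7 is decoded to one character.
printf "%d characters: %s\n", scalar(@chars), join(' ', @chars);
```

With the pragma in place the loop sees three characters. Removing the use open line and going back to the two binmode calls should give four, because the file is then read byte by byte.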

Tags: input, unicode, utf-8, perl

Frank
