I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}.
So my script should be callable in these two ways, as usual, giving the same output:
    ./script.pl utf8.txt
    cat utf8.txt | ./script.pl
But the outputs differ! Only the second call (using cat) seems to work as designed, reading UTF-8 properly. Here is the script:
    #!/usr/bin/perl -w

    binmode STDIN, ':utf8';
    binmode STDOUT, ':utf8';

    while(<>){
        my @chars = split //, $_;
        print "$_\n" foreach(@chars);
    }
How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <> for reading, if possible.
EDIT:
I realized I should probably describe the different outputs. My input file contains this sequence: a\xCA\xA7b. The method with cat correctly outputs:
a \xCA\xA7 b
But the other method gives me this:
a \xC3\x8A \xC2\xA7 b
Try using the open pragma instead:
    use strict;
    use warnings;
    use open qw(:std :utf8);

    while(<>){
        my @chars = split //, $_;
        print "$_" foreach(@chars);
    }
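You need to do this because the <> operator is magical. As you know, it reads from STDIN or from the files named in @ARGV. Reading from STDIN causes no problem: STDIN is already open, so binmode works on it.

The problem arises when reading from the files in @ARGV. When your script starts and calls binmode, those files are not open yet. binmode sets STDIN to UTF-8, but that handle is not used when @ARGV contains files. Instead, the <> operator opens a new file handle for each file in @ARGV, and each newly opened handle starts with the default layers, so it does not have the UTF-8 layer you set on STDIN. The raw bytes \xCA and \xA7 are then read as two separate characters, and the :utf8 layer on STDOUT encodes each of them again, which produces \xC3\x8A and \xC2\xA7.

With the open pragma, :utf8 becomes the default layer for every file handle opened in its scope, including the ones <> opens, and :std applies it to STDIN, STDOUT and STDERR as well.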
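To see where the extra bytes come from, here is a minimal sketch that reproduces both outputs from the question without any file I/O. It assumes the core Encode module and uses a hypothetical helper, hex_bytes, that is not part of the original script. It only illustrates the double encoding and is not a replacement for the open pragma.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# The raw bytes from the question's input file.
my $bytes = "a\xCA\xA7b";

# Case 1: no decoding layer on the input handle.
# Each byte is treated as one character, so split yields four characters.
my @undecoded = split //, $bytes;

# Case 2: the input is decoded as UTF-8 first.
# \xCA\xA7 becomes the single character U+02A7, so split yields three.
my @decoded = split //, decode('UTF-8', $bytes);

# Hypothetical helper: encode one character as UTF-8 (what the :utf8
# layer on STDOUT does) and return the resulting bytes in hex.
sub hex_bytes {
    my ($char) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $char);
}

print scalar(@undecoded), " characters without decoding: ",
    join(' | ', map { hex_bytes($_) } @undecoded), "\n";
print scalar(@decoded), " characters after decoding: ",
    join(' | ', map { hex_bytes($_) } @decoded), "\n";

# Prints:
# 4 characters without decoding: 61 | C3 8A | C2 A7 | 62
# 3 characters after decoding: 61 | CA A7 | 62
```

The first line matches the output of ./script.pl utf8.txt in the question, and the second matches the cat pipeline.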