Here's a simple perl script that is supposed to write a utf-8 encoded file:
use warnings;
use strict;
open (my $out, '>:encoding(utf-8)', 'tree.out') or die;
print $out readpipe ('tree ~');
close $out;
I have expected readpipe to return a utf-8 encoded string since LANG
is set toen_US.UTF-8
. However, looking at tree.out
(while making sure the editor recognizes it a as utf-8 encoded) shows me all garbled text.
If I change the >:encoding(utf-8)
in the open statement to >:encoding(latin-1)
, the script creates a utf-8 file with the expected
text.
This is all a bit strange to me. What is the explanation for this behavior?
readpipe
is returning to perl a string of undecoded bytes. We know that that string is UTF-8 encoded, but you've not told Perl.
The IO layer on your output handle is taking that string, assuming it is Unicode code-points and re-encoding them as UTF-8 bytes.
The reason that the latin-1 IO layer appears to be functioning correctly is that it is writing out each undecoded byte unmolested because the 1st 256 unicode code-points correspond nicely with latin-1.
The proper thing to do would be to decode
the byte-string returned by readpipe
into a code-point-string, before feeding it to an IO-layer. The statement use open ':utf8'
, as mentioned by Borodin, should be a viable solution as readpipe
is specifically mentioned in the open
manual page.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With