Here's a simple perl script that is supposed to write a utf-8 encoded file:
use warnings;
use strict;
open (my $out, '>:encoding(utf-8)', 'tree.out') or die;
print $out readpipe ('tree ~');
close $out;
I have expected readpipe to return a utf-8 encoded string since LANG is set toen_US.UTF-8. However, looking at tree.out (while making sure the editor recognizes it a as utf-8 encoded) shows me all garbled text.
If I change the >:encoding(utf-8) in the open statement to >:encoding(latin-1), the script creates a utf-8 file with the expected
text.
This is all a bit strange to me. What is the explanation for this behavior?
readpipe is returning to perl a string of undecoded bytes.  We know that that string is UTF-8 encoded, but you've not told Perl.
The IO layer on your output handle is taking that string, assuming it is Unicode code-points and re-encoding them as UTF-8 bytes.
The reason that the latin-1 IO layer appears to be functioning correctly is that it is writing out each undecoded byte unmolested because the 1st 256 unicode code-points correspond nicely with latin-1.
The proper thing to do would be to decode the byte-string returned by readpipe into a code-point-string, before feeding it to an IO-layer.  The statement use open ':utf8', as mentioned by Borodin, should be a viable solution as readpipe is specifically mentioned in the open manual page.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With