The root cause of this question is my attempt to write tests for a new option/argument processing module (OptArgs) for Perl. This of course involves parsing @ARGV, which I am doing based on the answers to this question. This works fine on systems where I18N::Langinfo::CODESET is defined[1]. On systems where langinfo(CODESET) is not available I would like to at least make a best effort based on observed behaviour. However, my tests so far indicate that on some systems I cannot even pass a Unicode argument to an external script properly.
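For context, the @ARGV decoding based on that question looks roughly like this (a sketch only, assuming langinfo(CODESET) is available and reflects the environment's locale):

use POSIX qw(setlocale LC_ALL);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(decode);

# Adopt the locale from the environment, ask which codeset it uses
# (e.g. "UTF-8"), and decode the raw @ARGV bytes accordingly.
setlocale( LC_ALL, '' );
my $codeset = langinfo(CODESET);
@ARGV = map { decode( $codeset, $_ ) } @ARGV;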
I have managed to run something like the following on various systems, where "test_script" is a Perl script that merely does a print Dumper(@ARGV):
use utf8;                                # source is UTF-8, so $utf8 is a character string
my $utf8   = '¥';
my $result = qx/$^X test_script $utf8/; # run test_script with the current perl ($^X)
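For reference, a minimal test_script along those lines need do no more than this (hypothetical, but it is all the original requires):

use strict;
use warnings;
use Data::Dumper;

# Dump exactly what arrived in @ARGV, byte for byte.
print Dumper(@ARGV);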
What I have found is that on FreeBSD test_script receives bytes which can be decoded into Perl's internal format. However, on OpenBSD and Solaris test_script appears to get the string "\x{fffd}\x{fffd}", which contains only the Unicode replacement character (twice?).
I don't know the mechanism underlying the qx operator. I presume it either execs or shells out, but unlike filehandles (which I can binmode for encoding) I don't know how to ensure it does what I want. The same goes for system(), for that matter. So my question is: what am I not doing correctly above? Otherwise, what is different about Perl or the shell or the environment on OpenBSD and Solaris?
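For what it's worth, the best workaround I can think of (a sketch only, and it assumes the child expects UTF-8 bytes) is to encode the argument explicitly before it crosses the process boundary, so that qx sees plain bytes rather than a character string:

use utf8;
use Encode qw(encode_utf8);

my $utf8   = '¥';
my $bytes  = encode_utf8($utf8);          # explicit UTF-8 byte string
my $result = qx/$^X test_script $bytes/;  # the child now receives raw bytes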
[1] Actually, I think so far that is only Linux, according to CPAN Testers results.
Update (x2): I currently have the following running its way through the CPAN Testers setups to test Schwern's hypothesis:
use strict;
use warnings;
use Data::Dumper;

# When run with arguments we are the child: report what we received
# before utf8::all has had a chance to touch @ARGV.
BEGIN {
    if (@ARGV) {
        require Test::More;
        Test::More::diag( "\npre utf8::all: "
                . Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
    }
}

use utf8;
use utf8::all;

# Report @ARGV again after utf8::all has loaded, then exit: the child
# never runs the test body below.
BEGIN {
    if (@ARGV) {
        Test::More::diag( "\npost utf8::all: "
                . Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
        exit;
    }
}

use Encode;
use Test::More;

my $builder = Test::More->builder;
binmode $builder->output,         ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';
binmode $builder->todo_output,    ':encoding(UTF-8)';

# Pass the same character both as a Perl character string and as
# explicit UTF-8 bytes, re-running this script as the child.
my $utf8  = '¥';
my $bytes = encode_utf8($utf8);

diag( "\nPassing: " . Dumper( { utf8 => $utf8, bytes => $bytes } ) );

open( my $fh, '-|', $^X, $0, $utf8, $bytes ) || die "open: $!";
my $result = join( '', <$fh> );
close $fh;

ok(1);
done_testing();
I'll post the results from various systems as they come through. Any comments on the validity and/or correctness of this would be appreciated. Note that it is not intended to be a valid test; the purpose of the above is to be able to compare what is received on different systems.
Resolution: The real underlying issue turns out to be something not addressed in my question nor by Schwern's answer below. What I discovered is that some CPAN Testers machines only have an ASCII locale installed/available. I should not expect any attempt to pass UTF-8 characters to programs in that type of environment to work. So in the end my problem was invalid test conditions, not invalid code.
I have seen nothing so far to indicate that the qx operator or the utf8::all module has any effect on how parameters are passed to external programs. The critical component appears to be the LANG and/or LC_ALL environment variables, which inform the external program what locale it is running in.
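Based on that, a rough guard for the test suite (my own sketch; it only inspects the environment and cannot prove the locale is actually installed) would be:

use Test::More;

# Skip the UTF-8 argument-passing tests unless the environment
# advertises a UTF-8 locale via LC_ALL, LC_CTYPE or LANG.
my $locale = $ENV{LC_ALL} || $ENV{LC_CTYPE} || $ENV{LANG} || '';
plan skip_all => "no UTF-8 locale detected ('$locale')"
    unless $locale =~ /utf-?8/i;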
By the way, my original assertion that my code was working on all systems where I18N::Langinfo::CODESET is defined was incorrect.
qx makes a call to the shell, and it may be interfering. To avoid that, use utf8::all to switch on all the Perl Unicode voodoo. Then use the open function to open a pipe to your program, avoiding the shell.
use utf8::all;   # decode @ARGV and apply UTF-8 layers to filehandles

my $utf8 = '¥';

# List form of open with "-|": runs test_script directly, no shell.
open my $read_from_script, "-|", "test_script", $utf8;
print <$read_from_script>, "\n";
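Because the list form of open bypasses the shell entirely, the argument reaches test_script's @ARGV without any shell quoting or word splitting in between.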