
In Perl, how do I pass Unicode arguments to external commands?

The root cause for this question is my attempt to write tests for a new option/argument processing module (OptArgs) for Perl. This of course involves parsing @ARGV which I am doing based on the answers to this question. This works fine on systems where I18N::Langinfo::CODESET is defined[1].

On systems where langinfo(CODESET) is not available I would like to at least make a best effort based on observed behaviour. However, my tests so far indicate that on some systems I cannot even pass a Unicode argument to an external script properly.

I have managed to run something like the following on various systems where "test_script" is a Perl script that merely does a print Dumper(@ARGV):

use utf8;
my $utf8   = '¥';
my $result = qx/$^X test_script $utf8/;

What I have found is that on FreeBSD the test_script receives bytes which can be decoded into Perl's internal format. However, on OpenBSD and Solaris test_script appears to get the string "\x{fffd}\x{fffd}", which contains only the Unicode replacement character (twice?).

I don't know the mechanism underlying the qx operator. I presume it either exec's or shells out, but unlike filehandles (where I can binmode them for encoding) I don't know how to ensure it does what I want. Same with system() for that matter. So my question is what am I not doing correctly above? Otherwise what is different with Perl or the shell or the environment on OpenBSD and Solaris?

[1] Actually I think so far that is only Linux according to CPAN testers results.

Update(x2): I currently have the following working its way through the CPAN Testers setups to test Schwern's hypothesis:

use strict;
use warnings;
use Data::Dumper;

BEGIN {
    if (@ARGV) {
        require Test::More;
        Test::More::diag( "\npre utf8::all: "
              . Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
    }
}

use utf8;
use utf8::all;

BEGIN { 
    if (@ARGV) {
        Test::More::diag( "\npost utf8::all: "
              . Dumper( { utf8 => $ARGV[0], bytes => $ARGV[1] } ) );
        exit;
    }
}

use Encode;
use Test::More;

my $builder = Test::More->builder;
binmode $builder->output,         ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';
binmode $builder->todo_output,    ':encoding(UTF-8)';

my $utf8  = '¥';
my $bytes = encode_utf8($utf8);

diag( "\nPassing: " . Dumper( { utf8 => $utf8, bytes => $bytes, } ) );

open( my $fh, '-|', $^X, $0, $utf8, $bytes ) || die "open: $!";
my $result = join( '', <$fh> );
close $fh;

ok(1);
done_testing();

I'll post the results from various systems when they come through. Any comments on the validity and/or correctness of this would be appreciated. Note that it is not intended to be a valid test. The purpose of the above is to be able to compare what is received on different systems.

Resolution: The real underlying issue turns out to be something addressed neither in my question nor in Schwern's answer below. What I discovered is that some CPAN Testers machines only have an ASCII locale installed/available. I should not expect any attempt to pass UTF-8 characters to programs in this type of environment to work. So in the end my problem was invalid test conditions, not invalid code.
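
One plausible reading of the doubled replacement character seen earlier, sketched under the assumption that the receiving side decoded the argument with an ASCII codeset (I have not confirmed that this is what those machines did): '¥' is two bytes in UTF-8, and neither byte is valid ASCII, so Encode's default behaviour substitutes U+FFFD for each one. The variable names here are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# U+00A5 (YEN SIGN), written as an escape so "use utf8" is not needed here.
my $bytes = encode_utf8("\x{a5}");

# Hex dump of the encoded argument, taken before any decoding.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;

# Decoding as UTF-8 recovers the single original character.
my $as_utf8 = decode( 'UTF-8', $bytes );

# Decoding as ASCII cannot map either byte; with the default CHECK value
# Encode replaces each unmappable byte with U+FFFD.
my $as_ascii = decode( 'ascii', $bytes );

printf "bytes:    %s\n", $hex;
printf "as UTF-8: %s\n",
  join ' ', map { sprintf 'U+%04X', ord } split //, $as_utf8;
printf "as ASCII: %s\n",
  join ' ', map { sprintf 'U+%04X', ord } split //, $as_ascii;
```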

I have seen nothing so far to indicate that the qx operator or the utf8::all module has any effect on how parameters are passed to external programs. The critical component appears to be the LANG and/or LC_ALL environment variables, which tell the external program what locale it is running in.
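
A minimal sketch of how a test could check its own conditions first, assuming it is acceptable to skip when the environment does not select a UTF-8 locale. The helpers locale_codeset and locale_is_utf8 are hypothetical names of mine, not part of any module, and the langinfo call is wrapped in eval because it is not usable on every system.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: the codeset selected by LC_ALL / LC_CTYPE / LANG,
# or undef when it cannot be determined.
sub locale_codeset {
    # An empty locale name tells the C library to consult the environment.
    # setlocale returns undef when the requested locale is not installed.
    defined setlocale( LC_CTYPE, '' ) or return;

    # langinfo(CODESET) may be unavailable, so trap any failure.
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo( I18N::Langinfo::CODESET() );
    };
    return $codeset;
}

# Hypothetical helper: 1 if the environment selects a UTF-8 codeset, else 0.
sub locale_is_utf8 {
    my $codeset = locale_codeset();
    return ( defined($codeset) && $codeset =~ /\Autf-?8\z/i ) ? 1 : 0;
}

print locale_is_utf8()
  ? "UTF-8 locale: Unicode argument tests are meaningful\n"
  : "no UTF-8 locale: skip the Unicode argument tests\n";
```

In a Test::More script the same check could drive plan( skip_all => ... ) before any external command is run.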

By the way, my original assertion that my code was working on all systems where I18N::Langinfo::CODESET is defined was incorrect.

asked Jun 20 '12 01:06 by Mark Lawrence


1 Answer

qx makes a call to the shell and it may be interfering.

To avoid that, use utf8::all to switch on all the Perl Unicode voodoo. Then use the open function to open a pipe to your program, avoiding the shell.

use utf8::all;
my $utf8   = '¥';

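# List-form pipe open: the arguments go straight to the program, no shell.
open my $read_from_script, "-|", "test_script", $utf8
    or die "Cannot run test_script: $!";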
print <$read_from_script>,"\n";
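
A self-contained variation on the same idea, offered as a sketch and not as the definitive approach: it encodes the argument to UTF-8 bytes explicitly instead of relying on utf8::all, and uses a one-line child program (my own stand-in for test_script) that reports the raw bytes it received as hex. It assumes the child perl is not told to decode @ARGV itself (for example through PERL_UNICODE).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for test_script: print the bytes of the first argument as hex.
my $child = q{print join(' ', map { sprintf '%02X', ord } split //, $ARGV[0]), "\n"};

# U+00A5 (YEN SIGN), encoded explicitly so the bytes passed are known.
my $arg = encode( 'UTF-8', "\x{a5}" );

# List-form pipe open: perl execs the child directly, without a shell.
open( my $fh, '-|', $^X, '-e', $child, $arg ) or die "open: $!";
my $hex = <$fh>;
close($fh) or die "child failed: $! (exit status $?)";
chomp $hex;

print "child received bytes: $hex\n";
```

If the bytes arrive intact, any remaining difference between systems lies in how the receiving program decodes them, which depends on its locale.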
answered Oct 20 '22 20:10 by Schwern