
How can I treat command-line arguments as UTF-8 in Perl?

How do I treat the elements of @ARGV as UTF-8 in Perl?

Currently I'm using the following workaround:

    use Encode qw(decode encode);

    my $foo = $ARGV[0];
    $foo = decode("utf-8", $foo);

It works, but it is not very elegant.

I'm using Perl v5.8.8, called from bash v3.2.25 with LANG set to en_US.UTF-8.

knorv asked Jan 10 '10 15:01



2 Answers

Outside data sources are tricky in Perl. Command-line arguments probably arrive as bytes in the encoding specified by your locale. Don't rely on your locale being the same as that of someone else who might run your program.

You have to find out what that encoding is, then decode the arguments into Perl's internal format. Fortunately, it's not that hard.

The I18N::Langinfo module has the stuff you need to get the encoding:

    use I18N::Langinfo qw(langinfo CODESET);
    my $codeset = langinfo(CODESET);

Once you know the encoding, you can decode them to Perl strings:

    use Encode qw(decode);
    @ARGV = map { decode $codeset, $_ } @ARGV;

Although Perl may store strings internally as UTF-8, you should never need to think about that. Just decode whatever you receive, which turns it into Perl's internal representation, and trust Perl to handle the rest. When you need to store or output the data, encode it explicitly with the encoding you want.
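Putting the two steps together, a minimal sketch might look like the following. The helper name `decode_args` is hypothetical (it is not part of any module), and the choice to die on malformed input via `Encode::FB_CROAK` is an assumption of this sketch rather than something the answer prescribes.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: decode a list of byte strings from $codeset
# into Perl character strings.
sub decode_args {
    my ($codeset, @bytes) = @_;
    my $enc = find_encoding($codeset)
        or die "Unknown encoding '$codeset'\n";
    # FB_CROAK makes decode die on malformed input. With a CHECK value
    # set, decode may modify its source, so work on a copy of each item.
    return map { my $copy = $_; $enc->decode($copy, Encode::FB_CROAK) } @bytes;
}

if (@ARGV) {
    # Ask the locale which encoding the arguments are (probably) in.
    my $codeset = langinfo(CODESET);
    my @args = decode_args($codeset, @ARGV);

    # Encode on the way out, here with the same locale encoding.
    binmode STDOUT, ":encoding($codeset)";
    printf "%s: %d character(s)\n", $_, length $_ for @args;
}
```

After decoding, `length` counts characters rather than bytes, which is a quick way to check that the decoding step actually ran.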

If you know that your setup is UTF-8 and the terminal will give you the command-line arguments as UTF-8, you can use the A option with Perl's -C switch. This tells your program to assume the arguments are encoded as UTF-8:

    % perl -CA program
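If the script cannot be sure it was started with -CA, one option is to check the `${^UNICODE}` variable, which reflects the -C settings, and decode by hand only when needed. This is a sketch under the assumption that the arguments really are UTF-8; the helper name `has_argv_flag` is made up for illustration, and the value 32 for the A flag is taken from the -C table in perlrun.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the A flag (value 32 in perlrun's -C table)
# is set in the given -C bitmask.
sub has_argv_flag {
    my ($bits) = @_;
    return ($bits & 32) ? 1 : 0;
}

# ${^UNICODE} holds the current -C / PERL_UNICODE settings (0 if none).
# Decode @ARGV by hand only if Perl has not already done so,
# assuming the arguments are UTF-8.
unless (has_argv_flag(${^UNICODE})) {
    @ARGV = map { decode('UTF-8', $_) } @ARGV;
}
```

This way the script behaves the same whether it is run as `perl -CA program` or plain `perl program`.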

brian d foy answered Oct 20 '22 19:10


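Use Encode::Locale:

    use Encode ();        # provides the Encode::FB_CROAK constant
    use Encode::Locale;   # exports decode_argv

    # Decode @ARGV in place from the locale's encoding; die on malformed input.
    decode_argv(Encode::FB_CROAK);

This works pretty well for me, including on Win32.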

MichielB answered Oct 20 '22 21:10