
In Perl, how can I check if an encoding specified in a string is valid?

Say I have a sub that receives a hash reference with two entries: an encoding specification and a file path. The sub then uses that information to open the file for reading, as shown below, stripped down to its essentials:

run({
    encoding => 'UTF-16---LE',
    input_filename => 'test_file.txt',
});

sub run {
    my $args = shift;
    my ($enc, $fn) = @{ $args }{qw(encoding input_filename)};

    my $is_ok = open my $in,
        sprintf('<:encoding(%s)', $args->{encoding}),
        $args->{input_filename}
    ;
}

Now, this croaks with:

Cannot find encoding "UTF-16---LE" at E:\Home\...

What is the right way to ensure that $args->{encoding} holds a valid encoding specification before interpolating into the second argument to open?

Update

The information below is provided in the hope that it will be useful to someone at some point. I am also going to file a bug report.

The documentation for Encode::Alias does not mention find_alias at all. A casual look at Encode/Alias.pm on my Windows system reveals:

# Public, encouraged API is exported by default

our @EXPORT =
  qw (
  define_alias
  find_alias
);

However, note:

#!/usr/bin/env perl

use 5.014;
use Encode::Alias;
say find_alias('UTF-8')->name;

yields:

Use of uninitialized value $find in exists at C:/opt/Perl/lib/Encode/Alias.pm line 25.
Use of uninitialized value $find in hash element at C:/opt/Perl/lib/Encode/Alias.pm line 26.
Use of uninitialized value $find in pattern match (m//) at C:/opt/Perl/lib/Encode/Alias.pm line 31.
Use of uninitialized value $find in lc at C:/opt/Perl/lib/Encode/Alias.pm line 40.
Use of uninitialized value $find in pattern match (m//) at C:/opt/Perl/lib/Encode/Alias.pm line 31.
Use of uninitialized value $find in lc at C:/opt/Perl/lib/Encode/Alias.pm line 40.

Being 1) lazy, and 2) first to assume I am doing something wrong, I decided to seek others' wisdom.

In any case, the bug is that find_alias is exported as a function, but its body assumes it is always invoked as a method:

sub find_alias {
    require Encode;
    my $class = shift;
    my $find  = shift;
    unless ( exists $Alias{$find} ) {

If find_alias is not invoked as a method, the argument is now in $class and $find is undefined.
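The argument shift can be reproduced in isolation. The following is only a minimal sketch: Demo::Lookup is a hypothetical package invented for illustration, not part of Encode, and its find_alias merely copies the two shift lines from the excerpt above and reports what ended up where.

```perl
use strict;
use warnings;

package Demo::Lookup;    # hypothetical package, not part of Encode

# Same argument handling as the find_alias excerpt: first $class, then $find.
sub find_alias {
    my $class = shift;
    my $find  = shift;
    return defined $find
        ? "class=$class find=$find"
        : "class=$class find=(undef)";
}

package main;

# Method call: the package name fills $class, the argument fills $find.
print Demo::Lookup->find_alias('UTF-8'), "\n";    # class=Demo::Lookup find=UTF-8

# Plain function call: the argument fills $class and $find stays undefined.
print Demo::Lookup::find_alias('UTF-8'), "\n";    # class=UTF-8 find=(undef)
```

With the function-call form, every later use of $find operates on undef, which matches the "uninitialized value $find" warnings quoted above.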

HTH.

Sinan Ünür asked Mar 31 '12 13:03



1 Answer

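Encode::Alias->find_alias($encoding_name), called as a class method, returns an encoding object on success and a false value on failure. Calling name on that object gives the canonical encoding name. In the session below, the first call fails and therefore prints nothing.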

$ Encode::Alias->find_alias('UTF-16---LE')
$ Encode::Alias->find_alias('UTF-16 LE')
Encode::Unicode  {
    Parents       Encode::Encoding
    Linear @ISA   Encode::Unicode, Encode::Encoding
    public methods (6) : bootstrap, decode, decode_xs, encode, encode_xs, renew
    private methods (0)
    internals: {
        endian   "v",
        Name   "UTF-16LE",
        size   2,
        ucs2   ""
    }
}
$ Encode::Alias->find_alias('Latin9')
Encode::XS  {
    public methods (9) : cat_decode, decode, encode, mime_name, name, needs_lines, perlio_ok, renew, renewed
    private methods (0)
    internals: 140076283926592
}
$ Encode::Alias->find_alias('UTF-16 LE')->name
UTF-16LE
$ Encode::Alias->find_alias('Latin9')->name
iso-8859-15
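To apply this check before calling open, one option is sketched below. Note that it swaps in Encode::find_encoding, the documented lookup function in Encode, instead of calling Encode::Alias->find_alias directly. It returns an encoding object, or undef if the name cannot be resolved. The helper names canonical_encoding and open_with_encoding are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: canonical name for a known encoding, undef otherwise.
sub canonical_encoding {
    my ($name) = @_;
    return unless defined $name && length $name;
    my $enc = Encode::find_encoding($name);
    return defined $enc ? $enc->name : undef;
}

# Hypothetical helper: validate the encoding first, then open the file.
sub open_with_encoding {
    my ($encoding, $filename) = @_;
    my $canonical = canonical_encoding($encoding);
    die "Unknown encoding '$encoding'\n" unless defined $canonical;
    open my $in, "<:encoding($canonical)", $filename
        or die "Cannot open '$filename': $!\n";
    return $in;
}

# The invalid name from the question is rejected before open is attempted.
my $fh = eval { open_with_encoding('UTF-16---LE', 'test_file.txt') };
print "rejected: $@" unless defined $fh;
```

Interpolating the canonical name rather than the user-supplied string also means the layer specification only ever contains a name that Encode has already resolved.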
daxim answered Oct 13 '22 23:10