
Can Encode::Guess tell utf-8 from iso-8859-1?

I have a string $data that is encoded in UTF-8, but suppose I don't know whether it is UTF-8 or ISO-8859-1. I want to use the Perl Encode::Guess module to tell which of the two it is, but I'm having trouble figuring out how this module works.

I have tried the following four methods (from http://perldoc.perl.org/Encode/Guess.html):

use Encode::Guess qw/utf8 latin1/;

my $decoder = guess_encoding($data);

print "$decoder\n";

Result: iso-8859-1 or utf8

use Encode::Guess qw/utf8 latin1/;

my $enc = guess_encoding($data, qw/utf8 latin1/);
ref($enc) or die "Can't guess: $enc";
my $utf8 = $enc->decode($data); 

print "$utf8\n";

Result: Can't guess: iso-8859-1 or utf8 at encodage-windows.pl line 25, line 18110.

use Encode::Guess qw/utf8 latin1/;

my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);

print "$utf8\n";

Result: iso-8859-1 or utf8 at encodage-windows.pl line 30, line 18110.

use Encode::Guess qw/utf8 latin1/;

my $utf8 = Encode::decode("Guess", $data);

print "$utf8\n";

Result: iso-8859-1 or utf8 at /usr/local/lib/perl5/Encode.pm line 175.

My first question is: which one of these methods am I supposed to use (if any)? And my second question: what changes should I make to make this work?

asked Apr 11 '14 at 14:04 by kormak


1 Answer

I normally check the possible encodings one at a time, like this:

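use Encode::Guess;

# Try UTF-8 first; fall back to ISO-8859-1 only if that fails.
my $decoder = guess_encoding($data, 'utf8');
$decoder = guess_encoding($data, 'iso-8859-1') unless ref $decoder;
die $decoder unless ref $decoder;   # a plain string here is an error message

printf "Decoding as %s\n\n", $decoder->name;
$data = $decoder->decode($data);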

This chooses UTF-8 if possible and otherwise tries ISO-8859-1, which it either accepts or rejects with an error. Each call is a simple yes/no test for a single encoding, so it can never come up with two possible results, which is the error you're getting.
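
The same idea can be wrapped in a small helper, shown here as a minimal sketch. The name first_matching_decoder and the two sample byte strings are assumptions made for illustration; only guess_encoding, name and decode come from Encode::Guess and Encode. The order of the suspects matters: any byte string that is valid UTF-8 is also valid ISO-8859-1, so UTF-8 has to be tried first.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding by default

# Hypothetical helper (not part of Encode::Guess): test each suspect
# encoding on its own, in order, and return the first decoder object
# that accepts the bytes. Returns nothing if no suspect matches.
sub first_matching_decoder {
    my ($bytes, @suspects) = @_;
    for my $suspect (@suspects) {
        my $decoder = guess_encoding($bytes, $suspect);
        return $decoder if ref $decoder;   # an object means success
    }
    return;
}

# Illustrative sample data: the word "café" in each encoding.
my $utf8_bytes   = "caf\xC3\xA9";   # UTF-8 bytes
my $latin1_bytes = "caf\xE9";       # ISO-8859-1 bytes

for my $bytes ($utf8_bytes, $latin1_bytes) {
    my $decoder = first_matching_decoder($bytes, 'utf8', 'iso-8859-1')
        or die "Can't guess the encoding";
    my $text = $decoder->decode($bytes);
    printf "%s: %d characters\n", $decoder->name, length $text;
}
```

Because ISO-8859-1 assigns a character to every byte value, the fallback accepts anything that is not valid UTF-8, so a successful ISO-8859-1 result only means "not UTF-8", not "definitely ISO-8859-1".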

answered Sep 30 '22 at 01:09 by Borodin