How can I guess the encoding of a string in Perl?

I have a Unicode string and don't know what its encoding is. When this string is read by a Perl program, is there a default encoding that Perl will use? If so, how can I find out what it is?

I am trying to get rid of non-ASCII characters from the input. I found this on some forum that will do it:

my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''}); 

How will the above work when no input encoding is specified? Should it be specified like the following?

my $line = encode('ascii', normalize('KD', decode('input-encoding', $myutf)), sub {$_[0] = ''});
Maulin asked Dec 28 '09 17:12


2 Answers

To find out which encoding some unknown data uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)

    use Encode::Detect::Detector;
    my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
                  "\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
                  "\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
                  "\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
    my $encoding_name = Encode::Detect::Detector::detect($unknown);
    print $encoding_name; # gb18030

    use Encode;
    my $string = decode($encoding_name, $unknown);
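Encode::Guess ships with Perl, so it can be tried without installing anything. The following is a minimal sketch, assuming the input is a short run of UTF-8 octets and relying on the module's default suspects (ASCII, UTF-8, and BOM-marked UTF-16/32). The sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding

# Sample input (an assumption for this sketch): the UTF-8 octets for 北京
my $octets = "\xe5\x8c\x97\xe4\xba\xac";

# guess_encoding returns an encoding object on success,
# or a plain error string when it cannot decide
my $guess = guess_encoding($octets);
die "Could not guess the encoding: $guess\n" unless ref $guess;

my $string = $guess->decode($octets);
printf "guessed %s, decoded %d characters\n", $guess->name, length $string;
```

Extra suspects can be passed as additional arguments, for example `guess_encoding($octets, qw(euc-jp shiftjis))`. The more suspects you add, the more likely the result is an "ambiguous" error string rather than an object, so the `ref` check matters.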

I find encode('ascii', ...) a lame solution for getting rid of non-ASCII characters. Every non-ASCII character will be substituted with a question mark; this is too lossy to be useful.

    # Bad example; don't do this.
    use utf8;
    use Encode;
    my $string = 'This year I went to 北京 Perl workshop.';
    print encode('ascii', $string);
    # This year I went to ?? Perl workshop.

If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.

    use utf8;
    use Text::Unidecode;
    my $string = 'This year I went to 北京 Perl workshop.';
    print unidecode($string);
    # This year I went to Bei Jing  Perl workshop.
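For comparison, the decomposition approach from the question can be sketched as follows: decode the octets first, decompose with NFKD, then encode to ASCII while dropping whatever does not map. This assumes the input really is UTF-8 (that is exactly the part you have to know or guess), and the sample text is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFKD);

# Assumed input: UTF-8 octets for "café naïve"
my $octets = "caf\xc3\xa9 na\xc3\xafve";

# Step 1: octets -> characters, using the (assumed) input encoding
my $chars = decode('UTF-8', $octets);

# Step 2: NFKD splits accented letters into base letter + combining mark.
# Step 3: the code reference is called for each character ASCII cannot
# represent; returning '' drops it.
my $ascii = encode('ascii', NFKD($chars), sub { '' });

print "$ascii\n";   # cafe naive
```

This only keeps characters that decompose to an ASCII base letter. Text such as 北京 has no such decomposition and disappears entirely, which is where Text::Unidecode does better.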

However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either PERLQQ or XMLCREF, which escape the unmappable characters instead of discarding them.

    use utf8;
    use Encode qw(encode PERLQQ XMLCREF LEAVE_SRC);
    my $string = 'This year I went to 北京 Perl workshop.';
    # LEAVE_SRC keeps encode from overwriting $string in place
    print encode('ascii', $string, PERLQQ | LEAVE_SRC);
    # This year I went to \x{5317}\x{4eac} Perl workshop.
    print encode('ascii', $string, XMLCREF | LEAVE_SRC);
    # This year I went to &#x5317;&#x4eac; Perl workshop.
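Reversing the PERLQQ form later can be done with a substitution. This is a rough sketch rather than a complete decoder: the regex is an assumption about the `\x{...}` escape format shown above, and it would also convert any literal `\x{...}` text that was already in the input.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode PERLQQ LEAVE_SRC);

my $string = 'This year I went to 北京 Perl workshop.';

# LEAVE_SRC keeps $string intact so it can be compared afterwards
my $escaped = encode('ascii', $string, PERLQQ | LEAVE_SRC);

# Turn each \x{HHHH} escape back into the character it stands for
(my $restored = $escaped) =~ s/\\x\{([0-9a-fA-F]+)\}/chr hex $1/ge;

print $restored eq $string ? "round trip ok\n" : "mismatch\n";
```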
daxim answered Sep 20 '22 02:09


The Encode module has a way that you can try to do this. You decode the raw octets with what you think the encoding is. If the octets aren't valid in that encoding, it blows up and you catch it with an eval. Otherwise, you get back a properly decoded string. For example:

    use Encode;

    my $a_with_ring =
      eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
        or die "Could not decode string: $@";

This has the drawback that the same octet sequence can be valid in multiple encodings, so a decode that succeeds doesn't prove you guessed the right one.
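To make the drawback concrete, here is a small sketch that tries a list of candidate encodings in order and keeps the first one that decodes cleanly. The helper name `decode_first_match` and the candidate list are hypothetical, not part of Encode. `LEAVE_SRC` is added so that `decode` doesn't modify the input between attempts.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns (encoding name, decoded string) for the
# first candidate that decodes without error, or an empty list.
sub decode_first_match {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        my $string = eval { decode($name, $octets, FB_CROAK | LEAVE_SRC) };
        return ($name, $string) if defined $string;
    }
    return;
}

my $octets = "\xc3\xa5";   # the UTF-8 octets for U+00E5 (a with ring)

my ($name, $string) = decode_first_match($octets, 'UTF-8', 'ISO-8859-1');
printf "%s: %d character(s)\n", $name, length $string;

# The same octets are also valid ISO-8859-1, just with a different meaning
my $latin1 = decode('ISO-8859-1', $octets, FB_CROAK | LEAVE_SRC);
printf "ISO-8859-1: %d character(s)\n", length $latin1;
```

Because single-byte encodings such as ISO-8859-1 accept every octet sequence, they never fail. Put the strict encodings first and treat a permissive match as a last resort.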

I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)

You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.

brian d foy answered Sep 20 '22 02:09