How do perl strings represented internally? What encoding is used? How do I handle different encodings properly?
I've been using perl for quite a long time, but it didn't include a lot of string handling in different encodings, and when I encountered a minor problem that had something to do with encodings I usually resorted to some shamanic actions.
Until this moment I thought about perl strings as sequences of bytes, which did fit pretty well for my tasks. Now I need to do some processing of UTF-8 encoded file and here starts trouble.
First, I read file into string like this:
open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');
my $contents;
{
local $/;
$contents = <$in>;
}
close($in);
then simply print it:
print $contents;
And I get two things: a warning Wide character in print at <scriptname> line <n>
and a garbage in console. So I can conclude that perl strings have a concept of "character" that can be "wide" or not, but when printed these "wide" characters are represented in console as multiple bytes, not as single "character".
(I wonder now why did all my previous experience with binary files worked quite how I expected it to work without any "character" issues).
Why then I see garbage in console? If perl stores strings as character in some known encoding, I don't think there is a big problem to find out console encoding and print text properly. (I use Windows, BTW).
If perl stores strings as variable-width character sequences (e.g. using same UTF-8 encoding), why is it done this way? From my C experience handling strings is PAIN.
Update.
I use two computers for testing, one runs Windows 7 x64 with English language pack installed, but with Russian regional settings (so I have cp866 as OEM codepage and cp1251 as ANSI) with ActivePerl 5.10.1 x64; another runs Windows XP 32 bit Russian localization with Cygwin Perl 5.10.0.
Thanks to links, now I have much more solid understanding on what's going on and how things should be done.
A string in Perl is a scalar variable and start with a ($) sign and it can contain alphabets, numbers, special characters. The string can consist of a single word, a group of words or a multi-line paragraph. The String is defined by the user within a single quote (') or double quote (“).
Perl | substr() function This function by default returns the remaining part of the string starting from the given index if the length is not specified. A replacement string can also be passed to the substr() function if you want to replace that part of the string with some other substring.
There are different types of string operators in Perl, as follows: Concatenation Operator (.) Repetition Operator (x) Auto-increment Operator (++)
Setting utf8 before reading from the file is good, it automagically decodes the bytes into the internal encoding. (Which is also UTF-8 but you don't need to know, and shouldn't rely on.)
Before printing you need to encode the characters back to bytes.
use Encode;
utf8::encode($contents);
There is also a two argument form of encode, for other encodings than unicode. (That sentence echoes too much, doesn't it?)
Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the unicode article on Joel on Software.
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
Oh, and it must use multi-byte strings, because otherwise it's just not unicode.
Perl strings are stored internally in one of two encodings, either a 8-bit byte oriented native encoding, or UTF-8. For backwards comparability the assumption is that all I/O and strings are in native encoding, unless otherwise specified. Native encoding is usually 8-bit ASCII, but this can be changed with use locale
.
In your sample you call binmode on your input handle changing it to use :utf8
semantics. One effect of this is that all strings read from this handle will be encoded as UTF-8. print
writes to STDOUT
by default, and STDOUT
defaults to expecting native encoded characters.
Perl in an attempt to do the right thing will allow a UTF-8 string to be sent to a native encoded output, but if there is no encoding attached to that handle then it has to guess how to output multi-byte characters and it will almost certainly guess wrong. That is what the warning means, a multi-byte character was sent to a stream only expecting single byte characters and the result was that the character was probably damaged in translation.
Depending on what you want to accomplish you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single byte character set that can be printed safely or if you know that whatever is attached to STDOUT
can handle UTF-8 you can use binmode(STDOUT, ':utf8');
to tell Perl you want any data sent to STDOUT
to be sent as UTF-8.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With