Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl's default string encoding and representation

In the following:

my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";

The x{FB01} and x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character è which has the codepoint \x{FB01} is part of the string of $string. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8?
If yes why do I get the following behavior?

my $str = "Some arbitrary string\n";  

if(Encode::is_utf8($str)) {  
        print "YES str IS UTF8!\n";  
}  
else {  
        print "NO str IT IS NOT UTF8\n";   
}  

This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string) returns true.
In what way are $string and $str different and one is considered UTF-8 and the other not?
And in any case what is the encoding of $str? ASCII? Is this the default for Perl?

like image 788
Cratylus Avatar asked Jun 20 '13 19:06

Cratylus


People also ask

What UTF-8 means?

UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character.

What are UTF-8 strings?

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”

What encoding is string in JavaScript?

In order to encode/decode a string in JavaScript, We are using built-in functions provided by JavaScript. btoa(): This method encodes a string in base-64 and uses the “A-Z”, “a-z”, “0-9”, “+”, “/” and “=” characters to encode the provided string.

What is a string encoding?

Since Python 3.0, strings are stored as Unicode, i.e. each character in the string is represented by a code point. So, each string is just a sequence of Unicode code points. For efficient storage of these strings, the sequence of code points is converted into a set of bytes. The process is known as encoding.


1 Answers

In C, a string is a collection of octets, but Perl has two string storage formats:

  • String of 8-bit values.
  • String of 72-bit values. (In practice, limited to 32-bit or 64-bit.)

As such, you don't need to encode code points to store them in a string.

my $s = "\x{2660}\x{2661}";
say length $s;                            # 2
say sprintf '%X', ord substr($s, 0, 1);   # 2660
say sprintf '%X', ord substr($s, 1, 1);   # 2661

(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)

Encode's is_utf8 reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.

  • An 8-bit string can store the value of "abc" (or the string in the OP's $str), so Perl used the more efficient 8-bit (UTF8=0) string format.
  • An 8-bit string can't store the value of "\x{2660}\x{2661}" (or the string in the OP's $string), so Perl used the 72-bit (UTF8=1) string format.

Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.

  • You can store code points in an 8-bit string (if they're small enough) just as easily as a 72-bit string.
  • You can store bytes in a 72-bit string just as easily as an 8-bit string.

In fact, Perl will switch between the two formats at will. For example, if you concatenate $string with $str, you'll get a string in the 72-bit format.

You can alter the storage format of a string with the builtins utf8::downgrade and utf8::upgrade, should you ever need to work around a bug.

utf8::downgrade($s);  # Switch to strings of  8-bit values (UTF8=0).
utf8::upgrade($s);    # Switch to strings of 72-bit values (UTF8=1).

You can see the effect using Devel::Peek.

>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x7bab9c "\200"\0
  CUR = 1
  LEN = 12

>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
  CUR = 2
  LEN = 12
like image 141
ikegami Avatar answered Sep 23 '22 17:09

ikegami