In the following:
my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n";
The x{FB01}
and x{E9}
are code points. And code points are encoded via an encoding scheme to a series of octets.
So the character è
which has the codepoint \x{FB01}
is part of the string of $string
. But how does this work? Are all the characters in this sentence (including the ASCII ones) encoded via UTF-8
?
If yes why do I get the following behavior?
my $str = "Some arbitrary string\n";
if(Encode::is_utf8($str)) {
print "YES str IS UTF8!\n";
}
else {
print "NO str IT IS NOT UTF8\n";
}
This prints "NO str IT IS NOT UTF8\n"
Additionally Encode::is_utf8($string)
returns true
.
In what way are $string
and $str
different and one is considered UTF-8
and the other not?
And in any case what is the encoding of $str
? ASCII? Is this the default for Perl
?
UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character.
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
In order to encode/decode a string in JavaScript, We are using built-in functions provided by JavaScript. btoa(): This method encodes a string in base-64 and uses the “A-Z”, “a-z”, “0-9”, “+”, “/” and “=” characters to encode the provided string.
Since Python 3.0, strings are stored as Unicode, i.e. each character in the string is represented by a code point. So, each string is just a sequence of Unicode code points. For efficient storage of these strings, the sequence of code points is converted into a set of bytes. The process is known as encoding.
In C, a string is a collection of octets, but Perl has two string storage formats:
As such, you don't need to encode code points to store them in a string.
my $s = "\x{2660}\x{2661}";
say length $s; # 2
say sprintf '%X', ord substr($s, 0, 1); # 2660
say sprintf '%X', ord substr($s, 1, 1); # 2661
(Internally, an extension of UTF-8 called "utf8" is used to store the strings of 72-bit chars. That's not something you should ever have to know except to realize the performance implications, but there are bugs that expose this fact.)
Encode's is_utf8
reports which type of string a scalar contains. It's a function that serves absolutely no use except to debug the bugs I previously mentioned.
"abc"
(or the string in the OP's $str
), so Perl used the more efficient 8-bit (UTF8=0) string format."\x{2660}\x{2661}"
(or the string in the OP's $string
), so Perl used the 72-bit (UTF8=1) string format.Zero is zero whether it's stored in a floating point number, a signed integer or an unsigned integer. Similarly, the storage format of strings conveys no information about the value of the string.
In fact, Perl will switch between the two formats at will. For example, if you concatenate $string
with $str
, you'll get a string in the 72-bit format.
You can alter the storage format of a string with the builtins utf8::downgrade
and utf8::upgrade
, should you ever need to work around a bug.
utf8::downgrade($s); # Switch to strings of 8-bit values (UTF8=0).
utf8::upgrade($s); # Switch to strings of 72-bit values (UTF8=1).
You can see the effect using Devel::Peek.
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::downgrade($s); Dump($s);"
SV = PV(0x7b8a74) at 0x4a84c4
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7bab9c "\200"\0
CUR = 1
LEN = 12
>perl -MDevel::Peek -e"$s=chr(0x80); utf8::upgrade($s); Dump($s);"
SV = PV(0x558a6c) at 0x1cc843c
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55ab94 "\302\200"\0 [UTF8 "\x{80}"]
CUR = 2
LEN = 12
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With