
What the heck is a Perl string anyway?

I can't find a basic description of how string data is stored in Perl! It's like all the documentation assumes I already know this for some reason. I know about encode() and decode(), and I know I can read raw bytes into a Perl "string" and output them again without Perl screwing with them. I know about open modes. I also gather Perl must use some internal format to store character strings and can differentiate between character and binary data. Where is this documented???

An equivalent question: given this Perl:

$x = decode($y);

Decode to WHAT and from WHAT??

As far as I can figure, there must be a flag on the string data structure that says this is binary XOR character data (of some internal format which, BTW, is a superset of Unicode: http://perldoc.perl.org/Encode.html#DESCRIPTION). But I'd like it if that were stated in the docs or confirmed/discredited here.

spinkus asked Mar 02 '13 05:03


4 Answers

This is a great question. To investigate, we can dive a little deeper by using Devel::Peek to see what is actually stored in our strings (or other variables).

First, let's start with an ASCII string:

$ perl -MDevel::Peek -E 'Dump "string"'
SV = PV(0x9688158) at 0x969ac30
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0x969ea20 "string"\0
  CUR = 6
  LEN = 12

Then we can turn on Unicode I/O layers and do the same:

$ perl -MDevel::Peek -CSAD -E 'Dump "string"'
SV = PV(0x9eea178) at 0x9efcce0
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0x9f0faf8 "string"\0
  CUR = 6
  LEN = 12

From there, let's try to manually add a wide character:

$ perl -MDevel::Peek -CSAD -e 'Dump "string \x{2665}"'
SV = PV(0x9be1148) at 0x9bf3c08
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK,UTF8)
  PV = 0x9bf7178 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
  CUR = 10
  LEN = 12

From that you can clearly see that Perl has stored this as UTF-8 and set the UTF8 flag. The problem is that if I type the character literally instead of using the \x{} escape, the representation looks more like the regular string:

$ perl -MDevel::Peek -CSAD -E 'Dump "string ♥"'
SV = PV(0x9143058) at 0x9155cd0
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0x9168af8 "string \342\231\245"\0
  CUR = 10
  LEN = 12

All Perl sees is bytes, and it has no way to know that you meant them as a Unicode character, unlike when you entered the escaped code point above. Now let's use decode and see what happens:

$ perl -MDevel::Peek -CSAD -MEncode=decode -E 'Dump decode "utf8", "string ♥"'
SV = PV(0x8681100) at 0x8683068
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x869dbf0 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
  CUR = 10
  LEN = 12

Ta-da! Now you can see that the string is correctly represented internally, matching what you got when you used the \x{} escape.

The actual answer is that it is "decoding" from bytes to characters, but I think it makes more sense when you see the Peek output.

Finally, you can make Perl see your source code as UTF-8 by using the utf8 pragma, like so:
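As a rough illustration of "bytes in, characters out", here is a minimal sketch that does not rely on Devel::Peek. The variable names are made up for the example, and the heart is written as explicit octets so the result does not depend on how the source file is encoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "string ♥", spelled out as octets.
my $bytes = "string \xE2\x99\xA5";

# decode: a string of octets goes in, a string of characters comes out.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;                  # counts octets
printf "chars: length %d\n", length $chars;                  # counts characters
printf "last code point: U+%04X\n", ord substr($chars, -1);  # the heart
```

The three heart octets collapse into a single character, so length drops from 10 to 8 and the last character has code point U+2665.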

$ perl -MDevel::Peek -CSAD -Mutf8 -E 'Dump "string ♥"'
SV = PV(0x8781170) at 0x8793d00
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK,UTF8)
  PV = 0x87973b8 "string \342\231\245"\0 [UTF8 "string \x{2665}"]
  CUR = 10
  LEN = 12
Joel Berger answered Nov 11 '22 19:11


Rather like the fluid string/number status of its scalar variables, the internal format of Perl's strings is variable and depends on the contents of the string.

Take a look at perluniintro, which says this.

Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it uses UTF-8.

What that means is that a string like "I have £ two" is stored as (bytes) I have \xA3 two. (The pound sign is U+00A3.) Now if I append a character above U+00FF, such as U+263A (a smiling face), Perl will convert the whole string to UTF-8 before it appends the new character, giving (bytes) I have \xC2\xA3 two\xE2\x98\xBA. Removing this last character again leaves the string UTF-8 encoded, as I have \xC2\xA3 two.
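The switch of representation can be observed from plain Perl with the built-in utf8::is_utf8, which reports the internal UTF8 flag. This is only a sketch for poking at the internals (the variable names are invented here); normal code should not need to test this flag.

```perl
use strict;
use warnings;

# "I have £ two", with the pound sign given as code point U+00A3.
my $s = "I have \xA3 two";
my $before = utf8::is_utf8($s) ? 1 : 0;        # all code points <= 0xFF

# Appending a code point above 0xFF forces the UTF-8 representation.
$s .= chr(0x263A);
my $after_append = utf8::is_utf8($s) ? 1 : 0;

# Removing the character again does not switch the representation back.
chop $s;
my $after_chop = utf8::is_utf8($s) ? 1 : 0;

print "flag before append: $before\n";
print "flag after append:  $after_append\n";
print "flag after chop:    $after_chop\n";

# Seen from Perl, the value is the same whichever representation is used.
print "same value\n" if $s eq "I have \xA3 two";
printf "length: %d\n", length $s;
```

Whatever the flag says, length and eq work on characters, which is why the internal format can usually be ignored.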

But I wonder why you need to know this. Unless you are writing an XS extension in C the internal format is transparent and invisible to you.

Borodin answered Nov 11 '22 19:11


Perl's internal string format is implementation dependent, but usually a superset of UTF-8. It doesn't matter what it is, because you use decode and encode to convert strings between the internal format and other encodings.

decode converts to Perl's internal format; encode converts from Perl's internal format.

Binary data is stored internally the same way characters 0 through 255 are.

encode and decode just convert between formats. For example, encoding to UTF-8 means each character of the result will be a single octet, using Perl character values 0 through 255, i.e. the string consists of UTF-8 octets.
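To make the "string of octets" idea concrete, here is a small sketch (variable names are illustrative only) that encodes one character and lists the character values of the result.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars  = "\x{263A}";               # one character, U+263A
my $octets = encode('UTF-8', $chars);  # its UTF-8 encoding

# Each "character" of the encoded string is a value in 0..255.
my @values = map { ord } split //, $octets;
printf "%d character encoded as %d octets: %s\n",
    length $chars, length $octets,
    join ' ', map { sprintf '%02X', $_ } @values;

# decode reverses the conversion and gives back the original character.
my $back = decode('UTF-8', $octets);
print "round trip ok\n" if $back eq $chars;
```

The encoded string is an ordinary Perl string whose three characters happen to be the octets E2 98 BA.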

Myforwik answered Nov 11 '22 19:11


Short answer: It's a mess
Slightly longer: The difference isn't visible to the programmer.

Basically, you have to remember whether your string contains bytes or characters, where characters are Unicode code points. If you only encounter ASCII, the difference is invisible, which is dangerous.

Data itself and the representation of such data are distinct, and should not be confused. Strings are (conceptually) a sequence of codepoints, but are represented as a byte array in memory, and represented as some byte sequence when encoded. If you want to store binary data in a string, you re-interpret the number of a codepoint as a byte value, and restrict yourself to codepoints in 0–255.
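A minimal sketch of that convention follows: binary data is just a string whose code points all stay in 0..255. The helper name looks_like_bytes is hypothetical, introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every code point fits in one byte.
sub looks_like_bytes {
    my ($str) = @_;
    return $str !~ /[^\x00-\xFF]/ ? 1 : 0;
}

# Build a "binary" string from byte values and read the values back.
my @byte_values = (0x00, 0x7F, 0x80, 0xFF);
my $binary      = join '', map { chr } @byte_values;
my @read_back   = map { ord } split //, $binary;

# A string holding a code point above 255 cannot be treated as raw bytes.
my $text = "abc\x{2665}";

printf "binary ok: %d, text ok: %d\n",
    looks_like_bytes($binary), looks_like_bytes($text);
```

Nothing in the string itself records which interpretation you intended; the check only tells you whether the byte interpretation is possible.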

(E.g. a file has no encoding. The information in that file has some encoding (be it ASCII, UTF-16 or EBCDIC at a character level, and Perl, HTML or .ini at an application level))

The exact storage format of a string is irrelevant, but you can even store full-width integers as characters inside such a string:

use feature 'say';

# this will work if your perl was compiled with large integers
# (newer perls may refuse code points this large with a fatal error)
my $string = chr 2**64; # this is so not unicode
say ord $string; # 18446744073709551615

The internal format is adjusted to accommodate such values; normal strings won't take up one integer per character.

amon answered Nov 11 '22 18:11