
How can I work with raw bytes in Perl?

All the documentation I find directs me to Unicode support, yet I don't think my request has anything to do with Unicode. I want to work with raw bytes within a single scalar: I need to be able to find its length (in bytes), take substrings of it (in bytes), and write the bytes to disk and over the network. Is there an easy way to do this in Perl without treating the bytes as any particular encoding?

EDIT

More explicitly,

my $data = "Perl String, unsure of encoding and don't need to know";
my @data_chunked_into_1024_bytes_each = #???

Cory Kendall asked Feb 19 '23 03:02


2 Answers

Perl strings are, conceptually, strings of characters, which are non-negative 32-bit integers that (normally) represent Unicode code points. A byte string, in Perl, is just a string in which all the characters have values less than 256.

(That's the conceptual view. The internal representation is somewhat more complicated, as the perl interpreter tries to store byte strings — in the above sense — as actual byte strings, while using a generalized UTF-8 encoding for strings that contain character values of 256 or higher. But this is all supposed to be transparent to the user, and in fact mostly is, except for some ugly historical corner cases like the bitwise not (~) operator.)
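
To make the "byte string" idea concrete, here is a minimal sketch. The helper is_byte_string is a hypothetical name made up for this illustration, not a built-in. For a string that passes the check, length and substr count characters, and since every character is one byte, they count bytes too.

```perl
use strict;
use warnings;

# Hypothetical helper (not a built-in): returns 1 if every character
# in the string has a value below 256, and 0 otherwise.
sub is_byte_string {
    my ($s) = @_;
    return $s !~ /[^\x00-\xFF]/ ? 1 : 0;
}

# Six bytes of arbitrary binary data, including a NUL and a high byte.
my $bytes = "\x00\xFF\x7F" . "abc";

# Six characters, but U+263A is above 255, so this is not a byte string.
my $wide = "caf\x{e9} \x{263A}";

# For a byte string, length() is the size in bytes and substr() slices by byte.
my $size  = length($bytes);
my $slice = substr($bytes, 1, 2);    # the second and third bytes

printf "size=%d byte_string=%d\n", $size, is_byte_string($bytes);
printf "wide is byte_string=%d\n", is_byte_string($wide);
```

If the check fails, the string holds text rather than bytes, and it has to be encoded before length and substr can be read as byte counts.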

As for how to turn a general string into a byte string, that really depends on what the string you have contains and what the byte string is supposed to contain:

  • If your string already is a string of bytes (e.g. if you read it from a file in binary mode), then you don't need to do anything. The string shouldn't contain any characters above 255 to begin with, and if it does, that's an error and will probably be reported as such by whatever code consumes the bytes.

  • Similarly, if your string is supposed to encode text in the ASCII or ISO-8859-1 encodings (which encode the 7- and 8-bit subsets of Unicode respectively), then you don't need to do anything: any characters up to 255 are already correctly encoded, and any higher values are invalid for those encodings.

  • If your input string contains (Unicode) text that you want to encode in some other encoding, then you'll need to convert the string to that encoding. The usual way to do that is by using the Encode module, like this:

    use Encode;
    my $byte_string = encode( "name of encoding", $text_string );
    

    Obviously, you can convert the byte string back to the corresponding character string with:

    use Encode;
    my $text_string = decode( "name of encoding", $byte_string );
    
  • For the special case of the UTF-8 encoding, it's also possible to use the built-in utf8::encode() function instead of Encode::encode():

    utf8::encode( $string );
    

    which does essentially the same thing as:

    use Encode;
    $string = encode( "utf8", $string );
    

    Note that, unlike Encode::encode(), the utf8::encode() function modifies the input string directly. Also note that the "utf8" above refers to Perl's extended UTF-8 encoding, which allows values outside the official Unicode range; for strictly standards-compliant UTF-8 encoding, use "utf-8" with a hyphen (see the Encode documentation for the gory details). And, yes, there's also a utf8::decode() function that does pretty much what you'd expect.
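
The cases in the list above can be tried out with a short sketch like the following. The sample text and variable names are assumptions chosen for illustration; the temporary file only stands in for "write the bytes to disk".

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Sample text (an assumption for this sketch): 7 characters, one above 255.
my $text = "na\x{ef}ve \x{263A}";

# Encode::encode returns a new byte string and leaves $text untouched.
my $utf8_bytes = encode("UTF-8", $text);

# utf8::encode modifies its argument in place, so work on a copy.
my $in_place = $text;
utf8::encode($in_place);

# Write the bytes to a temporary file in binary mode, then read them back.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $utf8_bytes;
close $out or die "close failed: $!";

open my $in, '<:raw', $path or die "open failed: $!";
my $from_disk = do { local $/; <$in> };    # slurp the whole file
close $in;

# Decoding the bytes gives back the original character string.
my $round_trip = decode("UTF-8", $from_disk);

# prints "characters: 7, bytes: 10"
printf "characters: %d, bytes: %d\n", length($text), length($utf8_bytes);
```

Once the text has been encoded, length and substr on the result count bytes, which is what you want before writing to a file or socket.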

Ilmari Karonen answered Feb 24 '23 18:02


If I understood your question correctly, what you want is the pack/unpack functions: http://perldoc.perl.org/functions/pack.html
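
As a rough sketch of how unpack could handle the 1024-byte chunking from the question, assuming $data is already a byte string (all character values below 256). The sample data and the 4-byte length prefix used for framing are assumptions for illustration, not something the question specifies.

```perl
use strict;
use warnings;

# Stand-in for real binary data: 2500 bytes.
my $data = "x" x 2500;

# "(a1024)*" repeats "take up to 1024 bytes" until the input is used up;
# the final chunk holds whatever is left over.
my @chunks = unpack "(a1024)*", $data;

# Frame the last chunk for the wire: a 4-byte big-endian length, then the payload.
my $frame = pack "N/a*", $chunks[-1];

# The receiving side reads the length prefix and extracts exactly that many bytes.
my ($payload) = unpack "N/a*", $frame;

printf "%d chunks, last one is %d bytes\n", scalar(@chunks), length($chunks[-1]);
```

A substr loop would do the same job; unpack just expresses the chunking in one line.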

Bitwise answered Feb 24 '23 18:02