Sending binary-safe data over the network in Perl

I'm implementing a network client that sends messages to a server. The messages are streams of bytes, and the protocol requires that I send the length of each stream beforehand.

If the message that I am given (by the code using my module) is a byte string, then the length is given easily enough by length $string. But if it's a string of characters, I'll need to massage it to get the raw bytes. What I'm doing now is basically this:

my $msg = shift;   # some message from calling code
my $bytes;
if ( utf8::is_utf8( $msg ) ) { 
    $bytes = Encode::encode( 'utf-8', $msg );
} else { 
    $bytes = $msg;
}

my $length = length $bytes;

Is this the correct way to handle this? It seems to work so far, but I haven't done any serious testing yet. What potential pitfalls are there with this approach?

Thanks

friedo asked Oct 14 '11 18:10


2 Answers

You shouldn't really be guessing at what your input is. Define your code to accept either byte strings or Unicode character strings, and leave it to the caller to convert the input to the proper format (or provide some way for the caller to specify which kind of strings they're providing).

If you define your code to accept byte strings, then any characters above \xFF are an error.
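
One way to enforce the byte-string contract is sketched below. The helper name assert_bytes is hypothetical; it relies on utf8::downgrade(), which returns false instead of dying when its second argument is true and the string holds a character above \xFF.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only byte strings (every character <= 0xFF).
sub assert_bytes {
    my ($s) = @_;
    my $copy = $s;    # work on a copy so the caller's string is untouched
    # With a true second argument, utf8::downgrade returns false rather
    # than dying if a character above 0xFF is present.
    utf8::downgrade($copy, 1)
        or die "Wide character in byte string\n";
    return $copy;
}

my $ok = assert_bytes("caf\xE9");            # all characters fit in a byte
print "accepted ", length($ok), " bytes\n";

my $rejected = !eval { assert_bytes("\x{263A}"); 1 };
print "rejected: $@" if $rejected;
```

Whether to die, return an error, or silently encode is a design choice for the module; the point is only that the check looks at the characters, not at Perl's internal flag.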

If you define your code to accept Unicode character strings, then you can convert them to bytes with Encode::encode_utf8() (and should do so regardless of how they're internally represented by Perl).
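
As a small illustration (the sample text is made up), the length that goes on the wire has to be taken after encoding, because the character count and the byte count differ as soon as the text contains anything outside ASCII:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Sample character string: "naive" with a diaeresis, a space, and U+263A.
my $text  = "na\x{EF}ve \x{263A}";
my $bytes = encode_utf8($text);

# length() counts characters in $text but bytes in $bytes.
printf "%d characters, %d bytes\n", length($text), length($bytes);
# prints "7 characters, 10 bytes"
```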

In any case, calling utf8::is_utf8() is usually a mistake — your program should not care about the internal representation of strings, only about the actual data (a sequence of characters) they contain. Whether some of those characters (in particular, those in the range \x80 to \xFF) are internally represented by one or two bytes should not matter.
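
To make the pitfall concrete, here is a sketch that wraps the question's logic in a sub (the name guess_bytes is mine) and feeds it the same one-character string in both internal representations:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# The approach from the question, wrapped in a sub for illustration.
sub guess_bytes {
    my ($msg) = @_;
    return utf8::is_utf8($msg) ? encode_utf8($msg) : $msg;
}

my $downgraded = "\xE9";      # U+00E9, stored internally as one byte
my $upgraded   = "\xE9";
utf8::upgrade($upgraded);     # same character, stored internally as two bytes

print "strings are equal\n" if $downgraded eq $upgraded;

# Guessing from the flag sends different data for equal strings.
printf "guess_bytes: %d vs %d bytes\n",
    length(guess_bytes($downgraded)), length(guess_bytes($upgraded));

# Encoding unconditionally gives the same bytes either way.
printf "encode_utf8: %d vs %d bytes\n",
    length(encode_utf8($downgraded)), length(encode_utf8($upgraded));
```

The two strings compare equal, yet the flag-based version produces 1 byte for one and 2 bytes for the other, so the receiver cannot know how to interpret what arrives.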

P.S. Reading perldoc Encode may help to clarify the distinction between bytes and characters in Perl.

Ilmari Karonen answered Sep 24 '22 05:09


The sender:

use Encode qw( encode_utf8 );

sub pack_text {
   my ($text) = @_;
   my $bytes = encode_utf8($text);
   die "Text too long" if length($bytes) > 4294967295;
   return pack('N/a*', $bytes);
}

The receiver:

use Encode qw( decode_utf8 );

sub read_bytes {
   my ($fh, $to_read) = @_;
   my $buf = '';
   while ($to_read > 0) {
      my $bytes_read = read($fh, $buf, $to_read, length($buf));
      die $! if !defined($bytes_read);
      die "Premature EOF" if !$bytes_read;
      $to_read -= $bytes_read;
   }
   return $buf;
}

sub read_uint32 {
   my ($fh) = @_;
   return unpack('N', read_bytes($fh, 4));
}

sub read_text {
   my ($fh) = @_;
   return decode_utf8(read_bytes($fh, read_uint32($fh)));
}
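
A round trip through both halves might look like the sketch below. The subs are repeated from above so the snippet runs on its own; the in-memory filehandle is only my stand-in for a socket, and the sample strings are made up.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Sender: 4-byte big-endian length followed by the UTF-8 bytes.
sub pack_text {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    die "Text too long" if length($bytes) > 4294967295;
    return pack('N/a*', $bytes);
}

# Receiver: loop until exactly $to_read bytes have arrived.
sub read_bytes {
    my ($fh, $to_read) = @_;
    my $buf = '';
    while ($to_read > 0) {
        my $bytes_read = read($fh, $buf, $to_read, length($buf));
        die $! if !defined($bytes_read);
        die "Premature EOF" if !$bytes_read;
        $to_read -= $bytes_read;
    }
    return $buf;
}

sub read_text {
    my ($fh) = @_;
    my $len = unpack('N', read_bytes($fh, 4));
    return decode_utf8(read_bytes($fh, $len));
}

# Two messages back to back, read from an in-memory handle (assumption:
# a real client would read from its socket instead).
my @sent = ("sm\x{F8}rrebr\x{F8}d", "\x{263A}");
my $wire = join '', map { pack_text($_) } @sent;
open(my $fh, '<', \$wire) or die $!;
my @got = (read_text($fh), read_text($fh));

print "message $_ round-trips\n" for grep { $got[$_] eq $sent[$_] } 0 .. $#sent;
```

Because each message carries its own byte length, the receiver can split the stream without looking at the content.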

ikegami answered Sep 25 '22 05:09