For several hours now I have been fighting a bug in my Perl program. I am not sure whether I am doing something wrong or the interpreter is, but the code is non-deterministic when, in my opinion, it should be deterministic. It also exhibits the same behavior on ancient Debian Lenny (Perl 5.10.0) and on a server just upgraded to Debian Wheezy (Perl 5.14.2). It boiled down to this piece of Perl code:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
It initializes the Perl 5 interpreter in strict mode with warnings enabled, with character strings (as opposed to byte strings), and with the named standard streams set to the :utf8 layer (Perl's internal, lax notion of UTF-8, but pretty close to the real thing; switching to the strict :encoding(UTF-8) layer makes no difference). Then it opens a file handle to an "in-memory file" (a scalar variable), prints a single character that encodes to two UTF-8 bytes into it, and examines the variable after closing the handle.
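For reference, this is the strict-encoding variant I mean (a sketch, assuming :encoding(UTF-8) is what counts as "full UTF-8"; as said above, it behaves the same for me):
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
# Same test as above, with the strict :encoding(UTF-8) layer in place of
# the lax :utf8 layer; per the observation above, the outcome does not change.
binmode STDOUT, ":encoding(UTF-8)";
binmode STDERR, ":encoding(UTF-8)";
my $c = "";
open C, ">:encoding(UTF-8)", \$c or die "Cannot open: $!\n";
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";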
The scalar variable now always has its UTF8 flag turned off. However, it sometimes contains a byte string (which can be converted to a character string via utf8::decode()) and sometimes a character string that just needs its UTF8 flag flipped back on (Encode::_utf8_on()).
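For context, here is my own small illustration (separate from the bug itself) of what those two repair paths do to the two bytes in question, i.e. the UTF-8 encoding of "š" (U+0161):
#!/usr/bin/perl
use warnings;
use strict;
use Encode ();
binmode STDOUT, ":utf8";

# utf8::decode() validates the bytes, converts the string to a character
# string in place and returns true on success.
my $via_decode = "\305\241";
utf8::decode($via_decode) or warn "utf8::decode failed\n";
printf "via decode:   is_utf8=%d length=%d\n",
    utf8::is_utf8($via_decode) ? 1 : 0, length $via_decode;   # 1 and 1

# Encode::_utf8_on() merely flips the UTF8 flag on without any validation,
# which is why it is documented as dangerous.
my $via_flag = "\305\241";
Encode::_utf8_on($via_flag);
printf "via _utf8_on: is_utf8=%d length=%d\n",
    utf8::is_utf8($via_flag) ? 1 : 0, length $via_flag;       # 1 and 1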
When I execute my code repeatedly (1000 times, via Bash), it prints Undecoded and Decoded with approximately the same frequencies. When I change the string I write into the "file", e.g. add a newline at its end, Undecoded disappears. When utf8::decode succeeds and I try it for the same original string in a loop, it keeps succeeding within the same instance of the interpreter; however, if it fails, it keeps failing.
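Roughly this kind of in-process loop, repeating the whole open/print/close/decode cycle each time (a sketch; per the observation above, the verdict is the same on every iteration of a given run):
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
# Redo the write-and-decode cycle several times inside one interpreter.
for my $i (1 .. 10) {
    my $c = "";
    open C, ">:utf8", \$c or die "Cannot open: $!\n";
    print C "š";
    close C or die "Cannot close: $!\n";
    print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
}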
What is the explanation for the observed behavior? How can I use a file handle to a scalar variable together with character strings?
Bash playground:
for i in {1..1000}; do perl -we 'use strict; use utf8; binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; my $c = ""; open C, ">:utf8", \$c; print C "š"; close C; die "Does not happen\n" if utf8::is_utf8($c); print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";'; done | grep Undecoded | wc -l
For reference and to be absolutely sure, I also made a version with pedantic error handling – same results.
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8" or die "Cannot binmode STDOUT\n";
binmode STDERR, ":utf8" or die "Cannot binmode STDERR\n";
my $c = "";
open C, ">:utf8", \$c or die "Cannot open: $!\n";
print C "š" or die "Cannot print: $!\n";
close C or die "Cannot close: $!\n";
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
Examining $c in detail reveals that the varying outcome has nothing to do with the content of $c or its internals, and that the result of decode accurately reflects what it did or didn't do:
$ for i in {1..2}; do
perl -MDevel::Peek -we'
use strict; use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
Dump($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
Dump($c)
'
echo
done
SV = PV(0x17c8470) at 0x17de990
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x17d7a40 "\305\241"
CUR = 2
LEN = 16
Decoded
SV = PV(0x17c8470) at 0x17de990
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x17d7a40 "\305\241" [UTF8 "\x{161}"]
CUR = 2
LEN = 16
SV = PV(0x2d0fee0) at 0x2d26400
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2d1f4b0 "\305\241"
CUR = 2
LEN = 16
Undecoded
SV = PV(0x2d0fee0) at 0x2d26400
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2d1f4b0 "\305\241"
CUR = 2
LEN = 16
This was a bug in utf8::decode, but it was fixed in 5.16.3 or earlier, probably in 5.16.0, since it was still present in 5.14.2.
A suitable workaround is to use Encode's decode_utf8 instead.
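For example, a minimal sketch of that workaround, keeping the structure of the original test but decoding with Encode instead of utf8::decode:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use Encode qw(decode_utf8);
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c or die "Cannot open: $!\n";
print C "š";
close C or die "Cannot close: $!\n";
# decode_utf8 returns a new character string and leaves $c alone, so the
# flaky in-place behaviour of utf8::decode never comes into play.
my $chars = decode_utf8($c);
print $chars eq "š" ? "Decoded\n" : "Undecoded\n";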