Consider the following input data in file y.txt (encoded in UTF-8):
bar
föbar
and a file y.pl, which puts the two input lines into an array and processes them, looking for substring start positions:
use open qw(:std :utf8);
my @array;
while (<>) {
push @array, $_;
print $-[0] . "\n" if /bar/;
}
# $array[0] = "bar", $array[1] = "föbar"
print $-[0] . "\n" if $array[1] =~ /$array[0]/u;
If I call perl y.pl < y.txt, I get
0
2
3
as the output. However, I would expect the last number to be 2 as well, but for some reason the second /.../ regexp behaves differently. What am I missing? I guess it's an encoding issue, but nothing I tried worked. This is Perl 5.18.2.
It appears to be a bug in 5.18. The same script gives the expected result under 5.20.1:
$ 5.18.2t/bin/perl a.pl a
0
2
3
$ 5.20.1t/bin/perl a.pl a
0
2
2
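One way to read these numbers: 2 is the character offset of bar in föbar, while 3 happens to equal its offset in the UTF-8 encoded bytes (ö takes two bytes). The sketch below only illustrates that difference with index; the helper name offsets is hypothetical, and whether a character/byte mix-up is the actual mechanism behind the 5.18 behaviour is an assumption, not something confirmed here.

```perl
use strict;
use warnings;
use utf8;                  # the string literals below are UTF-8 in this source file
use Encode qw(encode);

# Hypothetical helper: position of $needle in $haystack, counted once in
# characters and once in UTF-8 encoded bytes.
sub offsets {
    my ($haystack, $needle) = @_;
    my $char_offset = index($haystack, $needle);
    my $byte_offset = index(encode('UTF-8', $haystack), encode('UTF-8', $needle));
    return ($char_offset, $byte_offset);
}

print join(' ', offsets('föbar', 'bar')), "\n";     # prints "2 3"
print join(' ', offsets('f♠♠bar', '♠bar')), "\n";   # prints "2 4"
```

The first pair matches the 2 and 3 seen above; the second pair uses the ♠ data, where each ♠ occupies three bytes in UTF-8.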
I can't find a workaround. Adding utf8::downgrade($array[0]); or utf8::downgrade($array[0], 1); works in the case you presented, but not with the following data, or with any other data where the interpolated pattern contains characters above 255:
♠bar
f♠♠bar
It appears that this can only be fixed by upgrading your Perl, which is actually quite simple. (Just make sure to install it to a different directory than your system perl by following the instructions in INSTALL!)