Perl's $-[0] produces unexpected results for non-ASCII data

Tags: regex, utf-8, perl

Consider the following input data in file y.txt (encoded in UTF-8).

bar
föbar

and a file y.pl, which puts the two input lines into an array and processes them, looking for substring start positions.

use open qw(:std :utf8);

my @array;

while (<>) {
  push @array, $_;
  print $-[0] . "\n" if /bar/;
}

# $array[0] = "bar\n", $array[1] = "föbar\n" (the lines are not chomped)
print $-[0] . "\n" if $array[1] =~ /$array[0]/u;

If I call perl y.pl < y.txt, I get

0
2
3

as the output. However, I would expect the last number to be 2 as well, since "bar" starts at character offset 2 in "föbar". For some reason the second match, which interpolates $array[0] into the pattern, behaves differently. What am I missing? I guess it's an encoding issue, but nothing I tried helped. This is Perl 5.18.2.

Asked Sep 19 '16 05:09 by lemzwerg

1 Answer

It appears to be a bug in 5.18.

$ 5.18.2t/bin/perl a.pl a
0
2
3

$ 5.20.1t/bin/perl a.pl a
0
2
2
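
For reference, the stray 3 is exactly what you would get as a byte offset into the UTF-8 encoded form of "föbar", where ö takes two bytes. Whether that is really what goes wrong inside 5.18 is only a guess on my part, but the following minimal sketch at least shows the two numbers side by side. The variable names are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "föbar" as decoded characters: f, ö (U+00F6), b, a, r
my $chars = "f\x{f6}bar";

# The same text as raw UTF-8 bytes; ö becomes the two bytes 0xC3 0xB6
my $bytes = encode('UTF-8', $chars);

# $-[0] is the start offset of the last successful match
my $char_offset = $chars =~ /bar/ ? $-[0] : undef;
my $byte_offset = $bytes =~ /bar/ ? $-[0] : undef;

print "character offset: $char_offset\n";   # 2
print "byte offset:      $byte_offset\n";   # 3
```

With properly decoded input (as `use open qw(:std :utf8);` provides), the character offset 2 is the one $-[0] is supposed to report.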

I can't find a general workaround. Adding utf8::downgrade($array[0]); or utf8::downgrade($array[0], 1); works for the case you presented, but not for the following data, or for any other input where the interpolated pattern contains characters above 255, since such strings cannot be downgraded.

♠bar
f♠♠bar
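
To illustrate why the downgrade trick is limited, here is a small sketch of how utf8::downgrade behaves with its second (FAIL_OK) argument set to 1: it returns true when the string could be downgraded and false when it could not, instead of dying. The strings and variable names below are made up for the example; ♠ is U+2660.

```perl
use strict;
use warnings;

# A pattern string whose characters all fit in one byte (ö is U+00F6),
# stored in Perl's upgraded (UTF-8) internal format
my $low = "f\x{f6}bar";
utf8::upgrade($low);

# A pattern string containing a character above 255 (U+2660, the spade)
my $high = "\x{2660}bar";

# With FAIL_OK = 1, failure is reported via the return value
my $low_ok  = utf8::downgrade($low, 1);
my $high_ok = utf8::downgrade($high, 1);

print "low:  ", ($low_ok  ? "downgraded" : "cannot be downgraded"), "\n";
print "high: ", ($high_ok ? "downgraded" : "cannot be downgraded"), "\n";
```

This prints "downgraded" for the first string and "cannot be downgraded" for the second, so the workaround is only available when every character of the interpolated pattern is below 256.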

It appears that this can only be fixed by upgrading your Perl, which is actually quite simple. (Just make sure to install it to a different directory than your system perl by following the instructions in INSTALL!)

Answered Nov 16 '22 14:11 by ikegami