Why doesn't Perl v5.22 find all the sentence boundaries?

This is fixed in Perl 5.22.1. I write about it in "Perl v5.22 adds fancy Unicode word boundaries".


Perl v5.22 added the Unicode assertions from TR #29. I've been playing with the sentence boundary assertion, but it only seems to find the start and end of text:

use v5.22;

$_ = "See Spot. (Spot is a dog.) See Spot run. Run Spot, run!\x{2029}New paragraph.";

while( m/\b{sb}/g ) {
    say "Sentence boundary at ", pos;
    }

The output notes sentence boundaries at the start and end of text, but not after the full stops, the sentence terminators, or the parens:

Sentence boundary at 0
Sentence boundary at 70

The Unicode breaks tester shows them mostly where I expect them based on TR #29.

I couldn't find any non-trivial tests in the perl source for this feature. I'm digesting the technical report to create appropriate test cases, but so far this looks like another untested and broken feature.

brian d foy asked Apr 25 '16 05:04


1 Answer

Calle Dybedahl's comment gets it right (and when they turn it into an answer I'll accept that). This was a broken feature in v5.22.0, and as far as I can tell, untested. I had an issue compiling stuff with the latest perls last night and ended the day with the question.

The perl5.22.1 perldelta does not mention the particular changes (and "mention" might be too strong since it merely alludes to possible things that were wrong without enumerating them). It mentions an incompatible change with 5.20.0 (a cut and paste error?), a "single" exception, then more than one issue. The reference to "sane" made me think that all of the changes were related to the panic issue in the next subsection. The mention of "several bugs" with only one rt.perl.org reference made me think those bugs were related to the panic issue.

=head1 Incompatible Changes

There are no changes intentionally incompatible with 5.20.0 other than the following single exception, which we deemed to be a sensible change to make in order to get the new C<\b{wb}> and (in particular) C<\b{sb}> features sane before people decided they're worthless because of bugs in their Perl 5.22.0 implementation and avoided them in the future. If any others exist, they are bugs, and we request that you submit a report. See L</Reporting Bugs> below.

=head2 Bounds Checking Constructs

Several bugs, including a segmentation fault, have been fixed with the bounds checking constructs (introduced in Perl 5.22) C<\b{gcb}>, C<\b{sb}>, C<\b{wb}>, C<\B{gcb}>, C<\B{sb}>, and C<\B{wb}>. All the C<\B{}> ones now match an empty string; none of the C<\b{}> ones do. L<[perl #126319]|https://rt.perl.org/Ticket/Display.html?id=126319>

Additionally, perlrebackslash, where the new boundaries are documented, doesn't mention that they don't work in v5.22.0.
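The empty-string behavior described in the quoted perldelta can be checked directly. This is a minimal sketch, not part of the original post: the `matches_empty` helper and the `%assertions` table are hypothetical names introduced for illustration, and the `use v5.22.1` line assumes you want the script to refuse to run on the broken v5.22.0. Going by the perldelta text, the `\B{}` forms should report a match and the `\b{}` forms should not.

```perl
use v5.22.1;   # assumption: require the fixed point release or later

# Hypothetical helper: does this compiled pattern match the empty string?
sub matches_empty {
    my( $re ) = @_;
    return '' =~ $re ? 1 : 0;
}

# Each new boundary assertion, spelled out literally so nothing
# depends on interpolating a name into \b{...}.
my %assertions = (
    '\b{gcb}' => qr/\b{gcb}/,
    '\b{sb}'  => qr/\b{sb}/,
    '\b{wb}'  => qr/\b{wb}/,
    '\B{gcb}' => qr/\B{gcb}/,
    '\B{sb}'  => qr/\B{sb}/,
    '\B{wb}'  => qr/\B{wb}/,
);

for my $name ( sort keys %assertions ) {
    my $result = matches_empty( $assertions{$name} )
        ? 'matches'
        : 'does not match';
    say "$name $result the empty string";
}
```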

I disregarded a possible fix because of incongruities in the perldelta and the prior experience I've had that new features aren't adequately (or even at all) tested in the perl source. I prematurely cut off that line of investigation and could have saved myself a couple of hours. It's certainly my fault for not getting the code running on the latest binaries, but I had become fixated on the idea that I was doing something wrong and that my code was the problem. Despite my numerous past experiences to the contrary, I wasn't entertaining thoughts (other than an update to the UCD) that perl was wrong.

Now that I'm at a different machine and have a working perl-5.22.1, I see that my program works as expected in the point release. The perldelta could have been much better here.
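One way to avoid repeating the lost hours is to make the program refuse to run on v5.22.0 at all. The following is a sketch under that assumption: it is the program from the question with the minimum version raised to v5.22.1, and the `sentence_boundaries` helper is a hypothetical name added here so the positions can be collected and tested rather than only printed.

```perl
use v5.22.1;   # assumption: fail early on v5.22.0, where \b{sb} is broken

# Hypothetical helper: return every offset at which the sentence
# boundary assertion matches in the given text.
sub sentence_boundaries {
    my( $text ) = @_;
    my @positions;
    while( $text =~ m/\b{sb}/g ) {
        push @positions, pos( $text );
    }
    return @positions;
}

# Same sample text as in the question.
my $text = "See Spot. (Spot is a dog.) See Spot run. Run Spot, run!\x{2029}New paragraph.";

say "Sentence boundary at $_" for sentence_boundaries( $text );
```

On a perl older than v5.22.1 the `use` line stops compilation with a version error, which points at the real problem immediately instead of leaving only the start and end positions in the output.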

brian d foy answered Oct 08 '22 10:10