What does the caret ^ in (?^:…) mean in the string form of a Perl qr// Regex?

Tags: regex, perl

Consider this script, which is based on an answer to SO 267399 about parsing Roman numbers, though the parsing of Roman numbers is incidental to this question.

#!/usr/bin/env perl
#
# Based on answer to SO 0026-7399

use warnings;
use strict;

my $qr1 = qr/(?i:M{1,3})/;
my $qr2 = qr/(?i:C[MD]|D?C{1,3})/;
my $qr3 = qr/(?i:X[CL]|L?X{1,3})/;
my $qr4 = qr/(?i:I[XV]|V?I{1,3})/;

print "1000s: $qr1\n";
print " 100s: $qr2\n";
print "  10s: $qr3\n";
print "   1s: $qr4\n";

# This $qr is too simple — it matches the empty string
#my $qr = qr/($qr1?$qr2?$qr3?$qr4?)/;

my $qr = qr/\b((?:$qr1$qr2?$qr3?$qr4?)|(?:$qr2$qr3?$qr4?)|(?:$qr3$qr4?)|(?:$qr4))\b/;

print " Full: $qr\n";

while (<>)
{
    chomp;
    print " Line: [$_]\n";
    while ($_ =~ m/$qr/g)
    {
        print "Match: [$1] found in [$_] using qr//\n";
    }
}

Given the data file below, the first three lines each contain a Roman number.

mix in here
no mix in here
mmmcmlxxxix
minimum

When run with (home-built) Perl 5.22.0 on a Mac now running macOS Sierra 10.12.4, I get output like this (but the version of Perl is not critical):

1000s: (?^:(?i:M{1,3}))
 100s: (?^:(?i:C[MD]|D?C{1,3}))
  10s: (?^:(?i:X[CL]|L?X{1,3}))
   1s: (?^:(?i:I[XV]|V?I{1,3}))
 Full: (?^:\b((?:(?^:(?i:M{1,3}))(?^:(?i:C[MD]|D?C{1,3}))?(?^:(?i:X[CL]|L?X{1,3}))?(?^:(?i:I[XV]|V?I{1,3}))?)|(?:(?^:(?i:C[MD]|D?C{1,3}))(?^:(?i:X[CL]|L?X{1,3}))?(?^:(?i:I[XV]|V?I{1,3}))?)|(?:(?^:(?i:X[CL]|L?X{1,3}))(?^:(?i:I[XV]|V?I{1,3}))?)|(?:(?^:(?i:I[XV]|V?I{1,3}))))\b)
 Line: [mix in here]
Match: [mix] found in [mix in here] using qr//
 Line: [no mix in here]
Match: [mix] found in [no mix in here] using qr//
 Line: [mmmcmlxxxix]
Match: [mmmcmlxxxix] found in [mmmcmlxxxix] using qr//
 Line: [minimum]

The only part of the output that I don't understand is the caret ^ in the (?^:…) notation.

I've looked at Perl documentation for perlre and perlref and even the section of perlop on 'Regex quote-like operators' without seeing this exemplified or explained. (I also checked the resources suggested by SO when you ask a question about regexes. The (?^: string is carefully designed to give search engines conniptions.)

There are two parts to my question:

  1. What is the significance of the caret in (?^:…) and what caused it to be added to the qr// regexes?
  2. If it matters, how do you stop it from being added to the qr// regexes?

Jonathan Leffler asked May 10 '17 21:05


2 Answers

Basically, it means the default flags apply (even if the pattern gets interpolated into a regex that specifies different flags). Before it was introduced, qr would produce something like (?-ismx:, and every new flag added to Perl changed that string, which made keeping tests up to date a pain.
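A minimal sketch of the interpolation case, assuming Perl 5.14 or later. The variable names and test strings are made up for illustration. The inner pattern is built without flags and the outer one with /x and /i, so the outer flags should affect only the text written in the outer regex:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Inner pattern: no flags, so the space is literal and case matters.
my $inner = qr/a b/;

# Outer pattern: /x and /i apply to the outer text (the " c " part) only,
# because $inner is interpolated as a (?^:...) group that resets the flags.
my $outer = qr/ $inner c /xi;

my @results;
for my $s ('a bc', 'a bC', 'A bc', 'abc') {
    push @results, [ $s, ($s =~ $outer ? "match" : "no match") ];
}

print "Outer: $outer\n";
printf "%-5s %s\n", "$_->[0]:", $_->[1] for @results;
```

Only the first two strings match: the trailing c is case-insensitive, while the a and the space inside the inner pattern keep their default meaning.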

http://perldoc.perl.org/perlre.html#Extended-Patterns:

Starting in Perl 5.14, a "^" (caret or circumflex accent) immediately after the "?" is a shorthand equivalent to d-imnsx. Flags (except "d") may follow the caret to override it. But a minus sign is not legal with it.
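The three sentences quoted above can be checked with a short script. This is a sketch assuming Perl 5.14 or later; the variable names are illustrative. The spelled-out form leaves out the n flag so that it also compiles on versions before 5.22, where that flag does not exist:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The caret resets the flags inside the group, so the outer /i is ignored.
my $caret_only = 'A' =~ /(?^:a)/i ? "match" : "no match";

# The spelled-out form the caret abbreviates (minus the newer n flag).
my $spelled_out = 'A' =~ /(?d-imsx:a)/i ? "match" : "no match";

# A flag after the caret overrides the default: this group ignores case.
my $caret_i = 'A' =~ /(?^i:a)/ ? "match" : "no match";

# A minus sign after the caret is rejected. The pattern is built from a
# string so that the failure happens at run time inside eval.
my $bad      = '(?^-i:a)';
my $compiled = eval { qr/$bad/ };
my $minus    = defined $compiled ? "compiles" : "rejected";

print "(?^:a) under /i:      $caret_only\n";
print "(?d-imsx:a) under /i: $spelled_out\n";
print "(?^i:a), no /i:       $caret_i\n";
print "(?^-i:a):             $minus\n";
```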

ysth answered Sep 20 '22 05:09


It means "set all flags (such as i, s) to their defaults", so

$ perl -le'my $re = "a"; for (qw( a A )) { print "$_: ", /$re/i ? "match" : "no match"; }'
a: match
A: match

$ perl -le'my $re = "(?^:a)"; for (qw( a A )) { print "$_: ", /$re/i ? "match" : "no match"; }'
a: match
A: no match

It's primarily used to represent patterns created by qr//.

$ perl -le'my $re = qr/a/; print $re; for (qw( a A )) { print "$_: ", /$re/i ? "match" : "no match"; }'
(?^:a)
a: match
A: no match

$ perl -le'my $re = qr/a/i; print $re; for (qw( a A )) { print "$_: ", /$re/i ? "match" : "no match"; }'
(?^i:a)
a: match
A: match

ikegami answered Sep 20 '22 05:09
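
Neither answer addresses the second part of the question, how to keep the caret out of the output. One possible approach, offered as a sketch and not as something either answerer proposed: the (?^:…) wrapper is part of how a qr// object stringifies, so instead of printing the object you can ask the core re module for the pattern text and the flags separately with regexp_pattern. The variable names are illustrative.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use re qw(regexp_pattern);

my $qr = qr/(?i:M{1,3})/;

# In list context regexp_pattern returns the pattern text and the flags
# as two separate strings, without the (?^...) wrapper.
my ($pattern, $flags) = regexp_pattern($qr);

print "Stringified: $qr\n";
print "Pattern:     $pattern\n";
print "Flags:       '$flags'\n";
```

Interpolating $pattern into another regex as a plain string gives up the protection the wrapper provides, so outer flags such as /i or /x would then apply to it.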