
What regex is \b equivalent to and is there a way to deparse it?

Tags:

regex

perl

I'm interested in changing the regex word boundary \b to include other characters (for example, a . wouldn't count as a boundary). I understand that it is a boundary between \w and \W characters.

  local $_ = ".test";  # lexical "my $_" is a compile error since Perl 5.24
  if ( /(\btest\b)/ ){
    print;
    print " $1\n";
  }
  if ( /((?:(?<=\W)|^)test(?:(?=\W)|$))/ ){
    print;
    print " $1\n";
  }

This is what I came up with, and all I'd have to do is change \W to something like [^\w.], but I still want to know how Perl interprets \b in a regular expression. I tried deparsing it like this:

use B::Deparse;

my $deparser = B::Deparse->new("-sC", "-x10");

print $deparser->coderef2text( sub {
          local $_ = ".test";  # lexical "my $_" is a compile error since Perl 5.24
          if ( /(\btest\b)/ ){
            print;
            print " $1\n";
          }
          if ( /((?:(?<=\W)|^)test(?:(?=\W)|$))/ ){
            print;
            print " $1\n";
          }
       });

I was hoping it would expand \b into what it was equivalent to. What is \b equivalent to? Can you deparse \b or other expressions further somehow?

hmatt1 asked Sep 02 '25 01:09


1 Answer

\b is functionally equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w).

\B is functionally equivalent to (?<!\w)(?!\w)|(?<=\w)(?=\w).
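
As a rough check of the equivalence, and a sketch of the customization the question asks for, the script below lists every offset at which each pattern matches. The helper boundary_positions and the "dot-aware" pattern are illustrative names and assumptions, not something Perl provides.

```perl
use strict;
use warnings;

# Lookaround spelling of \b, as given above.
my $b_equiv = qr/(?<!\w)(?=\w)|(?<=\w)(?!\w)/;

# Hypothetical variant (not built into Perl): "." also counts as a word character.
my $wc           = qr/[\w.]/;
my $dot_boundary = qr/(?<!$wc)(?=$wc)|(?<=$wc)(?!$wc)/;

# Return every offset in $str at which the zero-width pattern $re matches.
sub boundary_positions {
    my ($re, $str) = @_;
    return grep { $str =~ /\A.{$_}(?:$re)/s } 0 .. length $str;
}

for my $str (".test", "a.b c", "") {
    printf "%-8s \\b=[%s] lookaround=[%s] dot-aware=[%s]\n",
        qq("$str"),
        join(",", boundary_positions(qr/\b/,        $str)),
        join(",", boundary_positions($b_equiv,      $str)),
        join(",", boundary_positions($dot_boundary, $str));
}
```

For ".test", \b and the lookaround spelling both match at offsets 1 and 5, while the dot-aware variant matches at 0 and 5, so "test" is no longer preceded by a boundary.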


The goal of Deparse is to produce a readable representation of Perl's understanding of the code. For example, the statements "f() and g();" and "g() if f();" compile identically, so Deparse prints the more readable form, g() if f();, for both.

$ perl -MO=Deparse -e'f() and g()'
g() if f();
-e syntax OK

This means that if \b and (?<!\w)(?=\w)|(?<=\w)(?!\w) compiled to the same code, Deparse would still give you \b if it understood compiled regex. Deparse is not what you want.
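
A minimal sketch of the same point from inside a script: deparsing a sub that contains a regex hands back the pattern as written, with \b left alone. The sub body here is only an illustration.

```perl
use strict;
use warnings;
use B::Deparse;

# Deparse a small sub whose body is a single match against a \b pattern.
my $deparser = B::Deparse->new;
my $text     = $deparser->coderef2text(sub { return $_[0] =~ /\btest\b/ });

# The deparsed source still spells the pattern with \b; nothing is expanded.
print "$text\n";
print "pattern kept as written\n" if index($text, '\btest\b') >= 0;
```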


Maybe you're thinking of Concise. It shows what really gets executed. Notice the use of and in the following even though the original Perl uses if:

$ perl -MO=Concise,-exec -e'g() if f()'
1  <0> enter
2  <;> nextstate(main 1 -e:1) v:{
3  <0> pushmark s
4  <#> gv[*f] s/EARLYCV
5  <1> entersub[t6] sKS/TARG
6  <|> and(other->7) vK/1
7      <0> pushmark s
8      <#> gv[*g] s/EARLYCV
9      <1> entersub[t3] vKS/TARG
a  <@> leave[1 ref] vKP/REFC
-e syntax OK

But like Deparse, Concise knows nothing of the regex program the regex engine created from the string. So this is still not what you want.


However, there is an equivalent of Concise for regex patterns: use re 'debug';.

$ perl -Mre=debug -E'qr/\b/'
Compiling REx "\b"
Final program:
   1: BOUNDU (2)
   2: END (0)
stclass BOUNDU minlen 0
Freeing REx: "\b"

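So \b compiles to a single dedicated operation (BOUNDU here, the Unicode-rules form of the boundary node) rather than being expanded into lookarounds. For comparison, the lookaround spelling compiles to a much longer program: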

$ perl -Mre=debug -E'qr/(?<!\w)(?=\w)|(?<=\w)(?!\w)/'
Compiling REx "(?<!\w)(?=\w)|(?<=\w)(?!\w)"
Final program:
   1: BRANCH (12)
   2:   UNLESSM[-1] (7)
   4:     POSIXU[\w] (5)
   5:     SUCCEED (0)
   6:   TAIL (7)
   7:   IFMATCH[0] (23)
   9:     POSIXU[\w] (10)
  10:     SUCCEED (0)
  11:   TAIL (23)
  12: BRANCH (FAIL)
  13:   IFMATCH[-1] (18)
  15:     POSIXU[\w] (16)
  16:     SUCCEED (0)
  17:   TAIL (18)
  18:   UNLESSM[0] (23)
  20:     POSIXU[\w] (21)
  21:     SUCCEED (0)
  22:   TAIL (23)
  23: END (0)
minlen 0
Freeing REx: "(?<!\w)(?=\w)|(?<=\w)(?!\w)"
ikegami answered Sep 05 '25 04:09