
Perl - regex - Position of first nonmatching character

I want to find the position in a string, where a regular expression stops matching.

Simple example:

my $x = 'abcdefghijklmnopqrstuvwxyz';
$x =~ /gho/;

This example should give me the position of the character 'h', because 'h' is the last character that matches and 'o' is the first nonmatching character.

I thought of using pos or @-, but neither is set on an unsuccessful match. Another solution would be to iteratively shorten the regex pattern until it matches, but that's very ugly and doesn't work on complex patterns.

EDIT:

Okay for the linguists: I'm sorry for my awful explanation.

To clarify my situation: if you think of a regular expression as a finite automaton, there is a point where matching stops because a character doesn't fit. That point is what I'm looking for.

Using nested parentheses (as suggested by eugene y) is a nice idea, but it doesn't work with quantifiers and I would have to edit the pattern.

Are there other ideas?

Hachi asked Jan 18 '23 13:01


2 Answers

What you are proposing is difficult but doable.

If I can paraphrase what I understand, you want to find out how far into the string a failing match got before it failed. In order to do this, you need to be able to see what the regex engine is doing.

The best tool for that is probably Perl itself, run with the -Mre=debug command line switch:

$ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{5}/'
Compiling REx "gh[ijkl]{5}"
Final program:
   1: EXACT <gh> (3)
   3: CURLY {5,5} (16)
   5:   ANYOF[i-l][] (0)
  16: END (0)
anchored "gh" at 0 (checking anchored) minlen 7 
Guessing start of match in sv for REx "gh[ijkl]{5}" against "abcdefghijklmnopqr"
Found anchored substr "gh" at offset 6...
Starting position does not contradict /^/m...
Guessed: match at offset 6
Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                  ANYOF[i-l][] can match 4 times out of 5...
                                  failed...
Match failed
Freeing REx: "gh[ijkl]{5}"

You can shell out to that Perl command line with your regex and parse the output (note that re=debug writes its trace to STDERR, not STDOUT). Look for the `

Here is a matching regex:

$ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{3}/'
Compiling REx "gh[ijkl]{3}"
Final program:
   1: EXACT <gh> (3)
   3: CURLY {3,3} (16)
   5:   ANYOF[i-l][] (0)
  16: END (0)
anchored "gh" at 0 (checking anchored) minlen 5 
Guessing start of match in sv for REx "gh[ijkl]{3}" against "abcdefghijklmnopqr"
Found anchored substr "gh" at offset 6...
Starting position does not contradict /^/m...
Guessed: match at offset 6
Matching REx "gh[ijkl]{3}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {3,3}(16)
                                  ANYOF[i-l][] can match 3 times out of 3...
  11 <ghijk> <lmnopqr>       | 16:  END(0)
Match successful!
Freeing REx: "gh[ijkl]{3}"

You will need to build a parser that can handle the output of the Perl re debugger. In each trace line, the leading number is the offset into the string, and the two angle-bracketed fields show the text just before and just after the position the regex engine is currently trying to match.
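As a rough starting point, here is a minimal sketch of such a parser. It assumes the trace format shown in the examples above (which can differ between Perl versions) and that the string being matched contains no '>' characters. The helper name deepest_trace_offset is made up for illustration, and the sample text is copied from the failing run above rather than captured live.

```perl
use strict;
use warnings;

# Hypothetical helper: given the text that `perl -Mre=debug` wrote to STDERR,
# return the largest string offset seen in the execution trace,
# or undef if no trace line was found.
sub deepest_trace_offset {
    my ($debug_text) = @_;
    my $deepest;
    for my $line (split /\n/, $debug_text) {
        # Trace lines look like:
        #    8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
        # Lines of the compiled program ("   1: EXACT <gh> (3)") are skipped
        # because their number is followed by a colon.
        next unless $line =~ /^\s*(\d+)\s+<[^>]*>\s+<[^>]*>\s*\|/;
        $deepest = $1 if !defined $deepest || $1 > $deepest;
    }
    return $deepest;
}

# Trace lines copied from the failing example above.
my $sample = <<'END_TRACE';
Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                  ANYOF[i-l][] can match 4 times out of 5...
                                  failed...
Match failed
END_TRACE

print deepest_trace_offset($sample), "\n";   # prints 8
```

Note that 8 is where the last regex node (the CURLY) started, not where the character class finally gave up. To get the exact failure point you would also have to interpret lines such as "can match 4 times out of 5", and with backtracking the last trace line is not necessarily the deepest one, which is why this sketch takes the maximum.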

This is not an easy project btw...

dawg answered Jan 28 '23 07:01


You can make each later part of the pattern optional with nested groups, capture whatever part does match, and use the index function to find its position:

my $x = 'abcdefghijklmnopqrstuvwxyz';

$x =~ /(g(h(o)?)?)/;
print index($x, $1) + length($1), "\n"; # prints 8, the offset just past the matched 'gh'
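
For a purely literal pattern, the nested groups can be generated instead of written by hand. The following is a minimal sketch under that assumption (it does not handle quantifiers or other regex syntax); the helper name prefix_match_end is made up for illustration. It reads the end offset from the built-in @+ array instead of calling index, so it does not depend on where the captured text first occurs in the string.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal such as 'gho' into the nested optional
# pattern g(?:h(?:o)?)? and return the offset just past the part of the
# literal that matched at the leftmost position where its first character
# is found. Returns undef if not even the first character occurs in $text.
sub prefix_match_end {
    my ($text, $literal) = @_;
    my @chars = map { quotemeta($_) } split //, $literal;
    return unless @chars;

    # Build the pattern from the inside out: o, then h(?:o)?, then g(?:h(?:o)?)?
    my $pattern = pop @chars;
    for my $char (reverse @chars) {
        $pattern = "$char(?:$pattern)?";
    }

    # $+[0] is the offset of the end of the whole match.
    return $text =~ /$pattern/ ? $+[0] : undef;
}

my $x = 'abcdefghijklmnopqrstuvwxyz';
print prefix_match_end($x, 'gho'), "\n";   # prints 8
```

As with the hand-written version, this reports the leftmost place where the literal starts to match, which is not necessarily the place where the longest part of it matches.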
Eugene Yarmash answered Jan 28 '23 05:01