Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I identify the "tokens" (wrong word) of a regular expression

I'm working on a fairly specialized search engine implementation in Perl, it searches (by regex) documents for specifically delimited (a subset of :punct:) strings from a text file. I'm doing the usual search engine indexing tricks, but there's a problem.

Some of the search regex patterns include, by necessity, delimiters used in the file. "Ok, I think to myself, "word proximity, then... easy" ...and that side of the equation is straight forward enough.

The trick is that because the search patterns are regular expressions, I haven't easily determined the specific words that I should go looking for in the indexed data (think "split" if we're talking about more ordinary strings).

Trivial example, "square[\s-]*dance" would match directly on "squaredance" but a proximity match on "square dance" and "square-dance" (since '-' is a delimiter). I need to know, based on the regex, to look for "square" and "dance" separately, but nearby each other.

I'm game for the challenge, but I'd rather use established code. My gut tells me that it'll be an internal hook to the regex engine, but I don't know of anything like that. Any suggestions?

like image 956
Trueblood Avatar asked May 10 '10 18:05

Trueblood


1 Answers

The re pragma can produce the information you seem to be interested in.

use strict;
use warnings;
use re qw(Debug DUMP);

my $re = qr/square[\s-]*dance/;

'Let\'s go to the square dance!' =~ $re;

Output:

Compiling REx "square[\s-]*dance"
Final program:
   1: EXACT <square> (4)
   4: STAR (17)
   5:   ANYOF[\11\12\14\15 \-][+utf8::IsSpacePerl] (0)
  17: EXACT <dance> (20)
  20: END (0)
anchored "square" at 0 floating "dance" at 6..2147483647 (checking anchored) minlen 11 
Freeing REx: "square[\s-]*dance"

Unfortunately, there doesn't appear to be a programmatic hook to get this information. You'd have to intercept the output on STDERR and parse it. Rough proof-of-concept:

sub build_regexp {
    my $string = shift;
    my $dump;

    # save off STDERR and redirect to scalar
    open my $stderr, '>&', STDERR or die "Can't dup STDERR";
    close STDERR;
    open STDERR, '>', \$dump or die;

    # Compile regexp, capturing DUMP output in $dump
    my $re = do {
        use re qw(Debug DUMP);
        qr/$string/;
    };

    # Restore STDERR
    close STDERR;
    open STDERR, '>&', $stderr or die "Can't restore STDERR";

    # Parse DUMP output
    my @atoms = grep { /EXACT/ } split("\n", $dump);

    return $re, @atoms;
}

Use it this way:

my ($re, @atoms) = build_regexp('square[\s-]*dance');

$re contains the pattern, @atoms contains a lists of the literal portions of the pattern. In this case, that's

   1: EXACT <square> (4)
  17: EXACT <dance> (20)
like image 62
Michael Carman Avatar answered Oct 23 '22 01:10

Michael Carman