I'm working on a fairly specialized search engine implementation in Perl, it searches (by regex) documents for specifically delimited (a subset of :punct:) strings from a text file. I'm doing the usual search engine indexing tricks, but there's a problem.
Some of the search regex patterns include, by necessity, delimiters used in the file. "Ok, I think to myself, "word proximity, then... easy" ...and that side of the equation is straight forward enough.
The trick is that because the search patterns are regular expressions, I haven't easily determined the specific words that I should go looking for in the indexed data (think "split" if we're talking about more ordinary strings).
Trivial example, "square[\s-]*dance" would match directly on "squaredance" but a proximity match on "square dance" and "square-dance" (since '-' is a delimiter). I need to know, based on the regex, to look for "square" and "dance" separately, but nearby each other.
I'm game for the challenge, but I'd rather use established code. My gut tells me that it'll be an internal hook to the regex engine, but I don't know of anything like that. Any suggestions?
The re
pragma can produce the information you seem to be interested in.
use strict;
use warnings;
use re qw(Debug DUMP);
my $re = qr/square[\s-]*dance/;
'Let\'s go to the square dance!' =~ $re;
Output:
Compiling REx "square[\s-]*dance"
Final program:
1: EXACT <square> (4)
4: STAR (17)
5: ANYOF[\11\12\14\15 \-][+utf8::IsSpacePerl] (0)
17: EXACT <dance> (20)
20: END (0)
anchored "square" at 0 floating "dance" at 6..2147483647 (checking anchored) minlen 11
Freeing REx: "square[\s-]*dance"
Unfortunately, there doesn't appear to be a programmatic hook to get this information. You'd have to intercept the output on STDERR and parse it. Rough proof-of-concept:
sub build_regexp {
my $string = shift;
my $dump;
# save off STDERR and redirect to scalar
open my $stderr, '>&', STDERR or die "Can't dup STDERR";
close STDERR;
open STDERR, '>', \$dump or die;
# Compile regexp, capturing DUMP output in $dump
my $re = do {
use re qw(Debug DUMP);
qr/$string/;
};
# Restore STDERR
close STDERR;
open STDERR, '>&', $stderr or die "Can't restore STDERR";
# Parse DUMP output
my @atoms = grep { /EXACT/ } split("\n", $dump);
return $re, @atoms;
}
Use it this way:
my ($re, @atoms) = build_regexp('square[\s-]*dance');
$re
contains the pattern, @atoms
contains a lists of the literal portions of the pattern. In this case, that's
1: EXACT <square> (4)
17: EXACT <dance> (20)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With