Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Determine regular expression's specificity

Tags:

c

regex

pcre

Given the following regular expressions:

 - alice@[a-z]+\.[a-z]+
 - [a-z]+@[a-z]+\.[a-z]+
 - .*

The string [email protected] will obviously match all three regular expressions. In the application I am developing, we are only interested in the 'most specific' match. In this case this is obviously the first one.
Unfortunately there seems no way to do this. We are using PCRE and I did not find a way to do this and a search on the Internet was also not fruitful.
A possible way would be to keep the regular expressions sorted on descending specificity and then simply take the first match. Of course then the next question would be how to sort the array of regular expressions. It is not an option to give the responsability to the end-user to ensure that the array is sorted. So I hope you guys could help me out here...

Thanks !!

Paul

like image 337
Paul Praet Avatar asked Aug 31 '10 18:08

Paul Praet


2 Answers

The following is the solution to this problem I developed based on Donald Miner's research paper, implemented in Python, for rules applied to MAC addresses.

Basically, the most specific match is from the pattern that is not a superset of any other matching pattern. For a particular problem domain, you create a series of tests (functions) which compare two REs and return which is the superset, or if they are orthogonal. This lets you build a tree of matches. For a particular input string, you go through the root patterns and find any matches. Then go through their subpatterns. If at any point, orthogonal patterns match, an error is raised.

Setup

import re

class RegexElement:
    def __init__(self, string,index):
        self.string=string
        self.supersets = []
        self.subsets = []
        self.disjoints = []
        self.intersects = []
        self.maybes = []
        self.precompilation = {}
        self.compiled = re.compile(string,re.IGNORECASE)
        self.index = index

SUPERSET  = object()
SUBSET    = object()
INTERSECT = object()
DISJOINT  = object()
EQUAL     = object()

The Tests

Each test takes 2 strings (a and b) and tries to determine how they are related. If the test cannot determine the relation, None is returned.

SUPERSET means a is a superset of b. All matches of b will match a.

SUBSET means b is a superset of a.

INTERSECT means some matches of a will match b, but some won't and some matches of b won't match a.

DISJOINT means no matches of a will match b.

EQUAL means all matches of a will match b and all matches of b will match a.

    def equal_test(a, b):  
        if a == b: return EQUAL

The graph

  class SubsetGraph(object):
    def __init__(self, tests):
        self.regexps = []
        self.tests = tests
        self._dirty = True
        self._roots = None

    @property
    def roots(self):
        if self._dirty:
            r = self._roots = [i for i in self.regexps if not i.supersets]
            return r
        return self._roots

    def add_regex(self, new_regex):
        roots = self.roots
        new_re = RegexElement(new_regex)
        for element in roots:
            self.process(new_re, element)
        self.regexps.append(new_re)

    def process(self, new_re, element):
        relationship = self.compare(new_re, element)
        if relationship:
            getattr(self, 'add_' + relationship)(new_re, element)

    def add_SUPERSET(self, new_re, element):
        for i in element.subsets:
            i.supersets.add(new_re)
            new_re.subsets.add(i)

        element.supersets.add(new_re)
        new_re.subsets.add(element)

    def add_SUBSET(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        element.subsets.add(new_re)
        new_re.supersets.add(element)

    def add_DISJOINT(self, new_re, element):
        for i in element.subsets:
            i.disjoints.add(new_re)
            new_re.disjoints.add(i)

        new_re.disjoints.add(element)
        element.disjoints.add(new_re)

    def add_INTERSECT(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        new_re.intersects.add(element)
        element.intersects.add(new_re)

    def add_EQUAL(self, new_re, element):
        new_re.supersets = element.supersets.copy()
        new_re.subsets = element.subsets.copy()
        new_re.disjoints = element.disjoints.copy()
        new_re.intersects = element.intersects.copy()

    def compare(self, a, b):
        for test in self.tests:
            result = test(a.string, b.string)
            if result:
                return result

    def match(self, text, strict=True):
        matches = set()
        self._match(text, self.roots, matches)
        out = []
        for e in matches:
            for s in e.subsets:
                if s in matches:
                    break
            else:
                out.append(e)
        if strict and len(out) > 1:
            for i in out:
                print(i.string)
            raise Exception("Multiple equally specific matches found for " + text)
        return out

    def _match(self, text, elements, matches):
        new_elements = []
        for element in elements:
            m = element.compiled.match(text)
            if m:
                matches.add(element)
                new_elements.extend(element.subsets)
        if new_elements:
            self._match(text, new_elements, matches)

Usage

graph = SubsetGraph([equal_test, test_2, test_3, ...])
graph.add_regex("00:11:22:..:..:..")
graph.add_regex("..(:..){5,5}"
graph.match("00:de:ad:be:ef:00")

A complete usable version is here.

like image 199
Perkins Avatar answered Sep 28 '22 10:09

Perkins


My gut instinct says that not only is this a hard problem, both in terms of computational cost and implementation difficulty, but it may be unsolvable in any realistic fashion. Consider the two following regular expressions to accept the string [email protected]

    alice@[a-z]+\.[a-z]+ 
    [a-z][email protected]

Which one of these is more specific?

like image 38
torak Avatar answered Sep 28 '22 11:09

torak