What is the fastest way in Python to find if a string matches any terms in a list of words, phrases, boolean ANDs?

I am trying to find a fast way in Python to check if a list of terms can be matched against strings ranging in size from 50 to 50,000 characters.

A term can be:

  • A word, eg. 'apple'
  • A phrase, eg. 'cherry pie'
  • Boolean ANDing of words and phrases, eg. 'sweet pie AND savoury pie AND meringue'

A match is where a word or phrase exists around word boundaries, so:

match(term='apple', string='An apple a day.') # True
match(term='berry pie', string='A delicious berry pie.') # True
match(term='berry pie', string='A delicious blueberry pie.') # False
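
For illustration, a minimal sketch of that word-boundary test (the match helper here is hypothetical, simply mirroring the calls above; re.escape guards against regex metacharacters in a term):

import re

def match(term, string):
    # True only if the term appears as a whole word/phrase, bounded by \b on both sides.
    return re.search(r'(?i)\b%s\b' % re.escape(term), string) is not None

assert match(term='apple', string='An apple a day.')
assert match(term='berry pie', string='A delicious berry pie.')
assert not match(term='berry pie', string='A delicious blueberry pie.')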

I currently have about 40 terms, most of them are simple words. The number of terms will increase over time, but I wouldn't expect it to get beyond 400.

I'm not interested in which term(s) a string matches, or where in the string it matches; I just need a true/false value for each string. It is much more likely that no terms will match, so for the roughly 1 in 500 strings that do match, I can store the string for further processing.

Speed is the most important criterion, and I'd like to leverage the existing code of those smarter than me rather than trying to implement a white paper. :)

So far the speediest solution I've come up with is:

import re

def data():
    return [
        "The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family (Rosaceae).",
        "This resulted in early armies adopting the style of hunter-foraging.",
        "Beef pie fillings are popular in Australia. Chicken pie fillings are too."
    ]

def boolean_and(terms):
    # Each term becomes a lookahead, so every term must be present (in any order).
    return '(%s)' % (''.join(['(?=.*\\b%s\\b)' % (term) for term in terms]))

def run():
    words_and_phrases = ['apple', 'cherry pie']
    booleans = [boolean_and(terms) for terms in [['sweet pie', 'savoury pie', 'meringue'], ['chicken pie', 'beef pie']]]
    regex = re.compile(r'(?i)(\b(%s)\b|%s)' % ('|'.join(words_and_phrases), '|'.join(booleans)))
    matched_data = list()
    for d in data():
        if regex.search(d):
            matched_data.append(d)
    return matched_data

The regex winds up as:

(?i)(\b(apple|cherry pie)\b|((?=.*\bsweet pie\b)(?=.*\bsavoury pie\b)(?=.*\bmeringue\b))|((?=.*\bchicken pie\b)(?=.*\bbeef pie\b)))

So all the terms are ORed together, case is ignored, the words/phrases are wrapped in \b for word boundaries, and the boolean ANDs use lookaheads so that all of their terms must be present, though not in any particular order.
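
As a quick sanity check, here is the same combined pattern applied to strings like the ones in data() (a sketch, assuming the regex built above):

import re

regex = re.compile(
    r'(?i)(\b(apple|cherry pie)\b'
    r'|((?=.*\bsweet pie\b)(?=.*\bsavoury pie\b)(?=.*\bmeringue\b))'
    r'|((?=.*\bchicken pie\b)(?=.*\bbeef pie\b)))')

# Matches via the plain word alternative 'apple':
assert regex.search('The apple is the pomaceous fruit of the apple tree.')
# No term matches:
assert not regex.search('This resulted in early armies adopting the style of hunter-foraging.')
# Matches via the chicken pie AND beef pie lookaheads:
assert regex.search('Beef pie fillings are popular in Australia. Chicken pie fillings are too.')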

Timeit results:

 print timeit.Timer('run()', 'from __main__ import run').timeit(number=10000)
 1.41534304619

Without the lookaheads (i.e. the boolean ANDs) this is really quick, but once they're added it slows down considerably.

Does anybody have ideas on how this could be improved? Is there a way to optimise the lookahead, or maybe an entirely different approach? I don't think stemming will work, as it tends to be a bit greedy with what it matches.

asked Mar 25 '11 by johanati

1 Answer

The boolean AND regex with the multiple lookahead assertions can be sped up considerably by anchoring them to the beginning of the string. Or better yet, use two regexes: one for the ORed list of terms using the re.search method, and a second regex with the boolean ANDed list using the re.match method like so:

import re

def boolean_and_new(terms):
    # One lookahead per term: lazy .*? and no capture group.
    return ''.join([r'(?=.*?\b%s\b)' % (term) for term in terms])

def run_new():
    words_and_phrases = ['apple', 'cherry pie']
    booleans = [boolean_and_new(terms) for terms in [
        ['sweet pie', 'savoury pie', 'meringue'],
        ['chicken pie', 'beef pie']]]
    regex1 = re.compile(r'(?i)\b(?:%s)\b' % ('|'.join(words_and_phrases)))
    regex2 = re.compile(r'(?i)%s' % ('|'.join(booleans)))
    matched_data = list()
    for d in data():
        if regex1.search(d) or regex2.match(d):
            matched_data.append(d)
    return matched_data
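
In case re.match looks too restrictive here: it only anchors the attempt at position 0, but each lookahead still scans forward through the whole string, so the ANDed terms may appear anywhere and in any order. A quick check with hypothetical sample strings:

import re

regex2 = re.compile(r'(?i)(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)')

# Both terms present, in either order and nowhere near position 0: still a match.
assert regex2.match('Try the beef pie and then the chicken pie.')
# One term missing: the single anchored attempt fails.
assert not regex2.match('Try the chicken pie.')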

The effective regexes for this data set are:

regex1 = r'(?i)\b(?:apple|cherry pie)\b'
regex2 = r'(?i)(?=.*?\bsweet pie\b)(?=.*?\bsavoury pie\b)(?=.*?\bmeringue\b)|(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)'

Note that the second regex effectively has a ^ anchor at the start since it's being used with the re.match method. This also includes a few extra (minor) tweaks: removing unnecessary capture groups and changing the greedy dot-star to lazy. This solution runs nearly 10 times faster than the original on my Win32 box running Python 3.0.1.
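
To reproduce the comparison, a timeit harness along the lines of the one in the question should do (assuming run and run_new are defined in the same script; absolute numbers will of course vary with machine and Python version):

import timeit

print(timeit.Timer('run()', 'from __main__ import run').timeit(number=10000))
print(timeit.Timer('run_new()', 'from __main__ import run_new').timeit(number=10000))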

Additional: So why is it faster? Let's look at a simple example which describes how the NFA regex "engine" works. (Note that the following description derives from the classic work on the subject: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3), which explains all this efficiency stuff in great detail - highly recommended.) Let's say you have a target string containing one word and you want a regex to see if that word is: "apple". Here are two regexes one might compose to do the job:

re1 = r'^apple'   # anchored at the start of the string
re2 = r'apple'    # unanchored
s = 'orange'      # the non-matching target string discussed below

If your target string is: apple (or applesauce or apple-pie etc.), then both regexes will successfully match very quickly. But if your target string is, say: orange, the situation is different. An NFA regex engine must try all possible permutations of the regex on the target string before it can declare match failure. The way the regex "engine" works is that it keeps an internal pointer to its current location within the target string, and a second pointer to a location within the regex pattern, and advances these pointers as it goes about its business. Note that these pointers point to locations between the characters, and to begin with, the target string pointer is set to the location before the first letter of the target string.

re1: The first token in the regex is the ^ start of string anchor. This "anchor" is one of the special "assertion" expressions which matches a location in a target string and does not actually match any characters. (Lookahead and lookbehind and the \b word boundary expressions are also assertions which match a location and do not "consume" any characters.) Ok, with the target string pointer initialized to the location before the first letter of the word orange, the regex engine checks if the ^ anchor matches, and it does (because this location is, in fact, the beginning of the string). So the pattern pointer is advanced to the next token in the regex, the letter a. (The target string pointer is not advanced). It then checks to see if the regex literal a matches the target string character o. It does not match. At this point, the regex engine is smart enough to know that the regex can never succeed at any other locations within the target string (because the ^ can never match anywhere but at the start). Thus it can declare match failure.

re2: In this case the engine begins by checking if the first pattern char a matches the first target char 'o'. Once again, it does not. However, in this case the regex engine is not done! It has determined that the pattern will not match at the first location, but it must try (and fail) at ALL locations within the target string before it can declare match failure. Thus, the engine advances the target string pointer to the next location (Friedl refers to this as the transmission "bump-along"). For each "bump along", it resets the pattern pointer back to the beginning. Thus it checks the first token in the pattern a against the second char in the string: r. This also does not match, so the transmission bumps along again to the next location within the string. At this point it tests the first char of the pattern a against the third char of the target: a, which does match. The engine advances both pointers and checks the second char in the regex p against the fourth character in the target n. This fails. At this point the engine declares failure at the location before the a in orange and then bumps along again to the n. It goes on like this until it fails at every location in the target string, at which point it can declare overall match failure.
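
This bump-along cost is especially painful for the lookahead pattern, because every failed attempt re-runs the lookahead scan over the rest of the string. A small, hypothetical micro-benchmark on a long subject string that never matches shows the difference between the single anchored attempt of re.match and the full bump-along of re.search:

import re
import timeit

regex2 = re.compile(r'(?i)(?=.*?\bchicken pie\b)(?=.*?\bbeef pie\b)')
s = 'orange ' * 1000   # a long subject string containing neither term

# match: one anchored attempt; the lookahead scans the string once and fails.
print(timeit.timeit(lambda: regex2.match(s), number=100))
# search: the transmission bumps along, re-running the lookahead scan at every position.
print(timeit.timeit(lambda: regex2.search(s), number=100))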

For long subject strings, this extra unnecessary work can take a lot of time. Crafting an accurate and efficient regex is equal parts art and science. And to craft a really great regex, one must understand exactly how the engine actually works under the hood. Gaining this knowledge requires time and effort, but the time spent will (in my experience) pay for itself many times over. And there really is only one good place to effectively learn these skills, and that is to sit down and study Mastering Regular Expressions (3rd Edition) and then practice the techniques learned. I can honestly say that this is, hands down, the most useful book I have ever read. (It's even funny!)

Hope this helps! 8^)

answered Oct 08 '22 by ridgerunner