
Python advanced string searching with operators and boolean

I have a function that searches for a string in a list of lists and returns a list of the matching sublists:

def foo(myList, keyword, first=True):
    if first:  # Search only the first element of each sublist
        return [x for x in myList if keyword in x[0]]
    else:  # Search the first and second elements of each sublist
        return [x for x in myList if keyword in x[0] or keyword in x[1]]

Now I want to extend it to handle advanced searching with queries like:

matchthis -butnothis -"and not this"

this|orthis|"or this"

brand new*laptop  # this is a wildcard, matches like: brand new dell laptop

"exact phrase"

Are there any Python modules (preferably built-in) that I can use in my function to handle these queries?

PS: I'm aware of Whoosh, but it's not the right fit for me at the moment. Also, I'm currently using App Engine.

What I'm basically trying to do is full-text search in memory, since App Engine doesn't support full-text search yet. I query the datastore, put the entities into lists, and loop through those lists to find query matches.

userBG asked Dec 30 '11 01:12


2 Answers

I would try constructing a regex for each portion of the search query. First you could break the query into the portions using shlex.split(), and then create each regex individually. Here is my crack at it:

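import re
import shlex

def foo(query):
    # shlex.split() keeps quoted phrases together as single pieces
    pieces = shlex.split(query)
    include, exclude = [], []
    for piece in pieces:
        if piece.startswith('-'):
            # A leading '-' marks a term that must NOT appear
            exclude.append(re.compile(piece[1:]))
        else:
            include.append(re.compile(piece))
    def validator(s):
        # Every included term must match and no excluded term may match
        return (all(r.search(s) for r in include) and
                not any(r.search(s) for r in exclude))
    return validator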

This will return a function that you can use to validate against the query, for example:

>>> test = foo('matchthis -butnothis -"and not this"')
>>> test("we should matchthis...")
True
>>> test("some stuff matchthis blah and not this...")
False
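
To tie this back to the list of lists in the question, the validator can be applied to the first one or two elements of each sublist. The following is only a rough sketch: make_validator, search, and rows are names made up for illustration, the validator is repeated so the block runs on its own, and it assumes every element is a string. Joining the two fields with a space is a shortcut, and it means a quoted phrase could in principle match across the field boundary.

```python
import re
import shlex

def make_validator(query):
    """Same approach as foo(query) above, repeated so this block is self-contained."""
    include, exclude = [], []
    for piece in shlex.split(query):
        if piece.startswith('-'):
            exclude.append(re.compile(piece[1:]))
        else:
            include.append(re.compile(piece))
    def validator(s):
        return (all(r.search(s) for r in include) and
                not any(r.search(s) for r in exclude))
    return validator

def search(my_list, query, first=True):
    """Return the sublists whose first (or first two) elements satisfy the query."""
    matches = make_validator(query)
    width = 1 if first else 2
    # Join the searched fields so the query is checked against all of them at once.
    return [row for row in my_list if matches(' '.join(row[:width]))]

# Hypothetical sample data standing in for datastore entities.
rows = [
    ['matchthis title', 'plain body'],
    ['matchthis title', 'body with butnothis'],
    ['other title', 'matchthis in body'],
]
```

With first=True only the titles are checked, so the exclusion in 'matchthis -butnothis' never sees the second field; with first=False the second row is rejected and the third one is picked up.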

You should be able to add in some wildcard handling by replacing * in the query with .* in the regex.
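
One possible way to do that, sketched under a few assumptions: each term is escaped with re.escape() so that user input such as c++ is not interpreted as regex syntax, * becomes .*, and | separates alternatives. The helper names term_to_regex and build_validator are invented for this example, and matching stays case-sensitive.

```python
import re
import shlex

def term_to_regex(term):
    """Compile one query term: '|' separates alternatives, '*' is a wildcard,
    and everything else is matched literally."""
    alternatives = []
    for alt in term.split('|'):
        if not alt:
            continue  # skip empty alternatives such as the one in 'a|'
        # Escape each literal chunk, then glue the chunks together with '.*'.
        alternatives.append('.*'.join(re.escape(chunk) for chunk in alt.split('*')))
    return re.compile('|'.join('(?:%s)' % a for a in alternatives))

def build_validator(query):
    include, exclude = [], []
    for piece in shlex.split(query):
        if piece.startswith('-') and len(piece) > 1:
            exclude.append(term_to_regex(piece[1:]))
        else:
            include.append(term_to_regex(piece))
    def validator(s):
        return (all(r.search(s) for r in include) and
                not any(r.search(s) for r in exclude))
    return validator

laptop = build_validator('brand new*laptop')
either = build_validator('this|orthis|"or this"')
```

Note that shlex.split() raises ValueError on unbalanced quotes, so a query containing a stray apostrophe or quotation mark would need to be caught or cleaned up before it reaches this function.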

Andrew Clark answered Oct 29 '22 17:10


There's no one standard library module that does all of what you want; however, you can start with the shlex module to parse the search groups:

>>> import shlex
>>> s = '''matchthis -butnothis -"and not this"
this|orthis|"or this"
brand new*laptop
"exact phrase"
'''
>>> shlex.split(s)
['matchthis', '-butnothis', '-and not this', 'this|orthis|or this', 'brand', 'new*laptop', 'exact phrase']

You can also look at the re module in case you need more fine-grained control over the parsing.
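
As an illustration of what that finer control might look like, here is a small regex-based tokenizer that records whether a term was negated and whether it was a quoted phrase, which shlex.split() discards. It is only a sketch: TOKEN and tokenize are hypothetical names, and it does not handle quotes embedded inside a | group such as this|"or this".

```python
import re

# One token: an optional leading '-', then either a double-quoted phrase
# or a run of non-whitespace characters.
TOKEN = re.compile(r'(?P<neg>-)?(?:"(?P<phrase>[^"]*)"|(?P<word>\S+))')

def tokenize(query):
    """Return a list of (negated, is_phrase, text) tuples, one per query term."""
    tokens = []
    for m in TOKEN.finditer(query):
        negated = m.group('neg') is not None
        phrase = m.group('phrase')
        if phrase is not None:
            tokens.append((negated, True, phrase))
        else:
            tokens.append((negated, False, m.group('word')))
    return tokens

tokens = tokenize('matchthis -butnothis -"and not this" "exact phrase"')
```

Unlike shlex.split(), this version does not raise on an unclosed quote; the dangling quote simply stays part of an ordinary word.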

Raymond Hettinger answered Oct 29 '22 17:10