
How to use n-grams in whoosh

I'm trying to use n-grams to get "autocomplete-style" searches using Whoosh. Unfortunately I'm a little confused. I have made an index like this:

import os
from whoosh.index import create_in, open_dir

if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

ix = open_dir("index")

writer = ix.writer()
q = MyTable.select()
for item in q:
    print('adding %s' % item.Title)
    writer.add_document(title=item.Title, content=item.content, url=item.URL)
writer.commit()

I then search it for the title field like this:

from whoosh.qparser import QueryParser

querystring = 'my search string'

parser = QueryParser("title", ix.schema)
myquery = parser.parse(querystring)

with ix.searcher() as searcher:
    results = searcher.search(myquery)
    print(len(results))

    for r in results:
        print(r)

and that works great. But I want to use this for autocomplete, and it doesn't match partial words (e.g. searching for "ant" returns "ant", but not "antelope" or "anteater"). That, of course, greatly hampers its use for autocomplete. The Whoosh documentation says to use this:

from whoosh import analysis, fields

analyzer = analysis.NgramWordAnalyzer()
title_field = fields.TEXT(analyzer=analyzer, phrase=False)
schema = fields.Schema(title=title_field)

But I'm confused by that. It seems to be just "the middle" of the process. When I build my index, do I have to declare the title field as an NGRAM field (instead of TEXT)? And how do I make the search, so that when I search for "ant" I get ["ant", "anteater", "antelope"], etc.?
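To see why an n-gram analyzer makes partial matches possible, here is a plain-Python illustration (not Whoosh itself) of what n-gram tokenization does: each word is expanded into all substrings of lengths `minsize`..`maxsize`, and those substrings become index terms, so a short prefix like "ant" is itself a term stored for longer words:

```python
def ngrams(word, minsize=2, maxsize=4):
    """Return all substrings of word with length minsize..maxsize."""
    grams = set()
    for n in range(minsize, maxsize + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

# "ant" appears among the 3-grams of both "anteater" and "antelope",
# so a query term "ant" matches documents containing either word.
print("ant" in ngrams("anteater"))  # True
print("ant" in ngrams("antelope"))  # True
```

This is only a sketch of the idea; Whoosh's `NgramWordAnalyzer` does the same kind of expansion at indexing time, which is why the query side can stay a normal term search.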

Asked Oct 02 '22 by Alex S

1 Answer

I solved this problem by creating two separate fields: one for the actual search and one for the suggestions. The NGRAM or NGRAMWORDS field types can be used for "fuzzy search" functionality. In your case it would be something like this:

from whoosh.fields import Schema, NGRAMWORDS, TEXT, ID

# not sure what your schema looks like exactly
schema = Schema(
    title=NGRAMWORDS(minsize=2, maxsize=10, stored=True, field_boost=1.0,
                     tokenizer=None, at='start', queryor=False, sortable=False),
    content=TEXT(stored=True),
    url=ID(stored=True),
    spelling=TEXT(stored=True, spelling=True))  # typeahead field

if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

ix = open_dir("index")

writer = ix.writer()
q = MyTable.select()
for item in q:
    print('adding %s' % item.Title)
    writer.add_document(title=item.Title, content=item.content, url=item.URL)
    writer.add_document(spelling=item.Title)  # add the item title to the typeahead field
    self.addContentToSpelling(writer, item.content)  # some method that adds selected content words to the typeahead field, the same way as above
writer.commit()

Then, for the search:

origQueryString = 'my search string'
words = self.splitQuery(origQueryString)  # use tokenizers / analyzers or a self-implemented splitter
queryString = origQueryString  # would be better to actually build a query object
corrector = ix.searcher().corrector("spelling")
for word in words:
    suggestionList = corrector.suggest(word, limit=self.limit)
    for suggestion in suggestionList:
        queryString = queryString + " " + suggestion  # would be better to actually build a query object

parser = QueryParser("title", ix.schema)
myquery = parser.parse(queryString)

with ix.searcher() as searcher:
    results = searcher.search(myquery)
    print(len(results))

    for r in results:
        print(r)

Hope you get the idea.
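The suggest-and-expand flow above can be sketched in plain Python. This is only an illustration of the approach, with `difflib.get_close_matches` standing in for Whoosh's corrector and a hypothetical vocabulary standing in for the "spelling" field:

```python
import difflib

# Hypothetical typeahead vocabulary standing in for the "spelling" field.
vocabulary = ["ant", "anteater", "antelope", "antique", "zebra"]

def expand_query(query, limit=3):
    """Expand each query word with close matches from the vocabulary,
    mirroring the corrector.suggest() loop in the answer above."""
    words = query.split()
    expanded = list(words)
    for word in words:
        # difflib stands in here for ix.searcher().corrector("spelling").suggest(word)
        for suggestion in difflib.get_close_matches(word, vocabulary, n=limit):
            if suggestion not in expanded:
                expanded.append(suggestion)
    return " ".join(expanded)

# A misspelled query gets widened with near matches before searching.
print(expand_query("anteatr"))
```

The expanded string is then handed to `QueryParser` as in the answer; building a proper OR query from the suggestions instead of string concatenation would be cleaner, as the answer's comments note.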

Answered Oct 05 '22 by Terran