I'm trying to use n-grams to get "autocomplete-style" searches using Whoosh. Unfortunately I'm a little confused. I have made an index like this:
import os
from whoosh.index import create_in, open_dir

if not os.path.exists("index"):
    os.mkdir("index")
    create_in("index", schema)
ix = open_dir("index")

writer = ix.writer()
q = MyTable.select()
for item in q:
    print('adding %s' % item.Title)
    writer.add_document(title=item.Title, content=item.content, url=item.URL)
writer.commit()
I then search the title field like this:
from whoosh.qparser import QueryParser

querystring = 'my search string'
parser = QueryParser("title", ix.schema)
myquery = parser.parse(querystring)

with ix.searcher() as searcher:
    results = searcher.search(myquery)
    print(len(results))
    for r in results:
        print(r)
and that works great. But I want to use this for autocomplete, and it doesn't match partial words (e.g. searching for "ant" would return "ant", but not "antelope" or "anteater"). That of course greatly hampers using it for autocomplete. The Whoosh page says to use this:
analyzer = analysis.NgramWordAnalyzer()
title_field = fields.TEXT(analyzer=analyzer, phrase=False)
schema = fields.Schema(title=title_field)
But I'm confused by that. It seems to be just "the middle" of the process: when I build my index, do I have to declare the title field as an NGRAM field (instead of TEXT)? And how do I run the search so that searching for "ant" gives me ["ant", "anteater", "antelope"], etc.?
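My best guess at how those lines fit together is something like the following, though I'm not sure it's right (in the version of Whoosh I have, NgramWordAnalyzer seems to require a minsize argument, so the sizes here are guesses):

from whoosh import analysis, fields
from whoosh.index import create_in

# guess: build the whole schema around the n-gram title field,
# then create the index from it as usual
analyzer = analysis.NgramWordAnalyzer(minsize=2, maxsize=10)
schema = fields.Schema(
    title=fields.TEXT(analyzer=analyzer, phrase=False, stored=True),
    content=fields.TEXT(stored=True),
    url=fields.ID(stored=True),
)
ix = create_in("index", schema)

If that's right, I'd expect the same title search as above to start matching partial words, since the query text should get broken into the same n-grams, but I haven't been able to confirm it.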
I solved this problem by creating two separate fields: one for the actual search and one for the suggestions. The NGRAM or NGRAMWORDS field type can be used for "fuzzy search" functionality. In your case it would be something like this:
from whoosh.fields import Schema, TEXT, ID, NGRAMWORDS

# not sure how your schema looks exactly
schema = Schema(
    title=NGRAMWORDS(minsize=2, maxsize=10, stored=True, field_boost=1.0,
                     tokenizer=None, at='start', queryor=False, sortable=False),
    content=TEXT(stored=True),
    url=ID(stored=True),
    spelling=TEXT(stored=True, spelling=True))  # typeahead field
import os
from whoosh.index import create_in, open_dir

if not os.path.exists("index"):
    os.mkdir("index")
    create_in("index", schema)
ix = open_dir("index")

writer = ix.writer()
q = MyTable.select()
for item in q:
    print('adding %s' % item.Title)
    writer.add_document(title=item.Title, content=item.content, url=item.URL)
    writer.add_document(spelling=item.Title)  # add the item title to the typeahead field
    self.addContentToSpelling(writer, item.content)  # add some content words to the typeahead field if needed, the same way as above (see sketch below)
writer.commit()
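addContentToSpelling is not a Whoosh method, just a placeholder for whatever logic you want for feeding content words into the typeahead field. A rough sketch of what it might do (the word filter here is arbitrary, adapt it to your data):

def addContentToSpelling(self, writer, content):
    # naive example: add each sufficiently long word from the content
    # to the spelling/typeahead field, one document per word
    for word in set(content.split()):
        if len(word) > 3:
            writer.add_document(spelling=word)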
Then, for the search:
origQueryString = 'my search string'
words = self.splitQuery(origQueryString)  # use tokenizers/analyzers or your own implementation (see sketch below)
queryString = origQueryString  # would be better to actually build a query object

with ix.searcher() as searcher:
    corrector = searcher.corrector("spelling")
    for word in words:
        suggestionList = corrector.suggest(word, limit=self.limit)
        for suggestion in suggestionList:
            queryString = queryString + " " + suggestion  # would be better to actually build a query object

    parser = QueryParser("title", ix.schema)
    myquery = parser.parse(queryString)
    results = searcher.search(myquery)
    print(len(results))
    for r in results:
        print(r)
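splitQuery is likewise your own helper. The simplest version just splits on whitespace; reusing one of Whoosh's analyzers would be more robust:

def splitQuery(self, queryString):
    # simplest possible version: whitespace split, lowercased;
    # a real implementation could run the field's analyzer instead
    return [w.lower() for w in queryString.split()]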
Hope you get the idea.