
Generating search term suggestions with Whoosh?

I've got a set of documents in a Whoosh index, and I want to provide a search term suggestion feature. If you type "pop", some suggestions that could come up might be:

  • popcorn
  • popular
  • pope
  • Poplar Film
  • pop culture

I've got the terms that should be coming up as suggestions going into an NGRAMWORDS field in my index, but when I do a query on that field I get autocompleted results rather than the expanded suggestions - so I get documents tagged with "pop culture", but no way to show that term to the user. (For comparison, I'd do this in ElasticSearch using a completion mapping on that field and then use the _suggest endpoint to get the suggestions.)

I can only find examples of autocomplete or spelling correction in the documentation and elsewhere on the web. Is there any way I can get search term suggestions from my index with Whoosh?

Edit: expand_prefix was a much-needed pointer in the right direction. I've ended up using a KEYWORD(commas=True, lowercase=True) for my suggest field, and code like this to get suggestions in most-common-first order (expand_prefix and iter_prefix will yield them in alphabetical order):

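from operator import itemgetter

def get_suggestions(term):
    # ix is the already-open Whoosh index
    with ix.reader() as r:
        # iter_prefix yields (text, terminfo) pairs for terms starting with `term`
        suggestions = [(s[0], s[1].doc_frequency()) for s in r.iter_prefix('suggest', term)]
    return sorted(suggestions, key=itemgetter(1), reverse=True)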
babbageclunk asked Oct 24 '25 18:10


2 Answers

Term Frequency Functions

I want to add to the answers here that Whoosh has a built-in function that returns the top 'number' terms by term frequency. It is described in the Whoosh docs:

whoosh.reading.IndexReader.most_frequent_terms(fieldname, number=5, prefix='')

tf-idf vs. frequency

On the same page of the Whoosh docs, right above the previous function, is one that returns the most distinctive terms rather than the most frequent. It ranks by tf-idf score, which is effective at pushing down common but insignificant words like 'the'. Whether this is more or less useful depends on what you are looking for. It is appropriately named:

whoosh.reading.IndexReader.most_distinctive_terms(fieldname, number=5, prefix='')

Each of these would be used in this fashion:

with ix.reader() as r:
    print(r.most_frequent_terms('suggestions', number=5, prefix='pop'))
    print(r.most_distinctive_terms('suggestions', number=5, prefix='pop'))
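
To make the difference between the two rankings concrete, here is a minimal pure-Python sketch that does not use Whoosh at all. The term statistics are made up, the function names are hypothetical, and the tf-idf formula (total term weight times log(N / document frequency)) is an assumption about the general technique; Whoosh's exact scoring may differ.

```python
from heapq import nlargest
from math import log

# Made-up statistics: term -> (total occurrences, number of documents containing it)
term_stats = {
    "pop": (40, 10),
    "popcorn": (6, 2),
    "popular": (30, 10),
    "pope": (3, 1),
}
total_docs = 10  # assumed size of the collection


def most_frequent(stats, number=5, prefix=""):
    """Rank terms by raw frequency; returns (frequency, term) tuples."""
    scored = ((weight, term) for term, (weight, _df) in stats.items()
              if term.startswith(prefix))
    return nlargest(number, scored)


def most_distinctive(stats, n_docs, number=5, prefix=""):
    """Rank terms by an assumed tf-idf score; returns (score, term) tuples."""
    scored = ((weight * log(n_docs / df), term)
              for term, (weight, df) in stats.items()
              if term.startswith(prefix))
    return nlargest(number, scored)


print(most_frequent(term_stats, number=2, prefix="pop"))
print(most_distinctive(term_stats, total_docs, number=2, prefix="pop"))
```

In this toy data, "pop" and "popular" occur in every document, so they lead the frequency ranking but score zero on tf-idf, while the rarer "popcorn" and "pope" rise to the top of the distinctive ranking.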

Multi-Word Suggestions

I have also had problems with multi-word suggestions. My solution was to define the schema like this:

from whoosh import fields

schema = fields.Schema(suggestions=fields.TEXT(),
                       suggestion_phrases=fields.KEYWORD(commas=True, lowercase=True))

In the suggestion_phrases field, commas=True makes commas the keyword separator, so a keyword can contain spaces and therefore span multiple words. lowercase=True ignores capitalization (remove it if you need to distinguish between capitalized and non-capitalized terms). To get both single-word and multi-word suggestions, run either most_frequent_terms() or most_distinctive_terms() on both fields and combine the results.
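
One possible way to combine the two result lists is sketched below. It assumes each call returned a list of (score, term) tuples and that terms may come back as bytes depending on the Whoosh version. The helper name combine_suggestions and the sample data are hypothetical, not part of Whoosh.

```python
def combine_suggestions(*result_lists, limit=5):
    """Merge lists of (score, term) tuples into one ranked list of terms."""
    best = {}
    for results in result_lists:
        for score, term in results:
            # Assumption: terms may be bytes; decode so duplicates compare equal.
            if isinstance(term, bytes):
                term = term.decode("utf-8")
            # Keep the highest score seen for each term.
            best[term] = max(score, best.get(term, 0))
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _score in ranked[:limit]]


# Made-up results standing in for the two reader calls.
single_words = [(12.0, b"popular"), (7.0, b"popcorn")]
phrases = [(9.0, b"pop culture"), (3.0, b"popular film")]
print(combine_suggestions(single_words, phrases))
```

Raw frequencies from two fields can be compared fairly directly, but tf-idf scores computed per field may not be on the same scale, so a merged ranking of that kind is best treated as approximate.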

Phillip Martin answered Oct 27 '25 07:10


This is not exactly what you are looking for, but it may help:

reader = index.reader()
# expand_prefix yields every term in the 'title' field that starts with 'pop'
for x in reader.expand_prefix('title', 'pop'):
    print(x)

Output example:

pop
popcorn
popular
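
The alphabetical order of that output is what you would expect from a prefix scan over a sorted term list. The following is a conceptual pure-Python sketch of that idea, not Whoosh's implementation. The function name expand_prefix_sketch and the sample terms are made up for illustration.

```python
from bisect import bisect_left


def expand_prefix_sketch(sorted_terms, prefix):
    """Yield every term in a sorted list that starts with prefix, in order."""
    # Binary search for the first term that is >= prefix.
    start = bisect_left(sorted_terms, prefix)
    for term in sorted_terms[start:]:
        # All matching terms are contiguous, so stop at the first mismatch.
        if not term.startswith(prefix):
            break
        yield term


terms = sorted(["pop", "popcorn", "pope", "popular", "python", "poem"])
print(list(expand_prefix_sketch(terms, "pop")))
```

Because matches come out in dictionary order rather than by popularity, a separate sort by frequency is needed when the most common suggestion should appear first.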

Update

Another workaround is to build a separate index that contains only the keywords, as a TEXT field, and experiment with the query language. Here is what I could achieve:

In [12]: list(ix.searcher().search(qp.parse('pop*')))
Out[12]: 
[<Hit {'keywords': u'popcorn'}>,
 <Hit {'keywords': u'popular'}>,
 <Hit {'keywords': u'pope'}>,
 <Hit {'keywords': u'Popular Film'}>,
 <Hit {'keywords': u'pop culture'}>]
Dmitry Nedbaylo answered Oct 27 '25 07:10


