
Searching and indexing hyphenated words with Whoosh

Tags: python, whoosh

I am using Whoosh to index and search a large number of documents, and many of the terms I need to search for are hyphenated. Whoosh seems to treat hyphens as a special character of some kind, but for the life of me I can't figure out its behavior.

Can anyone advise on how Whoosh treats hyphens while indexing and searching?

Jeremy Watson asked Oct 30 '22 06:10


1 Answer

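With the default `StandardAnalyzer`, Whoosh treats most punctuation, including hyphens, as a token separator, much like a space. (The one exception is a period between word characters, which stays inside the token.) Assuming the default AND grouping, the query `dual-scale thermometer` is equivalent to `dual AND scale AND thermometer`. This will find a document containing "dual-scale digital thermometer", but it will also find "dual purpose bathroom scale with thermometer".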
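To see why both documents match, it helps to look at the tokens. The sketch below does not call Whoosh at all; it mimics the tokenization with plain `re`, assuming the default pattern is `\w+(\.?\w+)*` followed by lowercasing, and it ignores stop-word removal. The helper names `tokenize` and `matches_all` are made up for illustration.

```python
import re

# Assumption: this mirrors the default StandardAnalyzer token pattern.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")

def tokenize(text, pattern=DEFAULT_PATTERN):
    """Return lowercased tokens; anything the pattern skips acts as a separator."""
    return [m.group(0).lower() for m in pattern.finditer(text)]

def matches_all(query, document):
    """Crude stand-in for an AND search: every query token must be in the document."""
    return set(tokenize(query)) <= set(tokenize(document))

docs = [
    "dual-scale digital thermometer",
    "dual purpose bathroom scale with thermometer",
]

query = "dual-scale thermometer"
query_tokens = tokenize(query)
hits = [d for d in docs if matches_all(query, d)]

print(query_tokens)  # the hyphen has split "dual-scale" into two tokens
print(hits)          # both documents contain all three tokens
```

This is only a model of the behavior, but it shows the key point: once the hyphen is gone, nothing ties "dual" and "scale" together.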

One way to avoid this is to turn the hyphenated words in your query into phrases: `"dual-scale" thermometer`, which is equivalent to `"dual scale" AND thermometer`.

You could also force Whoosh to accept hyphens as part of a word by overriding the RegexTokenizer expression in the StandardAnalyzer with a regular expression that accepts hyphens as a valid part of a token.

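    from whoosh import fields, analysis

    # Allow hyphens inside tokens so "dual-scale" is indexed as one term
    myanalyzer = analysis.StandardAnalyzer(expression=r'[\w-]+(\.?\w+)*')
    # The field's analyzer is applied both when indexing and when parsing queries
    schema = fields.Schema(myfield=fields.TEXT(analyzer=myanalyzer))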

Now a search for `dual-scale thermometer` is equivalent to `dual-scale AND thermometer`; it will find "dual-scale digital thermometer" but not "dual purpose bathroom scale with thermometer".

However, you will no longer be able to search for the parts of a hyphenated word independently. If your document contained "high-quality components", a search for `quality` would not match it; only `high-quality` would, because that is now a single token. Because of this side effect, unless your content strictly limits hyphens to truly atomic hyphenated words, I would recommend the phrase approach.

Steven answered Nov 15 '22 07:11