I am using Whoosh to index and search a large number of documents, and many of the things I need to search on are hyphenated. Whoosh seems to treat hyphens as some kind of special character, but for the life of me I can't figure out its behavior.
Can anyone advise on how Whoosh treats hyphens while indexing and searching?
Whoosh simply treats all punctuation as a space. Assuming a default AND search, the query dual-scale thermometer is equivalent to dual AND scale AND thermometer. This will find a document containing dual-scale digital thermometer, but it will also find dual purpose bathroom scale with thermometer.
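You can see this behaviour by running the query text through a QueryParser; the schema and field name (description) below are just placeholders for illustration:

from whoosh import fields
from whoosh.qparser import QueryParser

# Hypothetical schema with one TEXT field using the default StandardAnalyzer
parser = QueryParser("description", fields.Schema(description=fields.TEXT))

# The default analyzer drops the hyphen, so this behaves like
# description:dual AND description:scale AND description:thermometer
print(parser.parse(u"dual-scale thermometer"))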
One way to avoid this is to turn the hyphenated words in your query into phrases: "dual-scale" thermometer, which is equivalent to "dual scale" AND thermometer.
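Quoting the hyphenated word makes the parser's PhrasePlugin (enabled by default) build a phrase query instead. Reusing the parser from the sketch above:

# Roughly description:"dual scale" AND description:thermometer --
# "dual" and "scale" must now appear next to each other, in that order
print(parser.parse(u'"dual-scale" thermometer'))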
You could also force Whoosh to accept hyphens as part of a word. You do this by overriding the RegexTokenizer expression in the StandardAnalyzer with a regular expression that accepts hyphens as a valid part of a token.
from whoosh import fields, analysis
# Same pattern as the default tokenizer, but with "-" allowed inside a token
myanalyzer = analysis.StandardAnalyzer(expression=r'[\w-]+(\.?\w+)*')
schema = fields.Schema(myfield=fields.TEXT(analyzer=myanalyzer))
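If you want to check what the modified analyzer does, Whoosh analyzers are callable and yield Token objects, so you can inspect the tokens directly (a quick sketch):

# The hyphenated word survives as a single token
print([t.text for t in myanalyzer(u"Dual-scale digital thermometer")])
# ['dual-scale', 'digital', 'thermometer']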
Now a search for dual-scale thermometer is equivalent to dual-scale AND thermometer and will find dual-scale digital thermometer, but not dual purpose bathroom scale with thermometer.
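The same check can be made at the query-parsing level with the modified schema (a sketch; myfield is the field defined above):

from whoosh.qparser import QueryParser

parser = QueryParser("myfield", schema)
# Roughly myfield:dual-scale AND myfield:thermometer -- "dual-scale" stays one term
print(parser.parse(u"dual-scale thermometer"))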
However, you won't be able to match the parts of hyphenated words independently. If your document contained high-quality components, a search for quality would not match it; only high-quality would, because the hyphenated word has become a single token. Because of this side effect, unless your content strictly limits its use of hyphens to truly atomic hyphenated words, I would recommend the phrase approach.
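For completeness, here is a small end-to-end sketch of that side effect using Whoosh's in-memory RamStorage; it reuses the hyphen-preserving analyzer and the placeholder field name myfield from above:

from whoosh import analysis, fields
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser

# Hyphen-preserving analyzer and schema, as defined earlier
myanalyzer = analysis.StandardAnalyzer(expression=r'[\w-]+(\.?\w+)*')
schema = fields.Schema(myfield=fields.TEXT(analyzer=myanalyzer))

# An in-memory index is enough for a quick experiment
ix = RamStorage().create_index(schema)
writer = ix.writer()
writer.add_document(myfield=u"high-quality components")
writer.commit()

parser = QueryParser("myfield", schema)
with ix.searcher() as searcher:
    print(len(searcher.search(parser.parse(u"quality"))))       # 0 -- "quality" alone no longer matches
    print(len(searcher.search(parser.parse(u"high-quality"))))  # 1 -- the whole hyphenated token does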