
Solr compound word tokenizer - results treated as OR statement

Tags: filter, solr

The Dutch and German languages have words that can be combined into new words: compound words.

For example, "accountmanager" is considered one word, compounded from the words "account" and "manager". Our users will use both "accountmanager" and "account manager" in documents and queries, and expect the same results for both queries.

To be able to decompound (split) words, Solr has a dictionary filter that I have configured in the schema:

<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="../../compound-word-dictionary.txt" minWordSize="8" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>

The compound-word-dictionary.txt file holds a list of words that are used to decompound compounded words. In this list you will find for example the words "account" and "manager".

The decompounding result looks fine when analyzed in the Solr debugger. Searching with the query "accountmanager" yields these terms (term text):

  • accountmanager
  • account
  • manager

This result, however, is treated as an OR statement, and finds all documents that contain at least one of the terms. I want it to behave like an AND statement (so I want only the results that have both the terms "account" and "manager" in the document).

I have tried setting the defaultOperator in the schema to "AND", but this is ignored when using edismax. So I set min-should-match to 100% (mm=100%) as suggested, again without the desired result. Tweaking the attributes of the dictionary filter in the schema does not change the behavior to "AND" either.

Has anybody come across this behavior when using the dictionary compound word token filter factory, and does anyone know a way to make it behave like an AND statement?
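For reference, a minimal sketch of how the edismax and mm settings described above might be configured as handler defaults in solrconfig.xml. The handler name "/select" and the choice to set them as defaults rather than pass them as request parameters are assumptions for illustration:

```xml
<!-- Sketch only: handler name and use of "defaults" are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the extended dismax query parser. -->
    <str name="defType">edismax</str>
    <!-- Require all optional clauses to match (min-should-match). -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```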

Sebastiaan Ordelman asked Jun 18 '12 09:06


1 Answer

It is working as expected. The DictionaryCompoundWordTokenFilterFactory just adds the 'inner words' it finds, in this case both 'account' and 'manager'. It could have been just one: if the word were 'accountbanana' and 'banana' were not in the dictionary, only 'account' would have been added.

This serves the purpose of letting someone who searches for 'manager' also find the document that contains 'accountmanager'.

To get the behaviour you want (I understand you are applying this on the query side), you could use a dictionary that maps accountmanager to "account manager".
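One possible reading of that suggestion, as a rough sketch: keep the compound filter at index time and use a synonym mapping at query time instead. The field type name "text_compound", the file name "compound-synonyms.txt", and the tokenizer and lowercase filter choices are all assumptions, not taken from the question's schema.

```xml
<!-- Hypothetical field type; name, tokenizer and synonym file are assumptions. -->
<fieldType name="text_compound" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: keep decompounding so "manager" still finds "accountmanager". -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="../../compound-word-dictionary.txt" minWordSize="8" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
  </analyzer>
  <!-- Query side: replace the compound with its parts via an explicit mapping.
       compound-synonyms.txt (hypothetical) would contain lines such as:
         accountmanager => account manager -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="compound-synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
```

Whether the two resulting query tokens end up as required clauses or as a phrase depends on the Solr version and query parser settings, so it is worth checking the parsed query in the analysis and debug output before relying on this.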

Persimmonium answered Nov 14 '22 21:11