I don't know how to deal with synonyms which contains a space! I have the following config:
The SOLR config file
<fieldType ... >
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
catenateWords="1"
preserveOriginal="1"
splitOnCaseChange="1"
generateWordParts="1"
generateNumberParts="1"
catenateNumbers="1"
catenateAll="1"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30" side="front"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="2" max="70" />
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
My file: syn.txt
st., st => saint
istambul => istanbul
airport, apt => aéroport
NYC => New York
pt., pt => port
brussels => bruxelles
Everything was working fine except the synonym:
"NYC => New York"
I did some research and I found the following:
Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (ie: "sea biscuit, sea biscit, seabiscuit")
The recommended approach for dealing with synonyms like this, is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:
The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect.
This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term.
For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occuring in a document
So I tried to changed my config file and to add my filters at the indexing but it is not working.
Did someone have any ideas?
You are doing explicit mapping with =>
.
The Solr documentation says
Explicit mappings match any token sequence on the LHS of "=>" and replace with all alternatives on the RHS. These types of mappings ignore the expand parameter in the schema.
So I am guessing that if you search for NYC
you get nothing back, since it got replaced with New York
at index time.
Instead, can you try declaring them as equivalent synonyms? i.e. like
NYC, New York
instead of NYC => New York
.
Then I believe you can search for either of them and the result will be the same.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With