Multi-word synonyms in Solr

Tags: solr, synonym

I'm trying to implement multi-word synonyms in Solr, specifically of the type

msc divina => divina

So, if a user enters "msc divina", Solr should return results for "divina" only.

The definition in schema.xml looks like this:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100" 
    autoGeneratePhraseQueries="true">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms_de.txt"
            ignoreCase="true"
            expand="false" />
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_de.txt"
            enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.KeywordMarkerFilterFactory" 
            protected="protwords_de.txt" />
        <filter class="solr.SnowballPorterFilterFactory" language="German2" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_de.txt"
            enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.KeywordMarkerFilterFactory" 
            protected="protwords_de.txt" />
        <filter class="solr.SnowballPorterFilterFactory" language="German2" />
    </analyzer>
</fieldType>

It doesn't work. If I add a synonym filter to the query analyzer, a search on "msc divina" returns every hit for "msc" and "divina".

How can I solve this?

asked Nov 12 '13 11:11

midnig

3 Answers

Starting with Solr 6.4, you need to use solr.SynonymGraphFilterFactory for multi-word synonyms. The Solr documentation describes it as follows:

This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a replacement for the Synonym Filter, which produces incorrect graphs for multi-token synonyms.

If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Synonym Filter, because the indexer can’t directly consume a graph. To get fully correct positional queries when your synonym replacements are multiple tokens, you should instead apply synonyms using this filter at query time.

Example of the analyzer for index time:

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
  <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
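
For the query-time setup that the quoted documentation recommends, a matching analyzer might look like the sketch below. It is only an illustration, not a tested configuration: it reuses the tokenizer and the mysynonyms.txt file name from the index-time example, and it assumes that file contains a rule such as the one from the question.

```xml
<!-- Sketch only: query-side counterpart of the index-time analyzer above -->
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Assumes mysynonyms.txt contains a rule such as: msc divina => divina -->
  <!-- No FlattenGraphFilterFactory here: the query parser can consume the token graph -->
  <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Whether the multi-word rule actually fires also depends on the query parser handing the whole input to the analyzer rather than one whitespace-separated word at a time.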

Because token streams are now graphs, the filter produces the proper arcs for multi-word synonyms. For example, take a synonyms file with these rules:

fast => speedy
wi fi => wifi
wi fi network => hotspot

In this case, multi-word synonyms work properly.

Reference: Mike McCandless's blog post, http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

answered Sep 30 '22 06:09

Mysterion

From the Solr documentation:

Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (ie: "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:

The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.

Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occurring in a document.

This passage describes one problem: you cannot search for sea biscit and get a match on an indexed seabiscuit unless you use expand=true. It also explains what happens at query time with a multi-word query, which is your case:

msc divina -> msc | divina - phrase query

which will match both msc and divina documents. If you can specify at query time that you are searching for the phrase "msc divina" (in quotes), it will work.

Otherwise, you need either a multi-word-aware tokenizer at query time, or you can extend the FieldQParser plugin to do this for you. You can find more here.
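
On newer Solr versions, the whitespace pre-splitting described above can reportedly be switched off with the sow ("split on whitespace") request parameter, so the query analyzer sees "msc divina" as one string. Here is a minimal sketch for solrconfig.xml. It assumes Solr 6.5 or later (where, to my knowledge, sow was introduced, with false becoming the default in 7.0), the edismax parser, and a hypothetical default field named text; check it against your version before relying on it.

```xml
<!-- Sketch only: handler defaults are illustrative, and the field name "text" is hypothetical -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- sow=false: do not split the query on whitespace before analysis -->
    <str name="sow">false</str>
    <str name="df">text</str>
  </lst>
</requestHandler>
```

The same parameter can also be passed per request (sow=false), which is a convenient way to try it out before changing handler defaults.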

answered Sep 30 '22 08:09

Ion Cojocaru

Here is a solution you will find on the internet: https://dzone.com/articles/solution-multi-term-synonyms

Other than that, my solution to this problem was domain-specific. In my case, I was certain about my query lengths (i.e. less than 200, or only 5-10 words).

1. I replaced spaces with underscores in the synonym entries. Here is one of my synonym entries:
```
"like_to":["love_to","loves_to","need_to","needs_to"]
```
2. Use KeywordTokenizerFactory to send the full query for filtering:
```
<tokenizer class="solr.KeywordTokenizerFactory"/>
```
3. Use ShingleFilterFactory to index/query all possible sub-phrases of sizes between minShingleSize and maxShingleSize:
```
<filter class="solr.ShingleFilterFactory" minShingleSize="2" outputUnigrams="true" maxShingleSize="3"/>
```
4. Then use PatternReplaceCharFilterFactory to replace whitespace with underscores (_):
```
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement="_"/>
```
5. Use your synonym filter factory.

Example

Query: I love to travel

Tokens: I love, I love to, love to, love to travel, to travel, travel

Replaced with _: I_love, I_love_to, love_to, love_to_travel, to_travel, travel

Synonym filter turns these into: I_love, I_love_to, like_to, love_to_travel, to_travel, travel

So, it will eventually change the phrase "love to" to "like to".

Hope this trick helps, although it involves costly operations.
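
The steps above can be sketched as a single field type. This is only a sketch under assumptions, not the exact configuration from this answer: it swaps KeywordTokenizerFactory and the char filter for WhitespaceTokenizerFactory plus the tokenSeparator attribute of ShingleFilterFactory, because the shingle filter needs separate tokens to combine. The field type name and the synonyms file name are hypothetical.

```xml
<!-- Sketch only: text_shingle_syn and synonyms_underscore.txt are hypothetical names -->
<fieldType name="text_shingle_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace so the shingle filter has individual words to combine -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit single words plus 2- and 3-word shingles, joined with "_" instead of a space -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3"
            outputUnigrams="true" tokenSeparator="_"/>
    <!-- Assumes the file holds underscore entries such as: love_to, loves_to => like_to -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_underscore.txt" ignoreCase="true"/>
    <!-- Graph filters at index time should be followed by a flatten step -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3"
            outputUnigrams="true" tokenSeparator="_"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_underscore.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Because every multi-word phrase becomes one underscore-joined token, the synonym filter only ever sees single-token rules. The price is the extra shingle tokens generated for each document and query.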

answered Sep 30 '22 08:09

msayef
