Solr associations

Tags: solr, lucene, search-engine

For the last couple of days we have been thinking about using Solr as our search engine of choice. Most of the features we need work out of the box or can be easily configured. There is, however, one feature that we absolutely need that seems to be well hidden (or missing) in Solr.

I'll try to explain with an example. We have lots of documents that are actually businesses:


<document>
  <name>Apache</name>
  <cat>1</cat>
  ...
</document>
<document>
  <name>McDonalds</name>
  <cat>2</cat>
  ...
</document>

In addition we have another xml file with all the categories and synonyms:


<cat id="1">
  <name>software</name>
  <synonym>IT</synonym>
</cat>
<cat id="2">
  <name>fast food</name>
  <synonym>restaurant</synonym>
</cat>

We want to associate businesses with categories so we can search using the name and/or synonyms of the category. But we do not want to merge these files at indexing time, because we need to be able to update the categories (adding/removing synonyms...) without indexing all the businesses again.

Is there anything in Solr that handles this kind of association, or do we need to develop some specific pieces ourselves?

All feedback and suggestions are welcome.

Thanks in advance, Tom


asked Apr 22 '10 08:04

Tom

2 Answers

Basically, you have a design decision here. The usual thing people do with Solr indexes is to denormalize them, i.e. explode the category definition into each business document. As you do not want to do this, I suggest keeping two types of documents: one for the businesses and another for the categories. You can keep both in the same index, as Solr does not require all documents to have the same fields. The business documents seem straightforward, but you have to make them searchable by both the business name and the category id. I suggest creating a category document for each synonym, so that you can search by synonym and find the id (and category name).

To search using synonyms, you will need a double search:

1. Search for the category id using the name's text.
2. Search for businesses using the category id.
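The two steps above could look roughly like the sketch below. It is only an illustration under assumptions: the endpoint URL and the field names type, synonym, cat_id and cat are hypothetical and depend on your schema, and fetch stands for whatever HTTP helper you use to get Solr's parsed JSON response. Only the q, fl and wt request parameters are standard Solr.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and core to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_url(query, fields):
    """Build a select URL using Solr's standard q, fl and wt parameters."""
    params = {"q": query, "fl": fields, "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def find_businesses(text, fetch):
    """Run the double search.

    `fetch` is any callable that takes a URL and returns Solr's JSON
    response already parsed into a dict.
    """
    # Step 1: look up category ids by category name or synonym text.
    # Field names (type, synonym, cat_id) are assumptions, not a fixed schema.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    cat_result = fetch(build_url(f'type:category AND synonym:"{escaped}"', "cat_id"))
    cat_ids = sorted({str(doc["cat_id"]) for doc in cat_result["response"]["docs"]})
    if not cat_ids:
        return []

    # Step 2: look up businesses by the category ids found in step 1.
    id_clause = " OR ".join(cat_ids)
    biz_result = fetch(build_url(f"type:business AND cat:({id_clause})", "name,cat"))
    return biz_result["response"]["docs"]
```

The cost of this design is two round trips per search, but changing a synonym only means reindexing the small category documents, not the businesses.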


answered Oct 06 '22 20:10

Yuval F

There is actually a filter class called solr.SynonymFilterFactory.

This should allow you to map each category number to its text equivalents, if you use it in the query analyser only. Something like the following:


<fieldType name="category" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="category_Synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That way you can index ONLY the category ID. This means you won't have to send all the businesses to Solr again. Also, if someone queries "software" or "IT", it will be mapped to the category ID.

Your category_Synonyms.txt should have lines such as the following:

1, software, IT

The only drawback here is that you'll have to come up with a way of editing the text file when you change the names or synonyms. So I guess this will only help if you change the category names infrequently, unless someone else knows of a way that this can be done easily.
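One possible way to keep the text file in sync is to regenerate it from the categories XML whenever that changes. The sketch below is a rough idea, not a tested tool: it assumes the categories file is well-formed XML (quoted id attributes, closed synonym tags) wrapped in a single root element, here called categories, which is my own invention. The function names are made up for illustration too.

```python
import xml.etree.ElementTree as ET


def synonym_lines(categories_xml):
    """Return one 'id, name, synonym, ...' line per <cat> element."""
    root = ET.fromstring(categories_xml)
    lines = []
    for cat in root.iter("cat"):
        # The id goes first so that, with expand="false", every other
        # term on the line is mapped to it.
        terms = [cat.get("id"), cat.findtext("name")]
        terms += [syn.text for syn in cat.findall("synonym")]
        lines.append(", ".join(t.strip() for t in terms if t))
    return lines


def write_synonyms_file(categories_xml, path="category_Synonyms.txt"):
    """Overwrite the synonyms file with freshly generated lines."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(synonym_lines(categories_xml)) + "\n")
```

Keep in mind that Solr reads the synonyms file when the core is loaded, so after rewriting it you would still need to reload the core or restart Solr for the change to take effect.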

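I actually added the above to my own Solr instance and ran the analysis tool on it. Here is the result:

[Image: analysis tool output, https://farm5.static.flickr.com/4052/4545705074_5dd70b0d2e_o_d.png]

As you can see, it has turned "software" into "1".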

Please note you MUST set the expand parameter to false.
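To show what the expand setting changes, here is a very simplified model of the mapping in Python. It is not Solr's implementation, just an illustration of the documented behaviour for comma-separated lines: with expand="false" every term on a line is reduced to the first one, while with expand="true" every term is expanded to all of them. The function names are made up.

```python
def build_synonym_map(lines, expand):
    """Model a comma-separated synonyms file with ignoreCase="true"."""
    mapping = {}
    for line in lines:
        terms = [t.strip().lower() for t in line.split(",") if t.strip()]
        for term in terms:
            # expand=False: map every term to the first term on the line.
            # expand=True: map every term to all terms on the line.
            mapping[term] = terms if expand else [terms[0]]
    return mapping


def analyze(token, mapping):
    """Return the tokens the query analyser would emit for one input token."""
    return mapping.get(token.lower(), [token.lower()])


reduced = build_synonym_map(["1, software, IT"], expand=False)
print(analyze("software", reduced))  # ['1']
print(analyze("IT", reduced))        # ['1']
```

With expand=False the query is reduced to the category ID alone, which is the only value stored in the index.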

I hope this helps.

Dave

answered Oct 06 '22 20:10

CraftyFella
