Search for short words with SOLR

Tags: solr, lucene

I am using SOLR with NGramTokenizerFactory to create search tokens for substrings of words.

The NGramTokenizer is configured with a minimum gram size of 3.

This means that I can search for e.g. "unb" and then match the word "unbelievable".
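
A field type along the following lines would reproduce the setup described above. This is only a sketch: the names text_ngram and content and the maxGramSize value are assumptions for illustration, not taken from the actual schema.

  <!-- Hypothetical field type: the tokenizer emits only grams of 3 to 15 characters -->
  <fieldType name="text_ngram" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
    </analyzer>
  </fieldType>

  <!-- Hypothetical field using that type -->
  <field name="content" type="text_ngram" indexed="true" stored="true"/>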

However, I have a problem with short words like "I" and "in". These are not indexed by SOLR (I suspect because of the NGramTokenizer), so I cannot search for them.

I don't want to reduce the minimum to 1 or 2, since this creates a huge search index. But I would like SOLR to include whole words whose length is already below this minimum.

How can I do that?

/Carsten

Carsten Gehling asked Jun 11 '10 08:06


2 Answers

First of all, find out why your words are not being indexed by Solr, using the "Analysis Tool":

http://localhost:8080/solr/admin/analysis.jsp

Enter the field and the text you are searching for, and see which analyzer is filtering out your short terms. I suggest doing this because you said you only suspect the NGramTokenizer, and you need to be certain which analyzer is dropping your data.

Then why don't you simply copy the term into another field that does not use that analyzer?

This way your terms will be indexed twice, and will appear both as exact words and as n-grams. You then have to deal with the scores of the two different fields.
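
As a rough sketch of this approach, the schema could hold one n-grammed field and one plain copy of it. The field and type names below (content, content_exact, text_exact) and the choice of tokenizer and lowercase filter are assumptions for illustration, and text_ngram stands for whatever n-gram field type you already have.

  <!-- Hypothetical type for whole words: no n-gramming, so "I" and "in" survive -->
  <fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- The existing n-grammed field plus an exact-word copy of it -->
  <field name="content" type="text_ngram" indexed="true" stored="true"/>
  <field name="content_exact" type="text_exact" indexed="true" stored="false"/>

  <!-- Copy the raw input of content into content_exact at index time -->
  <copyField source="content" dest="content_exact"/>

A query then has to cover both fields, for example content:in OR content_exact:in, and any boost between the two fields is something you would have to tune yourself.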

I hope this has helped you in some way.

Some links on aggregation and the copyField attribute:

Indexing data in multiple fields

Using copy field tag

Luca Molteni answered Oct 31 '22 09:10


I was just having a similar problem where I was trying to keep short words without creating a huge solr index.

So I came up with a simpler solution that doesn't need any new fields or copied values:

  <!-- Keep small words safe from the n-gram filter -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{2})$" replacement=" $1"/>

  <!-- Do the n-gramming -->
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  <filter class="solr.ReverseStringFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  <filter class="solr.ReverseStringFilterFactory"/>

  <!-- Remove the padding spaces -->
  <filter class="solr.TrimFilterFactory"/>

This pads a short word with just enough spaces to reach the minGramSize. Because the padded token is exactly the minimum size, the n-gram filters leave it as it is, and the TrimFilterFactory removes the padding afterwards.

Add further PatternReplaceFilterFactory filters if needed, for example for single characters:

  <!-- Protect single characters! (Two spaces) -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{1})$" replacement="  $1"/>
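
Putting the pieces together, the complete index-time chain might look like the sketch below. The field type name text_substring, the whitespace tokenizer and the separate query analyzer are assumptions added for illustration; only the filters themselves come from the snippets above.

  <fieldType name="text_substring" class="solr.TextField">
    <analyzer type="index">
      <!-- Assumed tokenizer: split on whitespace so each word is one token -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- Pad one- and two-character words to three characters -->
      <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{1})$" replacement="  $1"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{2})$" replacement=" $1"/>
      <!-- N-gram as before, then strip the padding -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
      <filter class="solr.ReverseStringFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
      <filter class="solr.ReverseStringFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <!-- Assumed: queries are only tokenized, so "in" matches the trimmed token -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

The order of the two padding filters does not matter here, since a word padded to three characters no longer matches either pattern.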
Random Person answered Oct 31 '22 10:10