Search for short words with SOLR

Question

I am using Solr with NGramTokenizerFactory to create search tokens for substrings of words.

The NGramTokenizer is configured with a minimum gram length (minGramSize) of 3.

This means that I can search for e.g. "unb" and then match the word "unbelievable".
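
For illustration, a setup like the one described might look roughly like this in schema.xml. It is only a sketch: the type name `text_ngram` and the `maxGramSize` value are assumptions and do not come from the actual schema.

```xml
<!-- Hypothetical field type; the name and maxGramSize are assumed for illustration -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <!-- No gram shorter than 3 characters is ever emitted -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
</fieldType>
```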

However, I have a problem with short words like "I" and "in". They are not indexed by Solr (I suspect because of the NGramTokenizer), so I cannot search for them.

I don't want to reduce the minimum length to 1 or 2, since that creates a huge search index. But I would like Solr to index whole words that are themselves shorter than this minimum.

How can I do that?

/Carsten

Luca Molteni · Accepted Answer

First of all, find out why your words are not being indexed by Solr, using the Analysis tool:

http://localhost:8080/solr/admin/analysis.jsp

Enter the field and the text you are searching for, and see which analyzer step drops your short term. I suggest doing this because you said you only suspect the cause, and you need to be certain which analyzer is filtering out your data.

Then why don't you simply copy the term into another field that does not use that analyzer?
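
As a rough sketch of this idea in schema.xml: the field names `title` and `title_exact` and the type names `text_ngram` and `text_plain` are hypothetical and do not come from the question. `text_ngram` stands for whatever n-gram field type is already in use.

```xml
<!-- All names here are hypothetical -->

<!-- Plain type without n-gramming, so short words survive as whole tokens -->
<fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Existing n-gram field plus a second field for exact words -->
<field name="title" type="text_ngram" indexed="true" stored="true"/>
<field name="title_exact" type="text_plain" indexed="true" stored="false"/>

<!-- Copy the raw input of title into title_exact at index time -->
<copyField source="title" dest="title_exact"/>
```

The copy is made from the raw field value before analysis, so each field applies its own analyzer to the same text.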

That way your terms are indexed twice, once as the exact word and once as n-grams. You then have to deal with the scores of the two different fields.
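
One possible way to handle the two scores is to query both fields with the dismax query parser and give the exact field a higher weight. A minimal sketch for solrconfig.xml follows. The handler name `/search`, the field names `title` and `title_exact`, and the boost of 2 are all assumptions chosen for illustration.

```xml
<!-- Hypothetical handler name and field names -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Search both fields; title_exact is boosted by a factor of 2 relative to title -->
    <str name="qf">title_exact^2 title</str>
  </lst>
</requestHandler>
```

The right boost depends on your data, so treat the value as a starting point to tune.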

I hope this has helped you in some way.

Some links on indexing into multiple fields and on the copyField directive:

Indexing data in multiple fields

Using copy field tag

Random Person · Answer

I just had a similar problem: I wanted to keep short words without creating a huge Solr index.

So I came up with a simpler solution that doesn't need any new fields or copied values:

  <!-- Keep small words safe from the n-gram filter: pad a two-character token with one leading space -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{2})$" replacement=" $1"/>

  <!-- Do the n-gramming: edge n-grams, reverse, edge n-grams again, reverse back -->
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  <filter class="solr.ReverseStringFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  <filter class="solr.ReverseStringFilterFactory"/>

  <!-- Remove the padding spaces -->
  <filter class="solr.TrimFilterFactory"/>

This pads a short word with just enough spaces to reach minGramSize. Because the padded token is exactly the minimum size, the n-gram filters emit it unchanged, and the TrimFilterFactory then strips the padding again.

Add further PatternReplaceFilterFactory filters before the n-gram filters if needed, for example for single characters:

  <!-- Protect single characters! (Two spaces) -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{1})$" replacement="  $1"/>
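
Put together, the whole chain could sit in a field type like the one below. This is a sketch under assumptions: the type name `text_substring`, the whitespace tokenizer, the lowercase filter and the separate query analyzer are my additions for illustration. Only the pattern, n-gram, reverse and trim filters come from the snippets above.

```xml
<!-- Hypothetical field type; tokenizer, lowercasing and query analyzer are assumed -->
<fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>

    <!-- Pad one- and two-character tokens up to three characters -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{1})$" replacement="  $1"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{2})$" replacement=" $1"/>

    <!-- Generate the substrings of 3 to 25 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    <filter class="solr.ReverseStringFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    <filter class="solr.ReverseStringFilterFactory"/>

    <!-- Strip the padding so short words are stored without spaces -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Query terms are only split and lowercased, not n-grammed -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a type like this, the padded tokens for "I" and "in" pass through the n-gram filters unchanged and are trimmed back to "i" and "in", so a query for either word can match. It is worth confirming the resulting tokens in the Analysis tool before reindexing.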
