 

Solr: combining EdgeNGramFilterFactory and NGramFilterFactory

Tags: java, solr, lucene

I have a situation where I need to use both EdgeNGramFilterFactory and NGramFilterFactory.

I am using NGramFilterFactory to perform a "contains"-style search with a minimum of 2 characters. I also want to match on the first letter alone, like a "startswith" search, using a front EdgeNGramFilterFactory.

I don't want to lower the NGramFilterFactory minimum to 1 character, because I don't want to index every single character.

Some help would be greatly appreciated.

Cheers

neolaser asked Aug 30 '11 05:08



2 Answers

You don't necessarily have to do all this in the same field. I would create separate fields, each with its own custom type, so that you can apply each treatment independently.

In the field types below:

  • text contains the original tokens, minimally processed;
  • text_ngram uses the NGramFilter for your two-character-minimum tokens;
  • text_first_letter uses EdgeNGram for your one-character initial-letter tokens.

If you're processing all text fields in this way, then you might be able to get away with using a copyField to populate the fields. Otherwise, you can instruct your Solr client to send in the same field values for the three separate field types.

When searching, include all of them in your searches with the qf parameter.
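As an illustration, qf can also be set as a default on the request handler instead of being sent with every request. This is only a sketch for solrconfig.xml: it assumes the dismax or edismax query parser (both read qf), and the handler name, the field names title, title_ngram and title_first_letter, and the boosts are all hypothetical.

```xml
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- qf is read by the dismax and edismax query parsers -->
    <str name="defType">edismax</str>
    <!-- Hypothetical fields; the boosts are arbitrary and only favor whole-token matches -->
    <str name="qf">title^3 title_first_letter^2 title_ngram</str>
  </lst>
</requestHandler>
```

The same value can be passed per request as the qf parameter if you would rather not change the handler defaults.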

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>

<fieldType name="text_first_letter" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1" side="front"/>
  </analyzer>
</fieldType>

Setting up the field and dynamicField definitions is left up to you. Or let me know if you have more questions and I can edit with clarifications.
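One possible shape for those definitions, as a sketch only: the field name title and its two derived fields are hypothetical, and the types are the three fieldTypes defined above. Each copyField feeds the same source text into a differently analyzed field.

```xml
<!-- Hypothetical field names; the types are the three fieldTypes defined above. -->
<field name="title"              type="text"              indexed="true" stored="true"/>
<field name="title_ngram"        type="text_ngram"        indexed="true" stored="false"/>
<field name="title_first_letter" type="text_first_letter" indexed="true" stored="false"/>

<!-- Copy the raw input of title into the two derived fields at index time. -->
<copyField source="title" dest="title_ngram"/>
<copyField source="title" dest="title_first_letter"/>
```

copyField copies the original input value, not the analyzed tokens, so each destination field runs its own analyzer chain on the same text.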

Nick Zadrozny answered Oct 01 '22 17:10



Start by applying the EdgeNGramFilter with min = 1 and max = 1000 (the large max ensures the entire original token is included). Example:

hello => 'h', 'he', 'hel', 'hell', 'hello'

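Second, apply the NGramFilter with min = 2 (the example uses max = 2 as well, for simplicity). By default NGramFilter discards tokens shorter than its minimum gram size, so check in Solr's analysis page that the single-character token survives this step in your version.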

'h', 'he', 'hel', 'hell', 'hello' => 'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo'

Because the NGramFilter has been applied to every "partial" token from the EdgeNGramFilter, you now have several identical tokens. Apply the RemoveDuplicatesTokenFilter to remove them.

'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo' => 'h', 'he', 'el', 'll', 'lo'

Now your field will support a single-character "startsWith" query and a multi-character "contains" query.

lindstromhenrik answered Oct 01 '22 17:10
