Using multiple tokenizers in Solr

Question

What I want to be able to do is perform a query and get results back that are not case sensitive and that match partial words from the index.

I have a Solr schema that has been modified so that queries return results regardless of case. So, if I search for iPOd, I will see iPod returned. The fieldType that does this is:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
...
</fieldType>

I have found this code that will allow us to do a partial word match query, but I don't think I can have two tokenizers on one field.

<fieldType name="text" class="solr.TextField" >
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
...
</fieldType>

So what can I do to perform this tokenizer on the field as well?
Or is there a way to merge them?
Or is there another way I can accomplish this task?

Mauricio Scheffer · Accepted Answer

Declare another fieldType (i.e. one with a different name) that uses the NGram tokenizer. Then declare one field that uses this NGram fieldType and another field that uses the standard "text" fieldType, and use copyField to copy one into the other. See "Indexing same data in multiple fields".
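
As a rough illustration of that setup, the schema.xml fragment below is a minimal sketch, not a tested configuration. The fieldType name text_ngram and the field names name and name_ngram are hypothetical, and the query-side analyzer is an assumption (it leaves the query terms whole so that they are compared against the indexed n-grams); adjust it to your data.

```xml
<!-- Hypothetical fieldType "text_ngram": n-grams are produced at index time only -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Assumption: the query is only split on whitespace and lowercased -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields: "name" keeps the existing "text" analysis, "name_ngram" holds the n-grams -->
<field name="name" type="text" indexed="true" stored="true"/>
<field name="name_ngram" type="text_ngram" indexed="true" stored="false"/>

<!-- Copy the raw input of "name" into "name_ngram" at index time -->
<copyField source="name" dest="name_ngram"/>
```

With something like this in place, a query can search both fields, for example by listing name and name_ngram in the qf parameter of a dismax query, so that whole-word matches and partial matches both contribute.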

Urobe · Answer

An alternative would be to apply the EdgeNGramFilterFactory to the existing field and keep your current tokenizer (WhitespaceTokenizerFactory), e.g.

<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />

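This keeps your current schema largely unchanged, i.e. you would not need an additional field with a different tokenizer (NGramTokenizerFactory). Note that edge n-grams only match from the start of each token, so this covers prefix matches rather than matches in the middle of a word.

Your fieldType would then look something like this: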

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
...
</fieldType>

Tags:

solr

tokenize

Matt Dell
