
How to create a case insensitive copy of a string field in SOLR?

How can I create a copy of a string field in case insensitive form? I want to use the typical "string" type and a case insensitive type. The types are defined like so:

    <fieldType name="string" class="solr.StrField"
        sortMissingLast="true" omitNorms="true" />

    <!-- A Case insensitive version of string type  -->
    <fieldType name="string_ci" class="solr.StrField"
        sortMissingLast="true" omitNorms="true">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>           
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
    </fieldType> 

And an example of the field like so:

<field name="destANYStr" type="string" indexed="true" stored="true"
    multiValued="true" />
<!-- Case insensitive version -->
<field name="destANYStrCI" type="string_ci" indexed="true" stored="false" 
    multiValued="true" />

I tried using CopyField like so:

<copyField source="destANYStr" dest="destANYStrCI" />

But apparently copyField is applied to source and dest before any analyzers are invoked, so even though I've specified through analyzers that dest is case insensitive, the case of the values copied from the source field is preserved.

I'm hoping to avoid re-transmitting the value in the field from the client, at record creation time.

harschware asked Jan 12 '10 23:01


2 Answers

With no answers from SO, I followed up on the SOLR users list. I found that my string_ci field was not working as expected before even considering the effects of copyField. Ahmet Arslan explains why the "string_ci" field should be using solr.TextField and not solr.StrField:

From apache-solr-1.4.0\example\solr\conf\schema.xml :

"The StrField type is not analyzed, but indexed/stored verbatim."

"solr.TextField allows the specification of custom text analyzers specified as a tokenizer and a list of token filters."

With an example he provided and a slight tweak of my own, the following field type definition seems to do the trick, and now copyField works as expected as well.

    <fieldType name="string_ci" class="solr.TextField"
        sortMissingLast="true" omitNorms="true">
        <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>           
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
    </fieldType> 
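Put together with the field definitions from the question, the relevant part of schema.xml might look roughly like the sketch below. The `<schema>`, `<types>` and `<fields>` wrappers, along with the schema name and version attributes, are assumptions modelled on the Solr 1.4 example schema; everything else is taken from the snippets above, and a real schema would contain more than this.

```xml
<!-- Trimmed sketch of schema.xml; name and version are assumed values -->
<schema name="example" version="1.2">
  <types>
    <!-- Exact-match type: StrField is indexed verbatim, no analysis -->
    <fieldType name="string" class="solr.StrField"
        sortMissingLast="true" omitNorms="true" />

    <!-- Case insensitive type: TextField runs the analyzer chain -->
    <fieldType name="string_ci" class="solr.TextField"
        sortMissingLast="true" omitNorms="true">
      <analyzer>
        <!-- Keep the whole value as a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <!-- Lowercase that token at both index and query time -->
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <field name="destANYStr" type="string" indexed="true" stored="true"
        multiValued="true" />
    <!-- Case insensitive version, filled by the copyField below -->
    <field name="destANYStrCI" type="string_ci" indexed="true" stored="false"
        multiValued="true" />
  </fields>

  <!-- The client sends only destANYStr; Solr copies the raw value into
       destANYStrCI, where the string_ci analyzer lowercases it -->
  <copyField source="destANYStr" dest="destANYStrCI" />
</schema>
```

With this in place the client only has to send destANYStr; a query such as destANYStrCI:FooBar should then match a document indexed with the value "foobar" or "FOOBAR".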

The destANYStrCI field keeps the original case in any stored value (only the indexed terms are lowercased) but provides a case insensitive field to search on. CAVEAT: case insensitive wildcard searching cannot be done, since wildcard terms bypass the query analyzer and are not lowercased before being matched against the index. This means that the characters in a wildcard term must already be lowercase in order to match.
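For readers on a newer Solr than the 1.4 release discussed here: as far as I know, Solr 3.6 and later can run wildcard and other multi-term queries through a separate "multiterm" analyzer, which may remove the caveat. The following is only a sketch under that assumption, not something tested against the setup above; check the analysis documentation for your Solr version before relying on it.

```xml
<!-- Sketch for Solr 3.6+ (assumed): explicit multiterm analyzer -->
<fieldType name="string_ci" class="solr.TextField"
    sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <!-- Applied to wildcard, prefix and range terms, so that a query
       like Foo* is lowercased before it is matched against the index -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

On Solr 1.4 itself the workaround remains lowercasing wildcard terms on the client before sending the query.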

harschware answered Nov 12 '22 02:11


Yes, true. LowerCaseFilterFactory does not apply to the string data type (solr.StrField); it can only be applied to text fields (solr.TextField).

If you try to do it this way:

<!-- Assigning customised data type -->
<field name="language" type="text_lower" indexed="true" stored="true"  multiValued="false" default="en"/>  

<!-- Defining customised data type for lower casing. -->
<fieldType name="text_lower" class="solr.StrField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

This will not work, because the analyzers are ignored on a string field. We have to use TextField.

Try it this way instead; it should work. Just change the class of the fieldType from the string type to solr.TextField:

<!-- Assigning customised data type -->
<field name="language" type="text_lower" indexed="true" stored="true"  multiValued="false" default="en"/>  

<!-- Defining customised data type for lower casing. -->
<fieldType name="text_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Manjunath Reddy answered Nov 12 '22 03:11