Solr Search Failing on Certain Characters

Question

I have a Solr collection that's not returning results for a few non-ASCII characters. The example we are using is the string S11. • “≡ «Ñaïvétý» ‘¢¥£’ ¶!#%; searching for that whole string returns no results even though I have an object with that in an indexed field. However, searching for substrings of that string does return matches. The only characters that cause Solr to return no matches are three in the middle: • “≡. The field was indexed as text_en but I have also tried edge_ngram (hoping for a bit of Cargo Cult magic to fix the problem). Is there something special about these three characters or do I need to tweak how Solr indexes the fields?

We are searching via django-haystack but the problem shows up in the Solr admin as well.

Here are the two field type definitions:

<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory"        
            generateWordParts="1" generateNumberParts="1" catenateWords="0" 
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.EdgeNGramFilterFactory" 
            minGramSize="2" maxGramSize="50" side="front" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory" 
            generateWordParts="1" generateNumberParts="1" catenateWords="0" 
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
            ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>

jay · Accepted Answer

Have you tried using ASCIIFoldingFilterFactory

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>

Abhijit Bashetti · Answer

Can you try this ...

<fieldType name="text_reference" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="back"/>
       </analyzer>
       <analyzer type="query">
          <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
    </fieldType>

enter image description here

Solr Search Failing on Certain Characters

Tags:

solr

Tom

2 Answers

jay

Abhijit Bashetti

Recent Activity

Donate For Us

Solr Search Failing on Certain Characters

Tags:

solr

Tom

2 Answers

jay

Abhijit Bashetti

Related questions

Recent Activity

Donate For Us