Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Search in solr with special characters

I have a problem with a search with special characters in solr. My document has a field "title" and sometimes it can be like "Titanic - 1999" (it has the character "-"). When i try to search in solr with "-" i receive a 400 error. I've tried to escape the character, so I tried something like "-" and "\-". With that changes solr doesn't response me with an error, but it returns 0 results.

How can i search in the solr admin with that special character(something like "-" or "'"???

Regards

UPDATE Here you can see my current solr scheme https://gist.github.com/cpalomaresbazuca/6269375

My search is to the field "Title".

excerpt from the schema.xml:

 ...
 <!-- A general text field that has reasonable, generic
     cross-language defaults: it tokenizes with StandardTokenizer,
     removes stop words from case-insensitive "stopwords.txt"
     (empty by default), and down cases.  At query time only, it
     also applies synonyms. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
             -->
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
    </fieldType>
...
<field name="Title" type="text_general" indexed="true" stored="true"/>
like image 915
shinjidev Avatar asked Aug 16 '13 16:08

shinjidev


3 Answers

You are using the standard text_general field for the title attribute. This might not be a good choice. text_general is meant to be for huge chunks of text (or at least sentences) and not so much for exact matching of names or titles.

The problem here is that text_general uses the StandardTokenizerFactory.

 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
             -->
            <filter class="solr.LowerCaseFilterFactory"/>
        
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            
        </analyzer>
    </fieldType>

StandardTokenizerFactory does the following:

A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types.

This means the '-' character will be completely ignored and be used to tokenize the String.

"kong-fu" will be represented as "kong" and "fu". The '-' disappears.

This does also explain why select?q=title:\- won't work here.

Choose a better fitting field type:

Instead of the StandardTokenizerFactory you could use the solr.WhitespaceTokenizerFactory, that only splits on whitespace for exact matching of words. So making your own field type for the title attribute would be a solution.

Solr also has a fieldtype called text_ws. Depending on your requirements this might be enough.

like image 167
jHilscher Avatar answered Oct 14 '22 10:10

jHilscher


To search for your exact phrase put inverted commas round it:

select?q=title:"Titanic - 1999" 

If you just want to search for that special character then you will need to escape it:

select?q=title:\-

Also check: Special characters (-&+, etc) not working in SOLR Query

If you know exactly which special characters you dont want to use then you can add this to the regex-normalize.xml

<regex> 
  <pattern>&#x2D;</pattern> 
  <substitution>%2D</substitution> 
</regex>

This will replace all "-" with %2D, so when you search, as long as you search for %2D instead of the "-" it will work fine

like image 24
Allan Macmillan Avatar answered Oct 14 '22 10:10

Allan Macmillan


I spent a lot of time getting this done. Here is a clear step-by-step things to be done to query special characters in SolR. Hope it helps someone.

  1. Edit the schema.xml file and find the solr.TextField that you are using.
  2. Under both, "index" and query" analyzers modify the WordDelimiterFilterFactory and add types="characters.txt" Something like:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
     <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
    </analyzer>
    <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
    </analyzer>
    </fieldType>
    
  3. Ensure that you use WhitespaceTokenizerFactory as the tokenizer as shown above.

  4. Your characters.txt file can have entries like-

     \# => ALPHA
    @ => ALPHA
    \u0023 => ALPHA
                    ie:- pointing to ALPHA only.
    
  5. Clear the data, re-index and query for the entered characters. It will work.

like image 36
zorze Avatar answered Oct 14 '22 09:10

zorze