
Solr Partial And Full String Match

I am trying to allow searches on partial strings in Solr so if someone searched for "ppopota" they'd get the same result as if they searched for "hippopotamus." I read the documentation up and down and feel like I have exhausted my options. So far I have the following:

Defining a new field type:

<fieldtype name="testedgengrams" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
</fieldtype>

Defining a field of type "testedgengrams":

<field name="text_ngrams" type="testedgengrams" indexed="true" stored="false"/>

Copying contents of text_ngrams into text:

<copyField source="text_ngrams" dest="text"/>

Alas, that doesn't work. What am I missing?

Scripthead asked Jan 28 '11 04:01


4 Answers

You're using EdgeNGramFilterFactory which generates tokens 'hi', 'hip', 'hipp', etc, so it won't match 'ppopota'. Use NGramFilterFactory instead.
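
As a rough sketch, the field type from the question with the filter swapped might look like this. The name testngrams is made up for illustration, and the gram sizes are simply carried over from the question:

```xml
<!-- Hypothetical variant of the question's field type; "testngrams" is an assumed name. -->
<fieldType name="testngrams" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <!-- EdgeNGram (front) only emits prefixes of "hippopotamus": hi, hip, hipp, ...
         NGram emits every substring of 2 to 15 characters, including "ppopota". -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>
```

Note that NGramFilterFactory takes no side attribute, and the index will grow because many more tokens are stored per word.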

Mauricio Scheffer answered Nov 09 '22 16:11


To enable partial word searching

You must edit your local schema.xml file, usually under solr/config, to add either:

  1. NGramFilterFactory
  2. EdgeNGramFilterFactory

Here's what mine looks like: sample solr schema.xml

Here's the line to paste:

<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>

EdgeNGram

I went with the EdgeN option. It doesn't allow for searching in the middle of words, but it does allow partial word search starting from the beginning of the word. This cuts way down on false positives / matches you don't want, performs better, and is usually not missed by the users. Also, I like the minGramSize=2 so you must enter a minimum of 2 characters. Some folks set this to 3.
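
For context, here is a minimal sketch of where that filter line could sit inside a field type. The field type name text, the tokenizer, and the lowercase filter are assumptions rather than something taken from the linked schema; only the EdgeNGramFilterFactory line comes from above:

```xml
<!-- Hypothetical field type; only the EdgeNGramFilterFactory line is from the answer. -->
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <!-- Split the input into words, then lowercase them. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit prefixes of 2 to 15 characters for each word: "hi", "hip", "hipp", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
</fieldType>
```

Raising minGramSize to 3 would drop the two-letter prefixes, which is the stricter setting mentioned above.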

Once your local setup is working, you must also edit the schema.xml used by websolr. Otherwise you will get the default behavior, which requires the full word to be entered even if you have full-text searching configured for your models.

Take it to the next level

5 ways to speed up indexing

Special instructions for editing the websolr schema.xml if you are using Heroku

  1. Go to the Heroku online dashboard for your app
  2. Go to the resources tab, then click on the Websolr add-on
  3. Click the default link under Indexes
  4. Click on the Advanced Configuration link
  5. Paste in your schema.xml from your local, including the config for your Ngram tokenizer of choice (mentioned above). Save.
  6. Copy the link in the "Configure your Heroku application" box, then paste it into terminal to set your WEBSOLR_URL link in your heroku config.
  7. Click the Index Status link to get nifty stats and see if you are running fast or slow.
  8. Reindex everything

heroku run rake sunspot:reindex[5000]

  • Don't use heroku run rake sunspot:solr:reindex - it is deprecated, accepts no parameters and is WAY slower
  • The default batch size is 50. Most people suggest using 1000, but I've seen significantly faster results (1000 rows per second as opposed to around 500) by bumping it up to 5000+
Aaron Henderson answered Nov 09 '22 16:11


OK, I'm doing the same thing with a field named name_de, and I managed to get it working using copyField like this:

schema.xml

<schema name="solr-magento" version="1.2">
    <types>
       ...
        <fieldType name="type_name_de_partial" class="solr.TextField">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="front" />
                <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" side="back" />
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.TrimFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.TrimFilterFactory" />
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords_de.txt"/>
            </analyzer>
        </fieldType>
    </types>

    ...

    <fields>
        ...
        <field name="name_de_partial" type="type_name_de_partial" indexed="true" stored="true"/>
    </fields>

    ....

    <copyField source="name_de" dest="name_de_partial" />
</schema>

Then define the search handler in solrconfig.xml:

<requestHandler name="magento_de" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="echoParams">explicit</str>
        <str name="tie">0.01</str>                                          <!-- Tie breaker -->
        <str name="qf">name_de_partial^1.0 name_de^3.0</str>                <!-- Query Fields -->
        <str name="pf">name_de_partial^1.0 name_de^3.0</str>                <!-- Phrase Fields -->
        <str name="mm">3&lt;90%</str>                                       <!-- Minimum 'Should' Match [up to 3 clauses: all must match; more than 3: 90%] -->
        <int name="ps">100</int>                                            <!-- Phrase Slop -->
        <str name="q.alt">*:*</str>
        ..
    </lst>
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>
</requestHandler>

With this, Solr searches the field name_de_partial with a boost of 1.0 and name_de with a boost of 3.0.

So if the engine finds a query word in name_de, that document is ranked near the top of the list. If it also finds a match in name_de_partial, that counts too and the document is included in the results.

The field name_de_partial uses n-gram filters, so it can find the word "hippie" from the query "hip", "ppie", or "ippi" without breaking a sweat.
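
A possible simplification, offered only as a sketch: as far as I know, side is an EdgeNGramFilterFactory option rather than an NGramFilterFactory one, and a single NGramFilterFactory already emits every substring in the configured size range. The field type name type_name_de_ngram and the maxGramSize of 15 below are assumptions, not part of the original setup:

```xml
<!-- Hypothetical, trimmed-down variant; the name and gram sizes are assumptions. -->
<fieldType name="type_name_de_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase first so the grams match lowercased queries. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- One NGram filter emits every substring of 3 to 15 characters,
         so "hippie" yields "hip", "ippi", "ppie", and so on. -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams at query time: the typed fragment is matched as a single term. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With these sizes, query fragments shorter than 3 or longer than 15 characters would not match anything in this field, so the limits are worth tuning to the data.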

wormhit answered Nov 09 '22 17:11


If you apply EdgeNGramFilterFactory or NGramFilterFactory at both index and query time, combined with q.op=AND (or the default mm=100% if you are using dismax), you will run into problems, since the query terms are then split into grams as well.

Try defining NGramFilterFactory only at index time:

<fieldType name="testedgengrams" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
</fieldType>

Or try setting q.op=OR (or mm=1 if you are using dismax).
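
For the dismax case, a minimal sketch of setting mm=1 as a handler default in solrconfig.xml could look like the following. The handler name /partial is hypothetical, and text_ngrams is the field from the question:

```xml
<!-- Hypothetical handler; "/partial" is an assumed name. -->
<requestHandler name="/partial" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Search the n-gram field defined in the question. -->
    <str name="qf">text_ngrams</str>
    <!-- Require only one optional clause to match. -->
    <str name="mm">1</str>
  </lst>
</requestHandler>
```

The same value can also be passed per request as the mm parameter instead of being fixed in the handler defaults.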

Andre85 answered Nov 09 '22 17:11