
EdgeNGramFilterFactory change in solr5

Short version:

Does anyone know whether something changed in EdgeNGramFilterFactory for Solr 5? It used to work fine on Solr 4, but I just upgraded to Solr 5 and the cores with fields using this filter refuse to load ...

Long story:

This configuration used to work in solr4.10 (schema.xml):

<field name="NAME" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="PP" type="text_prefix" indexed="true" stored="false" required="false" multiValued="false"/>

<copyField source="NAME" dest="PP"/>

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>

The documentation says I did it right, although it does not clearly state whether it applies to Solr 4 or Solr 5.

However, when I am trying to add a collection using this configuration, it fails with the following message:

<lst name="failure">
<str>
   org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://localhost:8983/solr: Error CREATEing SolrCore 'test_collection': Unable to create core [test_collection] Caused by: Unknown parameters: {side=front}</str>
</lst>

I removed the side=front "unknown" parameter, started from scratch and it worked - meaning no more errors.

So, while it used to work in Solr 4 without any additional change, it no longer works in Solr 5. Did something change? Did I miss any documentation about this filter? Is there an extra library I need to load to make this work?

And finally, if the above is intended behavior (bug/feature/whatever), is there a workaround that gives me this "side-substring" indexing functionality without having to generate the values myself when I add docs to Solr?

Update: with the "hacked" schema (i.e. without side=front), I indexed the documents and changed the PP field to be stored. When I searched, it looked like it indexes the entire value. For example, for NAME:ELEPHANT, I found PP:ELEPHANT ...

asked Mar 02 '15 10:03 by dcg



1 Answer

The side attribute was removed as part of LUCENE-3907 in version 4.4. The filter now always behaves as if you had set side="front". Since you were using it the "front" way anyway, you can simply remove the attribute and the behavior stays the same.
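As a minimal sketch, the field type from the question with only the side attribute dropped would look like the following. Everything else is copied from the question's schema; nothing here is new apart from the comments.

```xml
<!-- Sketch: the question's text_prefix type, minus the removed side attribute -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- Keep the whole field value as a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- No side attribute: grams are always built from the front of the token -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
        <!-- Query text is not n-grammed, so a typed prefix matches an indexed gram -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>
```

With this configuration, ELEPHANT should be indexed as the terms EL, ELE, ELEP and so on up to ELEPHANT. Note that a stored field returns the original input rather than the analyzed terms, so seeing PP:ELEPHANT in search results does not by itself mean the grams are missing; a prefix query such as PP:ELE is a more direct check.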

As one comment in the discussion on the linked Lucene issue puts it:

If you need reverse n-grams, you could always add a filter to do that afterwards. There is no need to have this as separate logic in this filter. We should split logic and keep filters as simple as possible.

And this is what has been done. The side attribute has been removed from the filter.

The change was made in Lucene, not directly in Solr. Because Lucene is a Java API, it is documented in the filter's Javadoc:

As of Lucene 4.4, this filter does not support EdgeNGramTokenFilter.Side.BACK (you can use ReverseStringFilter up-front and afterward to get the same behavior), handles supplementary characters correctly and does not update offsets anymore.

This may be why you did not find a word about it in the Solr documentation. The change is, however, also mentioned in Lucene's change log.
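If you ever do need the old side="back" behavior, the ReverseStringFilter approach quoted from the Javadoc can be sketched in schema.xml roughly as follows. The field type name text_suffix is hypothetical and not part of the original schema, and the gram sizes are simply copied from the question.

```xml
<!-- Sketch: "back" edge n-grams via reverse, front n-gram, reverse again -->
<!-- The name text_suffix is an assumption made for illustration -->
<fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- ELEPHANT becomes TNAHPELE -->
        <filter class="solr.ReverseStringFilterFactory"/>
        <!-- Front grams of the reversed token: TN, TNA, TNAH, ... -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
        <!-- Reverse each gram back: NT, ANT, HANT, ... -->
        <filter class="solr.ReverseStringFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>
```

A field of this type would be filled through a copyField from NAME in the same way as PP, so the suffix values still do not have to be generated on the client side.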

answered Nov 02 '22 08:11 by cheffe
