Given the following set of values how do I configure the field to return values that are partial word matches but that also match the entire search term?
Values:
Texas State University
Stanford University
St. Johns College
Search Term: sta
Desired Results:
Texas State University
Stanford University
Search Term: stan
Desired Results:
Stanford University
Search Term: st un
Desired Results:
Texas State University
Stanford University
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
</fieldType>
I think my problem is with the EdgeNGramFilterFactory
. As shown above, the second search for stan
returns all three of the values shown instead of only Stanford
. But, without the EdgeNGramFilterFactory
, partial words don't match at all.
What is the correct configuration for a Solr field to return values that are partial word matches but that also match the entire search term?
I had similar kind of requirment and tried this ... created different field Type...
<fieldType name="text_reference" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I another another requirement... The below blog will explain it in detail
https://www.blogger.com/blogger.g?blogID=8592878860404675342#editor/target=post;postID=6309840933546641223;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=33;src=postname
I think I figured it out. I definitely welcome other answers and additional corrections though.
The solution appears to be to only use the EdgeNGramFilterFactory
when indexing, not when querying. This makes sense when you think about it. I want n-grams when indexing but only want to match the actual search term when querying.
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
You can use
N-Gram Filter
Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by gramsize.
Factory class:solr.NGramFilterFactory
Arguments:
minGramSize: (integer, default 1) The minimum gram size. maxGramSize: (integer, default 2) The maximum gram size.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory"/>
</analyzer>
In: "four score"
Out: "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"
http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.3.pdf#page=112&zoom=auto,-187,475
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With