
Sitecore Lucene index: search term with a space should match the same word without a space

This seems so simple that I'm convinced I must be overlooking something. I cannot establish how to do the following in Lucene:

The problem

  • I'm searching for place names.
  • I have a field called Name
  • It is using Lucene.Net.Analysis.Standard.StandardAnalyzer
  • It is TOKENIZED
  • The value of Name contains one space: halong bay.
  • The search term may or may not contain an extra space due to culturally different spellings or genuine spelling mistakes. E.g. ha long bay instead of halong bay.
  • If I use the term halong bay I get a hit.
  • If I use the term ha long bay I do not get a hit.

The attempted solution

Here's the code I'm using to build my predicate using LINQ to Lucene from Sitecore:

var searchContext = ContentSearchManager.GetIndex("my_index").CreateSearchContext();
var term = "ha long bay";
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Name == term);
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);

I have also tried a fuzzy match using the .Like() extension:

var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Like(term));

This also yields no results for ha long bay.

How do I configure Lucene in Sitecore to return a hit for both halong bay and ha long bay search terms, ideally without having to do anything fancy with the input term (e.g. stripping space, adding wildcards, etc)?

Note: I recognise that this would also allow the term h a l o n g b a y to produce a hit, but I don't think I have a problem with this.

asked Aug 17 '16 at 16:08 by theyetiman


1 Answer

A TOKENIZED field means that the analyzer splits the field value into tokens (at the space, in this case) and adds the resulting terms to the index dictionary. If you index "halong bay" in such a field, it creates the terms "halong" and "bay".

It is therefore expected that the search engine does not retrieve this result for the "ha long" search query: the index contains no "ha" or "long" terms to match against.
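To make the mismatch concrete, here is a minimal C# sketch. It assumes a simplified analysis step (lowercase, then split on whitespace) as a rough stand-in for StandardAnalyzer, which does more than this. The names TokenDemo, Tokenize and MissingTerms are made up for illustration and are not part of Lucene or Sitecore.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenDemo
{
    // Simplified stand-in for analysis: lowercase, then split on whitespace.
    // This is NOT StandardAnalyzer, only an approximation for plain place names.
    public static string[] Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // Returns the query terms that have no counterpart among the indexed terms.
    public static string[] MissingTerms(string indexedValue, string query)
    {
        var indexed = new HashSet<string>(Tokenize(indexedValue));
        return Tokenize(query).Where(t => !indexed.Contains(t)).ToArray();
    }

    public static void Main()
    {
        // Every query term exists in the index, so nothing is missing.
        Console.WriteLine(string.Join(", ", MissingTerms("halong bay", "halong bay")));

        // "ha" and "long" were never indexed; prints "ha, long".
        Console.WriteLine(string.Join(", ", MissingTerms("halong bay", "ha long bay")));
    }
}
```

The second call shows which query terms the index has never seen, which is why the extra space breaks the match.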

A manual approach would be to define all the other ways to write the place name in another multi-value computed index field named AlternateNames. Then you could issue this kind of query: Name==query OR AlternateNames==query.

An automatic approach would be to also index the place names without spaces in a separate computed index field named CompactName. Then you could issue this kind of query: Name==query OR CompactName==compactedQueryWithoutSpaces
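The compaction step behind CompactName could look roughly like the sketch below. It is plain C# with no Sitecore dependency. The Compact helper and the CompactNameDemo class are hypothetical names, and lowercasing is an assumption made here so that the comparison does not depend on how the field is analyzed.

```csharp
using System;
using System.Linq;

public static class CompactNameDemo
{
    // Hypothetical helper: drop all whitespace and lowercase the rest.
    // The same function would have to be applied at index time (to the
    // place name) and at query time (to the search term).
    public static string Compact(string value) =>
        new string(value.Where(c => !char.IsWhiteSpace(c)).ToArray())
            .ToLowerInvariant();

    public static void Main()
    {
        // Value that the CompactName field would hold for "halong bay".
        string indexedCompactName = Compact("halong bay");

        // All three spellings collapse to the same compact value.
        foreach (var term in new[] { "halong bay", "ha long bay", "h a l o n g b a y" })
        {
            Console.WriteLine($"{term} -> {Compact(term) == indexedCompactName}");
        }
    }
}
```

In Sitecore, a helper like this would presumably be called from the computed index field that fills CompactName, and again on the search term before the two comparisons are combined with OR. Note that it also matches h a l o n g b a y, which the question says is acceptable.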

I hope this helps

Jeff

answered Oct 10 '22 at 20:10 by Jean-François L'Heureux