Matching an entire sentence with spaces in a Lucene BooleanQuery

I have the following strings:

Tulip INN Riyadhh
Tulip INN Riyadhh LUXURY
Suites of Tulip INN RIYAHdhh

I need a search such that, if I enter

*Tulip INN Riyadhh*

it returns all three of the strings above. I am restricted to doing this without QueryParser or an Analyzer; it has to use only BooleanQuery, WildcardQuery, and similar query classes.

Regards, Raghavan

Raghavan asked May 18 '17 16:05


2 Answers

What you need here is a PhraseQuery. Let me explain.

I don't know which analyzer you're using, but for simplicity I'll suppose you have a very basic one that just converts text to lowercase. Don't tell me you're not using an analyzer: Lucene needs one to do any work, at least at the indexing stage, because it defines the tokenizer and the token filter chain.

Here's how your strings would be tokenized in this example (I'm assuming the RIYAHdhh in your third string is a typo for Riyadhh):

  • tulip inn riyadhh
  • tulip inn riyadhh luxury
  • suites of tulip inn riyadhh

Notice how these all contain the token sequence tulip inn riyadhh. A sequence of tokens is what a PhraseQuery is looking for.

In Lucene.Net building such a query looks like this (untested):

// Terms are added in the order they must appear in the document.
var query = new PhraseQuery();
query.Add(new Term("propertyName", "tulip"));
query.Add(new Term("propertyName", "inn"));
query.Add(new Term("propertyName", "riyadhh"));

Note that the terms need to match those produced by the analyzer (in this example, they're all lowercase). The QueryParser does this job for you by running parts of the query through the analyzer, but you'll have to do it yourself if you don't use the parser.
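
To make "do it yourself" concrete, here is a minimal sketch of normalizing the input by hand. It assumes the index was built with an analyzer that only splits on whitespace and lowercases; ManualAnalysis and ToTerms are hypothetical names, not part of Lucene.Net. If your index was built with something like StandardAnalyzer (which also drops stop words such as "of" and splits on punctuation), this stand-in would need to mimic that too.

```csharp
using System;
using System.Linq;

public static class ManualAnalysis
{
    // Hypothetical stand-in for a very simple analyzer:
    // split on whitespace, drop empty pieces, lowercase each token.
    public static string[] ToTerms(string text)
    {
        return text
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.ToLowerInvariant())
            .ToArray();
    }
}
```

Each string returned by ToTerms("Tulip INN Riyadhh") would then be wrapped in a Term and passed to query.Add(...) in order, as in the PhraseQuery snippet above.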

Now, why wouldn't WildcardQuery or RegexQuery work in this situation? These queries always match a single term, yet you need to match an ordered sequence of terms. For instance, a WildcardQuery with the term riyadhh* would only find individual terms starting with riyadhh.

A BooleanQuery with a collection of TermQuery MUST clauses would match any text that happens to contain these 3 terms in any order - not exactly what you want either.
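
The difference between the two behaviours can be sketched without Lucene at all. The code below is only an illustration of the matching semantics on already-tokenized text, not how Lucene implements either query; MatchSemantics and its methods are hypothetical names.

```csharp
using System;
using System.Linq;

public static class MatchSemantics
{
    // Roughly what a BooleanQuery of MUST TermQuery clauses checks:
    // every query term is present somewhere in the document.
    public static bool ContainsAllTerms(string[] docTokens, string[] queryTerms)
    {
        return queryTerms.All(term => docTokens.Contains(term));
    }

    // Roughly what a PhraseQuery with no slop checks:
    // the query terms occur consecutively and in order.
    public static bool ContainsPhrase(string[] docTokens, string[] queryTerms)
    {
        for (int start = 0; start + queryTerms.Length <= docTokens.Length; start++)
        {
            if (docTokens.Skip(start).Take(queryTerms.Length).SequenceEqual(queryTerms))
            {
                return true;
            }
        }
        return false;
    }
}
```

For a document tokenized as inn, riyadhh, tulip, ContainsAllTerms accepts the query tulip inn riyadhh while ContainsPhrase rejects it, which is why the MUST-clause approach returns more than you asked for.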

Lucas Trzesniewski answered Nov 09 '22 16:11


Lucas has the right idea, but there is also MultiPhraseQuery, a generalization of PhraseQuery that can be built from the terms already in the index to get a prefix match, as demonstrated in this unit test. The documentation of MultiPhraseQuery reads:

MultiPhraseQuery is a generalized version of PhraseQuery, with an added method Add(Term[]). To use this class, to search for the phrase "Microsoft app*" first use Add(Term) on the term "Microsoft", then find all terms that have "app" as prefix using IndexReader.GetTerms(Term), and use MultiPhraseQuery.Add(Term[] terms) to add them to the query.

As Lucas pointed out, a *something WildcardQuery is the way to do the suffix match, provided you understand the performance implications (a leading wildcard forces Lucene to scan every term in the field).

They can then be combined with a BooleanQuery to get the result you want.

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
using System;
using System.Collections.Generic;

namespace LuceneSQLLikeSearch
{
    class Program
    {
        static void Main(string[] args)
        {
            // Prepare...
            var dir = new RAMDirectory();
            var writer = new IndexWriter(dir, 
                new IndexWriterConfig(LuceneVersion.LUCENE_48, 
                new StandardAnalyzer(LuceneVersion.LUCENE_48)));

            WriteIndex(writer);

            // Search...
            var reader = writer.GetReader(false);

            // Match documents containing any term that ends with "tulip"
            var wildCardQuery = new WildcardQuery(new Term("field", "*tulip"));
            var multiPhraseQuery = new MultiPhraseQuery();

            multiPhraseQuery.Add(new Term("field", "inn"));

            // Get all terms that start with riyadhh
            multiPhraseQuery.Add(GetPrefixTerms(reader, "field", "riyadhh"));

            var query = new BooleanQuery();
            query.Add(wildCardQuery, Occur.SHOULD);
            query.Add(multiPhraseQuery, Occur.SHOULD);

            var result = ExecuteSearch(writer, query);

            foreach (var item in result)
            {
                Console.WriteLine("Match: {0} - Score: {1:0.0########}", 
                    item.Value, item.Score);
            }

            Console.ReadKey();
        }
    }
}

WriteIndex

public static void WriteIndex(IndexWriter writer)
{
    Document document;

    document = new Document();
    document.Add(new TextField("field", "Tulip INN Riyadhh", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "Tulip INN Riyadhh LUXURY", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "Suites of Tulip INN RIYAHdhh", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "Suites of Tulip INN RIYAHdhhll", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "myTulip INN Riyadhh LUXURY", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "some bogus data that should not match", Field.Store.YES));
    writer.AddDocument(document);

    writer.Commit();
}

GetPrefixTerms

Here we scan the index to find all of the terms that start with the passed-in prefix. The terms are then added to the MultiPhraseQuery.

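public static Term[] GetPrefixTerms(IndexReader reader, string field, string prefix)
{
    var result = new List<Term>();

    // GetTerms returns null when the field has no indexed terms.
    Terms terms = MultiFields.GetTerms(reader, field);
    if (terms == null)
    {
        return result.ToArray();
    }

    TermsEnum te = terms.GetIterator(null);

    // SeekCeil moves to the first term >= prefix; END means there is no such term.
    if (te.SeekCeil(new BytesRef(prefix)) == TermsEnum.SeekStatus.END)
    {
        return result.ToArray();
    }

    do
    {
        string s = te.Term.Utf8ToString();
        if (s.StartsWith(prefix, StringComparison.Ordinal))
        {
            result.Add(new Term(field, s));
        }
        else
        {
            // Terms are sorted, so the first non-match ends the prefix range.
            break;
        }
    } while (te.Next() != null);

    return result.ToArray();
}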

ExecuteSearch

public static IList<SearchResult> ExecuteSearch(IndexWriter writer, Query query)
{
    var result = new List<SearchResult>();
    var searcherManager = new SearcherManager(writer, true, null);
    // Execute the search with a fresh indexSearcher
    searcherManager.MaybeRefreshBlocking();

    var searcher = searcherManager.Acquire();
    try
    {
        var topDocs = searcher.Search(query, 10);
        foreach (var scoreDoc in topDocs.ScoreDocs)
        {
            var doc = searcher.Doc(scoreDoc.Doc);
            result.Add(new SearchResult
            {
                Value = doc.GetField("field")?.GetStringValue(),
                // Results are automatically sorted by relevance
                Score = scoreDoc.Score,
            });
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e.ToString());
    }
    finally
    {
        searcherManager.Release(searcher);
        searcher = null; // Don't use searcher after this point!
    }

    return result;
}

SearchResult

public class SearchResult
{
    public string Value { get; set; }
    public float Score { get; set; }
}

If this seems cumbersome, note that QueryParser can mimic a "SQL LIKE" query. As pointed out here, QueryParser has an AllowLeadingWildcard option that makes it easy to build up the correct query sequence. It is unclear why you have a constraint that you can't use it, as it is definitely the simplest way to get the job done.

NightOwl888 answered Nov 09 '22 14:11