Lucene Highlighter sometimes inexplicably returns blank fragments

Tags:

c#

lucene.net

I've been working on a Lucene document search program for the last few days and everything had been going well until now. I'm trying to use the Lucene.Net.Highlight.Highlighter class to show relevant snippets for my search results, but it isn't working consistently. Most of the time, calling Highlighter.GetBestFragments() does exactly what I'd expect (it shows relevant text snippets containing the given query string), but sometimes it just returns an empty string.

I've triple checked my inputs and I can verify that the query string I'm using exists in the input text, but the highlighter just arbitrarily returns an empty string sometimes. The problem is reproducible; documents that have blank fragments returned will continue to have blank fragments returned when using the same query, while documents that have legitimate fragments continue to have legitimate fragments.

However, the problem is NOT document-specific. Some queries return valid fragments for a document where other queries return an empty string for the same document. The problem also does not appear to be related to my analyzer; it shows up whether I use a StandardAnalyzer or a SnowballAnalyzer.

After many hours of poking around I have been unable to find any pattern in the queries/documents that fail versus those that work. Keep in mind that this is happening on documents that were specifically pulled back from the Lucene index using the exact same query. That means the Searcher is able to find the relevant query string in the target document but the Highlighter is not.

Is this a bug in Lucene? If so, how can I work around it?

My code:

private static SimpleHTMLFormatter _formatter = new SimpleHTMLFormatter("<b>", "</b>");
private static SimpleFragmenter _fragmenter = new SimpleFragmenter(50);
...
{
    using (var searcher = new IndexSearcher(analyzerInfo.Directory, false))
    {
        QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "Text", analyzerInfo.Analyzer);
        parser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        //build query
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.Add(new TermQuery(new Term("PageNum", "0")), BooleanClause.Occur.MUST);
        booleanQuery.Add(parser.Parse(searchQuery), BooleanClause.Occur.MUST);
        Query query = booleanQuery.Rewrite(searcher.GetIndexReader());

        //get results from query
        ScoreDoc[] hits = searcher.Search(query, 50).ScoreDocs;
        List<DVDoc> results = hits.Select(hit => MapLuceneDocumentToData(searcher.Doc(hit.Doc))).ToList();

        //add relevant fragments to search results (shows WHY a certain result was chosen)
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(_formatter, scorer);
        highlighter.SetTextFragmenter(_fragmenter);
        foreach (DVDoc result in results)
        {
            TokenStream stream = analyzerInfo.Analyzer.TokenStream("Text", new StringReader(result.Text));
            result.RelevantFragments = highlighter.GetBestFragments(stream, result.Text, 3, "...");
        }

        //clean up
        analyzerInfo.Analyzer.Close();
        searcher.Close();

        return results;
    }
}

(Note: DVDoc is essentially just a struct which stores info about documents that were found. The method MapLuceneDocumentToData turns a Lucene Document into my custom DVDoc class, no magic there.)

And since everyone likes example inputs and outputs:

  • Example of GetBestFragments working
  • Example of GetBestFragments NOT working

I'm using Lucene.NET Version 2.9.4g.

asked May 23 '12 18:05 by ean5533


1 Answer

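By default the Highlighter will only process the first 51200 chars of a Document. If the terms matched by the query occur only after that point, the Highlighter never sees them and returns an empty string, even though the Searcher matched the document. That would explain why the same document yields fragments for some queries and not for others.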
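One way to check whether this limit is what you are hitting is to look at where the query term first appears in the stored text. The helper below is only a rough sketch: the class and method names are made up for illustration, it uses a raw substring search rather than the analyzer's tokens (so stemmed or otherwise normalized terms may not line up exactly), and it assumes the 51200 default mentioned above.

```csharp
using System;

// Hypothetical diagnostic helper, not part of Lucene.NET.
public static class HighlightLimitCheck
{
    // Default limit cited in the answer (50 * 1024 characters).
    public const int DefaultMaxDocCharsToAnalyze = 51200;

    // Returns true when the first occurrence of `term` in `text` starts at or
    // beyond `limit`, i.e. in the part of the text the Highlighter would skip.
    // Returns false when the term is not found at all.
    public static bool FirstMatchIsBeyondLimit(
        string text, string term, int limit = DefaultMaxDocCharsToAnalyze)
    {
        int index = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        return index >= limit;
    }

    public static void Main()
    {
        // Term appears only after 60000 filler characters: beyond the limit.
        string lateMatch = new string('x', 60000) + " lucene";
        // Term appears at the very start: well within the limit.
        string earlyMatch = "lucene " + new string('x', 60000);

        Console.WriteLine(FirstMatchIsBeyondLimit(lateMatch, "lucene"));  // True
        Console.WriteLine(FirstMatchIsBeyondLimit(earlyMatch, "lucene")); // False
    }
}
```

If this returns true for the documents that come back with blank fragments and false for the ones that work, the character limit is very likely the cause.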

To increase this limit, set the MaxDocCharsToAnalyze property.

http://lucene.apache.org/core/old_versioned_docs/versions/2_9_2/api/contrib-highlighter/org/apache/lucene/search/highlight/Highlighter.html#setMaxDocCharsToAnalyze(int)
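Applied to the code in the question, the change could look roughly like the fragment below. This is a sketch, not a drop-in fix: `_formatter`, `_fragmenter` and `query` are the names from the question, and the property form is an assumption based on the wording above. The Java documentation linked here shows a setter method, so a port that mirrors the Java API more closely may expose `SetMaxDocCharsToAnalyze(int)` instead.

```csharp
// Fragment intended to replace the highlighter setup inside the question's method;
// `_formatter`, `_fragmenter` and `query` are defined there.
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(_formatter, scorer);
highlighter.SetTextFragmenter(_fragmenter);

// Assumption: this build exposes the limit as a property.
// If it does not, try: highlighter.SetMaxDocCharsToAnalyze(int.MaxValue);
highlighter.MaxDocCharsToAnalyze = int.MaxValue;
```

Using `int.MaxValue` removes the limit for practical purposes. For very large documents this means the whole text is re-analyzed for every hit, so a smaller explicit cap may be a better trade-off if highlighting becomes slow.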

answered Sep 19 '22 12:09 by Jf Beaulac