I've been working on a Lucene document search program for the last few days and everything has been going well overall, until now. I'm trying to use the Lucene.Net.Highlight.Highlighter class to show relevant snippets for my search results, but it isn't working consistently. Most of the time, calling Highlighter.GetBestFragments() does exactly what I'd expect (it shows relevant text snippets containing the given query string), but sometimes it just returns an empty string.
I've triple checked my inputs and I can verify that the query string I'm using exists in the input text, but the highlighter just arbitrarily returns an empty string sometimes. The problem is reproducible; documents that have blank fragments returned will continue to have blank fragments returned when using the same query, while documents that have legitimate fragments continue to have legitimate fragments.
However, the problem is NOT document-specific. Some queries return valid fragments for a document where other queries return an empty string for the same document. The problem also does not appear to be related to my analyzer; it shows up whether I use a StandardAnalyzer or a SnowballAnalyzer.
After many hours of poking around I have been unable to find any pattern in the queries/documents that fail versus those that work. Keep in mind that this is happening on documents that were specifically pulled back from the Lucene index using the exact same query. That means the Searcher is able to find the relevant query string in the target document but the Highlighter is not.
Is this a bug in Lucene? If so, how can I work around it?
My code:
private static SimpleHTMLFormatter _formatter = new SimpleHTMLFormatter("<b>", "</b>");
private static SimpleFragmenter _fragmenter = new SimpleFragmenter(50);
...
{
    using (var searcher = new IndexSearcher(analyzerInfo.Directory, false))
    {
        QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "Text", analyzerInfo.Analyzer);
        parser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        //build query
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.Add(new TermQuery(new Term("PageNum", "0")), BooleanClause.Occur.MUST);
        booleanQuery.Add(parser.Parse(searchQuery), BooleanClause.Occur.MUST);
        Query query = booleanQuery.Rewrite(searcher.GetIndexReader());

        //get results from query
        ScoreDoc[] hits = searcher.Search(query, 50).ScoreDocs;
        List<DVDoc> results = hits.Select(hit => MapLuceneDocumentToData(searcher.Doc(hit.Doc))).ToList();

        //add relevant fragments to search results (shows WHY a certain result was chosen)
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(_formatter, scorer);
        highlighter.SetTextFragmenter(_fragmenter);
        foreach (DVDoc result in results)
        {
            TokenStream stream = analyzerInfo.Analyzer.TokenStream("Text", new StringReader(result.Text));
            result.RelevantFragments = highlighter.GetBestFragments(stream, result.Text, 3, "...");
        }

        //clean up
        analyzerInfo.Analyzer.Close();
        searcher.Close();
        return results;
    }
}
(Note: DVDoc is essentially just a struct which stores info about documents that were found. The method MapLuceneDocumentToData turns a Lucene Document into my custom DVDoc class, no magic there.)
And since everyone likes example inputs and outputs:
I'm using Lucene.NET Version 2.9.4g.
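By default, the Highlighter will only process the first 51200 chars of a document. If every match for the query falls beyond that point, the Highlighter never sees it and returns an empty string, even though the Searcher matched the document. To increase this limit, set the MaxDocCharsToAnalyze property.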
http://lucene.apache.org/core/old_versioned_docs/versions/2_9_2/api/contrib-highlighter/org/apache/lucene/search/highlight/Highlighter.html#setMaxDocCharsToAnalyze(int)
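To check whether a particular document/query pair is hitting this limit, a rough test is to see where the first occurrence of the search term starts in the raw text. The sketch below is only an approximation: the class and method names are hypothetical, it does not use Lucene at all, and it does a plain case-insensitive substring search, so it ignores whatever stemming or tokenizing your analyzer applies.

```csharp
using System;

// Hypothetical helper, not part of Lucene.NET.
public static class HighlightLimitCheck
{
    // Default limit quoted in the answer above.
    public const int DefaultMaxDocCharsToAnalyze = 51200;

    // Returns true when the term occurs in the text, but its first
    // (case-insensitive) occurrence starts at or beyond the limit.
    // Returns false when the term occurs earlier or not at all.
    public static bool FirstMatchIsBeyondLimit(
        string text, string term, int limit = DefaultMaxDocCharsToAnalyze)
    {
        int index = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        return index >= limit;
    }
}
```

If this returns true for the failing document/query pairs and false for the working ones, raising the limit is likely the fix. The answer names the MaxDocCharsToAnalyze property; on 2.9.x builds, where the question's code calls SetTextFragmenter(...), the same setting may instead be exposed as a SetMaxDocCharsToAnalyze(int) method mirroring the Java API linked above. That is an assumption, so check it against your assembly.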