
Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

When I use an analyzer with edgengram (min=3, max=7, front) + term_vector=with_positions_offsets

With a document having text = "CouchDB"

When I search for "couc"

My highlight is on "cou" and not "couc"
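For reference, the setup described above could be written roughly as the index settings below. This is only a sketch: the filter name `front_edge_ngram`, the analyzer name `edge_ngram_analyzer`, the type name `doc`, the field name `text`, and the choice of the `standard` tokenizer with a `lowercase` filter are all assumptions, since the question only gives min=3, max=7, front and the term_vector setting. The `string` field type and the `side` option match Elasticsearch versions from around the time of the question; later versions name these differently.

```python
import json

# Hypothetical names: front_edge_ngram, edge_ngram_analyzer, doc, text.
# Only min_gram=3, max_gram=7, side=front and term_vector come from the question.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "front_edge_ngram": {
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 7,
                    "side": "front",
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",  # assumed, not stated in the question
                    "filter": ["lowercase", "front_edge_ngram"],
                }
            },
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "text": {
                    "type": "string",
                    "analyzer": "edge_ngram_analyzer",
                    # Stores term vectors with positions and offsets per document.
                    "term_vector": "with_positions_offsets",
                }
            }
        }
    },
}

# Serialize to the JSON body that would be sent when creating the index.
settings_json = json.dumps(index_settings, indent=2)
print(settings_json)
```

Dropping the `term_vector` line from the field mapping gives the variant where the highlight comes out as expected.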


It seems the highlight covers only the shortest matching token, "cou", while I would expect it to cover the exact token (if possible) or at least the longest token found.
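To make the tokens involved concrete, here is a small pure-Python sketch of front edge n-grams with min=3 and max=7. It does not call Elasticsearch or Lucene and does not reproduce the highlighter; `edge_ngrams` is a hypothetical helper, and it assumes the query is analyzed with the same analyzer as the field.

```python
def edge_ngrams(token, min_gram=3, max_gram=7):
    """Return the front edge n-grams of a lowercased token, shortest first."""
    token = token.lower()
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


# Tokens produced for the indexed document text.
indexed = edge_ngrams("CouchDB")
# Tokens produced for the query, assuming it goes through the same analyzer.
query = edge_ngrams("couc")
# Query tokens that also exist in the document.
matching = [t for t in query if t in indexed]

print(indexed)   # ['cou', 'couc', 'couch', 'couchd', 'couchdb']
print(query)     # ['cou', 'couc']
print(matching)  # ['cou', 'couc']
```

Under these assumptions both "cou" and "couc" match, so the highlighter has two candidate tokens, and the reported highlight corresponds to the shorter one.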

Highlighting works fine when the field is indexed without term_vector=with_positions_offsets.

What is the performance impact of removing term_vector=with_positions_offsets?

Sebastien Lorber asked Jul 03 '12 02:07



1 Answer

Setting term_vector=with_positions_offsets on a field means that term vectors, including positions and offsets, are stored per document for that field.

When it comes to highlighting, term vectors allow you to use the Lucene fast vector highlighter, which is faster than the standard highlighter. The standard highlighter has no fast path because the index does not contain enough information (positions and offsets). It has to re-analyze the field content, collect the offsets and positions during analysis, and highlight based on that information. This can take quite a while, especially with long text fields.

With term vectors you do have enough information and don't need to re-analyze the text. The downside is the size of the index, which increases noticeably. That said, since Lucene 4.2 term vectors are better compressed and stored in an optimized way. There is also the new PostingsHighlighter, which relies on the ability to store offsets in the postings list and requires even less space.

Elasticsearch automatically picks the best highlighter based on the information available. If term vectors are stored, it uses the fast vector highlighter, otherwise the standard one. After you reindex without term vectors, highlighting will be done by the standard highlighter. It will be slower, but the index will be smaller.
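As a rough illustration, the search body below leaves the highlighter choice to Elasticsearch by default and optionally forces one through the highlight `type` setting (`plain` for the standard highlighter, `fvh` for the fast vector highlighter). This is a sketch: `build_search_body` is a hypothetical helper, the field name `text` is assumed, and whether the `match` query and the `type` option are available depends on the Elasticsearch version in use.

```python
import json


def build_search_body(field, query_text, highlighter_type=None):
    """Build a search request body with highlighting on one field.

    With highlighter_type=None the highlighter is chosen by Elasticsearch
    from what is stored in the index; "plain" or "fvh" requests one explicitly.
    """
    highlight_field = {}
    if highlighter_type is not None:
        highlight_field["type"] = highlighter_type
    return {
        "query": {"match": {field: query_text}},
        "highlight": {"fields": {field: highlight_field}},
    }


# Default: let Elasticsearch decide based on the stored information.
auto_body = build_search_body("text", "couc")
# Explicitly request the standard highlighter for comparison.
plain_body = build_search_body("text", "couc", highlighter_type="plain")

print(json.dumps(auto_body))
print(json.dumps(plain_body))
```

Comparing the highlighted fragments returned for the two bodies against the same index is one way to check whether the unexpected "cou" highlight comes from the fast vector highlighter.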

Regarding ngram fields, the described behaviour is odd: the fast vector highlighter should have better support for ngram fields, so I would expect exactly the opposite result.

javanna answered Sep 20 '22 05:09
