Elasticsearch has a built-in "highlight" function which allows you to tag the matched terms in the results (more complicated than it might at first sound, since the query syntax can include near matches etc.).
I have HTML fields, and Elasticsearch stomps all over the HTML syntax when I turn on highlighting.
Can I make it HTML-aware / HTML-safe when highlighting in this way?
I'd like the highlighting to apply to the text in the HTML document, and not to highlight any HTML markup which has matched the search, i.e. a search for "p" might highlight <p>p</p> -> <p><mark>p</mark></p>.
My fields are indexed as "type: string".
The documentation says:
Encoder:
An encoder parameter can be used to define how highlighted text will be encoded. It can be either default (no encoding) or html (will escape html, if you use html highlighting tags).
.. but that HTML-escapes my already HTML-encoded field, breaking things further.
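To make the double-escaping problem concrete, here is a rough Python sketch of what an escaping encoder does to a stored HTML fragment. It is not Elasticsearch's code: `encode_fragment` and the `(text, is_match)` pairs are hypothetical, and Python's `html.escape` stands in for the real encoder, which may escape a different set of characters.

```python
import html

def encode_fragment(parts, pre="<tag1>", post="</tag1>"):
    """Escape every piece of the stored text, then wrap matched pieces in tags.

    `parts` is a list of (text, is_match) pairs; this shape is an assumption
    made for the illustration, not an Elasticsearch data structure.
    """
    out = []
    for text, is_match in parts:
        escaped = html.escape(text)  # escapes &, <, > and quotes
        out.append(pre + escaped + post if is_match else escaped)
    return "".join(out)

# The stored field is already HTML, and the term "p" matches the tag names.
parts = [
    ("<", False), ("p", True),
    (' class="text">TOP STORIES</', False), ("p", True),
    (">", False),
]
print(encode_fragment(parts))
# prints: &lt;<tag1>p</tag1> class=&quot;text&quot;&gt;TOP STORIES&lt;/<tag1>p</tag1>&gt;
```

The highlight tags survive, but the original markup is turned into entities, which is why the html encoder only helps when the stored field is plain text.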
Here are two example queries:
default encoder: The highlight tags are inserted inside other tags, i.e. <p> -> <<tag1>p</tag1>>:
curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags": ["<tag1>"],
    "post_tags": ["</tag1>"],
    "encoder": "default",
    "fields": {
      "preview_html": {}
    }
  },
  "from": 22, "size": 1
}'
GIVES:
...
"highlight" : {
"preview_html" : [ "<<tag1>p</tag1> class=\"text\">TOP STORIES</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Middle East</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Syria: Developments in Syria are main story in Middle East</<tag1>p</tag1>>" ]
}
...
html encoder: The existing HTML syntax is escaped by Elasticsearch, which breaks things, i.e. <p> -> &lt;<tag1>p</tag1>&gt;:
curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags": ["<tag1>"],
    "post_tags": ["</tag1>"],
    "encoder": "html",
    "fields": {
      "preview_html": {}
    }
  },
  "from": 22, "size": 1
}'
GIVES:
...
"highlight" : {
"preview_html" : [ "&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;TOP STORIES&lt;/<tag1>p</tag1>&gt;&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;Middle East&lt;/<tag1>p</tag1>&gt;&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;Syria: Developments in Syria are main story in Middle East&lt;/<tag1>p</tag1>&gt;" ]
}
}
...
Elasticsearch does not validate that highlight_query contains the search query in any way, so it is possible to define it such that legitimate query results are not highlighted. Generally, you should include the search query as part of the highlight_query. Combine matches on multiple fields to highlight a single field.
Elasticsearch supports three highlighters: unified, plain, and fvh (fast vector highlighter). You can specify the highlighter type you want to use for each field. The unified highlighter uses the Lucene Unified Highlighter.
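As a small illustration of those two points, the sketch below builds a search body that selects a highlighter per field and reuses the search query as the highlight_query. The helper name `build_search_body` is made up for this example; `type` and `highlight_query` are the highlight options as I understand them from the documentation, so check them against the version you run.

```python
import json

def build_search_body(field, terms, highlighter="unified"):
    """Build a search body whose highlight_query repeats the search query.

    `highlighter` is expected to be one of "unified", "plain" or "fvh".
    """
    search_query = {"match": {field: terms}}
    return {
        "query": search_query,
        "highlight": {
            "fields": {
                field: {
                    "type": highlighter,              # highlighter chosen per field
                    "highlight_query": search_query,  # includes the search query itself
                }
            }
        },
    }

body = build_search_body("preview_html", "p", highlighter="plain")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request, for example with the curl commands shown in the question.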
One way to achieve this is to use the html_strip char filter when analyzing the preview_html field.
This ensures that matches do not occur on HTML markup, so highlighting ignores the markup too, as shown in the example below.
Example:
PUT test
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "my_html": {
            "type": "html_strip"
          }
        },
        "analyzer": {
          "my_html": {
            "tokenizer": "standard",
            "char_filter": [
              "my_html"
            ],
            "type": "custom"
          }
        }
      }
    }
  }
}
PUT test/test/_mapping
{
  "properties": {
    "preview_html": {
      "type": "string",
      "analyzer": "my_html",
      "search_analyzer": "standard"
    }
  }
}
PUT test/test/1
{
  "preview_html": "<p> p </p>"
}
POST test/test/_search
{
  "query": {
    "match": {
      "preview_html": "p"
    }
  },
  "highlight": {
    "fields": {
      "preview_html": {}
    }
  }
}
"hits": [
  {
    "_index": "test",
    "_type": "test",
    "_id": "1",
    "_score": 0.30685282,
    "_source": {
      "preview_html": "<p> p </p>"
    },
    "highlight": {
      "preview_html": [
        "<p> <em>p</em> </p>"
      ]
    }
  }
]
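If it helps to see why this works, here is a toy Python model of the idea: the markup is hidden from the tokenizer, but token offsets still point into the original string, so the highlight tags land around text only. This is only a sketch under simplifying assumptions, not how Elasticsearch implements html_strip; the regexes and function names are made up for the illustration, and the naive tag pattern would break on attribute values that contain ">".

```python
import re

TAG_RE = re.compile(r"<[^>]*>")   # naive tag matcher, for illustration only
WORD_RE = re.compile(r"\w+")      # stand-in for a real tokenizer

def strip_tags_keep_offsets(text):
    """Blank out tags with spaces so remaining characters keep their offsets."""
    return TAG_RE.sub(lambda m: " " * len(m.group()), text)

def highlight(text, term, pre="<em>", post="</em>"):
    """Wrap whole-word matches of `term` in the text, never in the markup."""
    stripped = strip_tags_keep_offsets(text)
    out, last = [], 0
    for m in WORD_RE.finditer(stripped):
        if m.group().lower() == term.lower():
            out.append(text[last:m.start()])                   # untouched original
            out.append(pre + text[m.start():m.end()] + post)   # highlighted token
            last = m.end()
    out.append(text[last:])
    return "".join(out)

print(highlight("<p> p </p>", "p"))
# prints: <p> <em>p</em> </p>
```

The printed result matches the highlight fragment returned above: the "p" inside the tags is never tokenized, so it can neither match nor be wrapped.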