
Elasticsearch highlight matches in HTML without breaking syntax

Elasticsearch has a built-in "highlight" function which allows you to tag the matched terms in the results (more complicated than it might at first sound, since the query syntax can include near matches etc.).

I have HTML fields, and Elasticsearch stomps all over the HTML syntax when I turn on highlighting.

Can I make it HTML-aware / HTML-safe when highlighting in this way?

I'd like the highlighting to apply only to the text in the HTML document, and not to any HTML markup that happens to match the search, e.g. a search for "p" should turn <p>p</p> into <p><mark>p</mark></p>.

My fields are indexed as "type: string".

The documentation says:

Encoder:

An encoder parameter can be used to define how highlighted text will be encoded. It can be either default (no encoding) or html (will escape html, if you use html highlighting tags).

.. but that HTML-escapes my already HTML-encoded field, breaking things further.

Here are two example queries:

  1. Using the default encoder:

The highlight tags are inserted inside other tags, i.e. <p> -> <<tag1>p</tag1>>:

curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "encoder": "default",
    "fields": {
      "preview_html" : {}
    }
  },
  "from" : 22, "size" : 1
}'

GIVES:
...
      "highlight" : {
        "preview_html" : [ "<<tag1>p</tag1> class=\"text\">TOP STORIES</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Middle East</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Syria: Developments in Syria are main story in Middle East</<tag1>p</tag1>>" ]
      }

...
  2. Using the html encoder:

The existing HTML syntax is escaped by Elasticsearch, which breaks things, i.e. <p> -> &lt;<tag1>p</tag1>&gt;:

curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "encoder": "html",
    "fields": {
      "preview_html" : {}
    }
  },
  "from" : 22, "size" : 1
}'

GIVES:
...
      "highlight" : {
        "preview_html" : [ "&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;TOP STORIES&lt;&#x2F;<tag1>p</tag1>&gt;&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;Middle East&lt;&#x2F;<tag1>p</tag1>&gt;&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;Syria: Developments in Syria are main story in Middle East&lt;&#x2F;<tag1>p</tag1>&gt;" ]
        }
      }

...
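
Both failure modes can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not what Elasticsearch runs internally: it assumes the analyzer treats tag names as ordinary words, and it uses Python's html.escape as a stand-in for the html encoder (the output above shows Elasticsearch also escaping / as &#x2F;, which html.escape does not do). The function names are made up for illustration.

```python
import html
import re

PRE, POST = "<tag1>", "</tag1>"


def naive_highlight(source: str, term: str) -> str:
    """Wrap every whole-word occurrence of `term`, markup included."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return pattern.sub(lambda m: PRE + m.group() + POST, source)


def naive_highlight_escaped(source: str, term: str) -> str:
    """Same matching, but HTML-escape every piece of the source text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    parts, last = [], 0
    for m in pattern.finditer(source):
        parts.append(html.escape(source[last:m.start()]))
        parts.append(PRE + html.escape(m.group()) + POST)
        last = m.end()
    parts.append(html.escape(source[last:]))
    return "".join(parts)


doc = '<p class="text">TOP STORIES</p>'
default_result = naive_highlight(doc, "p")
escaped_result = naive_highlight_escaped(doc, "p")
print(default_result)  # tags land inside the markup
print(escaped_result)  # markup is escaped on top of that
```

In both cases the root problem is the same: the tag name p is matched as if it were document text, so the choice of encoder only changes how the damage looks.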
Rich asked Jul 26 '16 09:07



1 Answer

One way to achieve this is to use the html_strip char filter when analyzing the preview_html field.
This ensures that matches do not occur on HTML markup, so highlighting ignores the markup too, as shown in the example below.

Example:

put test
{
   "settings": {
      "index": {
         "analysis": {
            "char_filter": {
               "my_html": {
                  "type": "html_strip"
               }
            },
            "analyzer": {
               "my_html": {
                  "tokenizer": "standard",
                  "char_filter": [
                     "my_html"
                  ],
                  "type": "custom"
               }
            }
         }
      }
   }
}

put test/test/_mapping
{
   "properties": {
      "preview_html": {
         "type": "string",
         "analyzer": "my_html",
         "search_analyzer": "standard"
      }
   }
}
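
The mapping above uses "type": "string", which matches the question. As far as I know, Elasticsearch 5.0 and later replace the analyzed string type with text, so on a newer cluster the same idea would probably need a slightly different body. The sketch below only builds and prints the two request bodies as Python dicts; the variable names are illustrative, and nothing here has been run against a live cluster.

```python
import json

# Same analysis settings as in the index-creation request above.
settings_body = {
    "settings": {"index": {"analysis": {
        "char_filter": {"my_html": {"type": "html_strip"}},
        "analyzer": {"my_html": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": ["my_html"],
        }},
    }}}
}

# Assumption: on 5.x and later, "text" stands in for the analyzed "string" type.
mapping_body = {
    "properties": {
        "preview_html": {
            "type": "text",
            "analyzer": "my_html",
            "search_analyzer": "standard",
        }
    }
}

print(json.dumps(settings_body, indent=2))
print(json.dumps(mapping_body, indent=2))
```

On versions that no longer have mapping types, the mapping would be sent to test/_mapping rather than test/test/_mapping.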

put test/test/1
{
    "preview_html": "<p> p </p>"
}

post test/test/_search
{
   "query": {
      "match": {
         "preview_html": "p"
      }
   },
   "highlight": {
      "fields": {
         "preview_html": {}
      }
   }
}

Results

 "hits": [
         {
            "_index": "test",
            "_type": "test",
            "_id": "1",
            "_score": 0.30685282,
            "_source": {
               "preview_html": "<p> p </p>"
            },
            "highlight": {
               "preview_html": [
                  "<p> <em>p</em> </p>"
               ]
            }
         }
      ]
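
What the char filter buys can be sketched in a few lines of Python. This is an illustration of the idea, not Elasticsearch's implementation: it assumes that stripping markup while keeping character offsets lets the highlighter place its tags back into the original string. The regular expressions and the name highlight_outside_tags are made up for the sketch, and the naive tag pattern does not handle comments, entities, or a > inside an attribute value.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")   # naive: anything between < and >
WORD_RE = re.compile(r"\w+")      # rough stand-in for a tokenizer


def highlight_outside_tags(source: str, term: str,
                           pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap whole-word matches of `term` that lie outside markup."""
    # Blank out tags with spaces so offsets still line up with `source`.
    masked = TAG_RE.sub(lambda m: " " * len(m.group()), source)
    parts, last = [], 0
    for m in WORD_RE.finditer(masked):
        if m.group().lower() == term.lower():
            parts.append(source[last:m.start()])
            parts.append(pre + source[m.start():m.end()] + post)
            last = m.end()
    parts.append(source[last:])
    return "".join(parts)


result = highlight_outside_tags("<p> p </p>", "p")
print(result)
```

For the document indexed above this yields the same string as the highlight in the results, and a document such as <p class="text">TOP STORIES</p> comes back unchanged for the term p because no text token matches.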
keety answered Sep 17 '22 17:09