Highlight on Elasticsearch autocomplete

I have the following data to index in Elasticsearch:

[image: sample documents]

I want to implement an autocomplete feature, and highlight why a specific document matched a query.

These are the settings of my index:

{
    "settings": {
        "number_of_shards": 1, 
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 15
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter"
                    ]
                }
            }
        }
    }
}

Index Analyzing

  • Splits text on word boundaries.
  • Removes punctuation.
  • Lowercases each token.
  • Generates edge n-grams for each token.
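
The steps above can be approximated in plain Python to see which terms end up in the index. This is only a rough sketch, not Elasticsearch's actual implementation: the `\w+` regex stands in for the standard tokenizer, and `edge_ngrams` and `analyze` are hypothetical helper names.

```python
import re

def edge_ngrams(token, min_gram=1, max_gram=15):
    """Return the prefixes of `token` between min_gram and max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

def analyze(text, min_gram=1, max_gram=15):
    """Approximate the index-time chain: split on word boundaries,
    drop punctuation, lowercase, then edge n-gram each token."""
    # \w+ is a simplification of the standard tokenizer's Unicode rules.
    tokens = re.findall(r"\w+", text)
    terms = []
    for token in tokens:
        terms.extend(edge_ngrams(token.lower(), min_gram, max_gram))
    return terms

print(analyze("Software AG"))
# ['s', 'so', 'sof', 'soft', 'softw', 'softwa', 'softwar', 'software', 'a', 'ag']
```

With these terms in the index, a query term such as "soft" or "ag" matches any document containing a word that starts with it.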

So the inverted index looks like this:

[image: inverted index]

This is how I defined the mapping for the name field:

{
    "index_type": {
        "properties": {
            "name": {
                "type":     "string",
                "index_analyzer":  "autocomplete", 
                "search_analyzer": "standard" 
            }
        }
    }
}

When I query:

GET http://localhost:9200/index/type/_search

{
    "query": {
        "match": {
            "name": "soft"
        }
    },
    "highlight": {
        "fields" : {
            "name" : {}
        }
    }
}

Search for: soft

The standard analyzer turns the query into the single term "soft", which is then looked up in the inverted index. The search matches documents 1, 3, 4, 5, 6 and 7, which is correct, but I would expect only "soft" to be highlighted, not the whole word:

{
  "hits": [
    {
      "_source": {
        "name": "SoftwareRocks everytime"
      },
      "highlight": {
        "name": [
          "<em>SoftwareRocks</em> everytime"
        ]
      }
    },
    {
      "_source": {
        "name": "Software AG"
      },
      "highlight": {
        "name": [
          "<em>Software</em> AG"
        ]
      }
    },
    {
      "_source": {
        "name": "Software AG2"
      },
      "highlight": {
        "name": [
          "<em>Software</em> AG2"
        ]
      }
    },
    {
      "_source": {
        "name": "Op Software AG good software better"
      },
      "highlight": {
        "name": [
          "Op <em>Software</em> AG good <em>software</em> better"
        ]
      }
    },
    {
      "_source": {
        "name": "Op Software AG"
      },
      "highlight": {
        "name": [
          "Op <em>Software</em> AG"
        ]
      }
    },
    {
      "_source": {
        "name": "is soft ware ok"
      },
      "highlight": {
        "name": [
          "is <em>soft</em> ware ok"
        ]
      }
    }
  ]
}

Search for: software ag

The standard analyzer splits "software ag" into the terms "software" and "ag", which are then looked up in the inverted index. The search matches documents 1, 3, 4, 5 and 6, which is correct, but I would expect only "software" and "ag" to be highlighted, not the whole words that contain them:

{
  "hits": [
    {
      "_source": {
        "name": "Software AG"
      },
      "highlight": {
        "name": [
          "<em>Software</em> <em>AG</em>"
        ]
      }
    },
    {
      "_source": {
        "name": "Software AG2"
      },
      "highlight": {
        "name": [
          "<em>Software</em> <em>AG2</em>"
        ]
      }
    },
    {
      "_source": {
        "name": "Op Software AG"
      },
      "highlight": {
        "name": [
          "Op <em>Software</em> <em>AG</em>"
        ]
      }
    },
    {
      "_source": {
        "name": "Op Software AG good software better"
      },
      "highlight": {
        "name": [
          "Op <em>Software</em> <em>AG</em> good <em>software</em> better"
        ]
      }
    },
    {
      "_source": {
        "name": "SoftwareRocks everytime"
      },
      "highlight": {
        "name": [
          "<em>SoftwareRocks</em> everytime"
        ]
      }
    }
  ]
}

I read the highlighting documentation for Elasticsearch, but I cannot understand how the highlighting is performed. In both examples above, I expect only the token that matched in the inverted index to be highlighted, not the whole word. How can I highlight only the value that was passed in the query?
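
One possible explanation for the results above, offered as an assumption and not as something the documentation states in these words: an edge n-gram token filter emits every gram with the character offsets of the original token, and the highlighter wraps text by offset, so a match on the gram "soft" wraps the whole word. The sketch below models that behaviour in plain Python. `index_terms` and `highlight` are hypothetical names, and the regex only approximates the standard tokenizer.

```python
import re

def index_terms(text, min_gram=1, max_gram=15):
    """Yield (gram, start, end); every gram keeps the offsets of its source word."""
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        for n in range(min_gram, min(len(word), max_gram) + 1):
            yield word[:n], match.start(), match.end()

def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap each word that has a gram equal to one of the query terms."""
    query_terms = {t.lower() for t in re.findall(r"\w+", query)}
    # Collect the distinct offset spans of the matching grams.
    spans = sorted({(s, e) for gram, s, e in index_terms(text) if gram in query_terms})
    out, cursor = [], 0
    for start, end in spans:
        out.append(text[cursor:start])
        out.append(pre + text[start:end] + post)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

print(highlight("SoftwareRocks everytime", "soft"))
# <em>SoftwareRocks</em> everytime
```

Under this model the gram "soft" carries the span of the full word, which would reproduce the whole-word highlights shown in both result sets.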

Update

It seems that the autocomplete on the Elasticsearch website is implemented on the server side much like mine, but that the matched query is highlighted on the client. If that is how they do it, there may be no proper way to do this inside Elasticsearch, so I implemented the highlighting myself, on the server side rather than on the client side (as they seem to do).

My server-side implementation (in PHP) is:

public function search($term)
{
    $params = [
        'index' => $this->getIndexName(),
        'type' => $this->getIndexType(),
        'body' => [
            'query' => [
                'match' => [
                    'name' => $term
                ]
            ]
        ]
    ];

    $results = $this->client->search($params);

    $hits = $results['hits']['hits'];

    $data = [];

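    $wrapBefore = '<strong>';
    $wrapAfter = '</strong>';

    // Escape regex metacharacters so the user input is matched literally.
    $pattern = '/(' . preg_quote($term, '/') . ')/i';

    foreach ($hits as $hit) {
        $data[] = [
            $hit['_source']['id'],
            $hit['_source']['name'],
            // Wrap every case-insensitive occurrence of the term in the tag-stripped name.
            preg_replace($pattern, $wrapBefore . '$1' . $wrapAfter, strip_tags($hit['_source']['name']))
        ];
    }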

    return $data;
}

This outputs what I was aiming for with this question:

[image: autocomplete results with only the typed term highlighted]

I added a bounty to see whether there is a solution at the Elasticsearch level to achieve what I described above.

João Alves asked Nov 11 '16 15:11


1 Answer

As of now, with the latest version of Elasticsearch, this does not seem to be possible: the highlighting documentation does not mention any setting or query option for it. I checked the autocomplete on the Elastic website in the browser console, under the XHR requests tab, and found the following response for the keyword "att":

URL: https://search.elastic.co/suggest?q=att
    {
        "current_page": 1,
        "last_page": 4,
        "total_hits": 49,
        "hits": [
            {
                "tags": [],
                "url": "/elasticon/tour/2016/jp/not-attending",
                "section": "Elasticon",
                "title": "Not <em>Attending</em> - JP"
            },
            {
                "section": "Elasticon",
                "title": "<em>Attending</em> from Training - JP",
                "tags": [],
                "url": "/elasticon/tour/2016/jp/attending-training"
            },
            {
                "tags": [],
                "url": "/elasticon/tour/2016/jp/attending-keynote",
                "title": "<em>Attending</em> from Keynote - JP",
                "section": "Elasticon"
            },
            {
                "tags": [],
                "url": "/elasticon/tour/2016/not-attending",
                "section": "Elasticon",
                "title": "Thank You - Not <em>Attending</em>"
            },
            {
                "tags": [],
                "url": "/elasticon/tour/2016/attending",
                "section": "Elasticon",
                "title": "Thank You - <em>Attending</em>"
            },
            {
                "section": "Blog",
                "title": "What It's Like to <em>Attend</em> Elastic Training",
                "tags": [],
                "url": "/blog/what-its-like-to-attend-elastic-training"
            },
            {
                "tags": "Elasticsearch",
                "url": "/guide/en/elasticsearch/plugins/5.0/mapper-attachments-highlighting.html",
                "section": "Docs/",
                "title": "Highlighting <em>attachments</em>"
            },
            {
                "title": "<em>attachments</em> » email",
                "section": "Docs/",
                "tags": "Logstash",
                "url": "/guide/en/logstash/5.0/plugins-outputs-email.html#plugins-outputs-email-attachments"
            },
            {
                "section": "Docs/",
                "title": "Configuring Email <em>Attachments</em> » Actions",
                "tags": "Watcher",
                "url": "/guide/en/watcher/2.4/actions.html#configuring-email-attachments"
            },
            {
                "url": "/guide/en/watcher/2.4/actions.html#hipchat-action-attributes",
                "tags": "Watcher",
                "title": "HipChat Action <em>Attributes</em> » Actions",
                "section": "Docs/"
            },
            {
                "title": "Slack Action <em>Attributes</em> » Actions",
                "section": "Docs/",
                "tags": "Watcher",
                "url": "/guide/en/watcher/2.4/actions.html#slack-action-attributes"
            }
        ],
        "aggs": {
            "sections": [
                {
                    "Elasticon": 5
                },
                {
                    "Blog": 1
                },
                {
                    "Docs/": 43
                }
            ],
            "top_tags": [
                {
                    "XPack": 14
                },
                {
                    "Elasticsearch": 12
                },
                {
                    "Watcher": 9
                },
                {
                    "Logstash": 4
                },
                {
                    "Clients": 3
                },
                {
                    "Shield": 1
                }
            ]
        }
    }

On the frontend, however, only "att" is highlighted in the autosuggest results. So they must be handling the highlighting in the browser.
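
A display-layer highlighter of this kind could look roughly like the sketch below. It is written in Python for brevity, not browser JavaScript, and it is not the code used on the Elastic site. `highlight_prefix` is a hypothetical helper that drops the `<em>` tags returned by the service and wraps only the typed characters at the start of each word. It assumes the titles are plain text apart from those `<em>` tags.

```python
import html
import re

def highlight_prefix(title, typed, pre="<strong>", post="</strong>"):
    """Re-highlight `title` so only the typed prefix of each word is wrapped."""
    # Drop the whole-word <em> markers that came back from the search service.
    plain = re.sub(r"</?em>", "", title)
    terms = [re.escape(t) for t in re.findall(r"\w+", typed)]
    if not terms:
        return html.escape(plain)
    # Longest term first so "software" wins over "soft" when both were typed.
    terms.sort(key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(terms) + r")", re.IGNORECASE)
    out, cursor = [], 0
    for m in pattern.finditer(plain):
        # Escape everything that is not our own markup before emitting it.
        out.append(html.escape(plain[cursor:m.start()]))
        out.append(pre + html.escape(m.group()) + post)
        cursor = m.end()
    out.append(html.escape(plain[cursor:]))
    return "".join(out)

print(highlight_prefix("Not <em>Attending</em> - JP", "att"))
# Not <strong>Att</strong>ending - JP
```

Anchoring the pattern at a word boundary mirrors the edge n-gram matching on the server, so only word prefixes are wrapped and not occurrences in the middle of a word.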

user3775217 answered Oct 05 '22 00:10