ElasticSearch - return the complete value of a facet for a query

I've recently started using ElasticSearch and am working through some use cases. I have a problem with one of them.

I have indexed some users with their full name (e.g. "Jean-Paul Gautier", "Jean De La Fontaine").

I am trying to get all the full names matching a query.

For example, I want the 100 most frequent full names beginning with "J":

{
  "query": {
    "query_string": { "query": "full_name:J*" }
  },
  "facets": {
    "name": {
      "terms": {
        "field": "full_name",
        "size": 100
      }
    }
  }
}

The result I get is all the individual words of the full names: "Jean", "Paul", "Gautier", "De", "La", "Fontaine".

How can I get "Jean-Paul Gautier" and "Jean De La Fontaine" (all the full_name values beginning with 'J')? The "post_filter" option does not do this; it only restricts the subset above.

Do I have to:

  • configure how this full_name facet works?
  • add some options to the current query?
  • define some "mapping" (very obscure to me for the moment)?

Thanks

pierallard asked Jan 27 '14 15:01


2 Answers

You just need to set "index": "not_analyzed" on the field, and you will be able to get back the full, unmodified field values in your facet.

Typically, it's nice to have one version of the field that isn't analyzed (for faceting) and another that is (for searching). The "multi_field" field type is useful for this.

So in this case, I can define a mapping as follows:

curl -XPUT "http://localhost:9200/test_index/" -d'
{
   "mappings": {
      "people": {
         "properties": {
            "full_name": {
               "type": "multi_field",
               "fields": {
                  "untouched": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "full_name": {
                     "type": "string"
                  }
               }
            }
         }
      }
   }
}'

Here we have two sub-fields. The one with the same name as the parent will be the default, so if you search against the "full_name" field, Elasticsearch will actually use "full_name.full_name". "full_name.untouched" will give you the facet results you want.
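Applied to the original question, that split could look roughly like the sketch below: search on the analyzed "full_name" field and facet on "full_name.untouched". The helper name build_search_body and its anchored flag are hypothetical, introduced only for illustration. The anchored variant swaps in a prefix query on the un-analyzed sub-field; that is an assumption beyond what this answer describes, offered because "full_name:J*" on the analyzed field matches any word starting with "j", not only names that begin with it.

```python
import json


def build_search_body(prefix, size=100, anchored=False):
    """Build a search body that facets on the un-analyzed sub-field.

    build_search_body and `anchored` are hypothetical names for this sketch.
    """
    if anchored:
        # Assumption: a prefix query on the not_analyzed sub-field, which is
        # case-sensitive and matches only from the start of the whole value.
        query = {"prefix": {"full_name.untouched": prefix}}
    else:
        # The query from the question, run against the analyzed field.
        query = {"query_string": {"query": "full_name:%s*" % prefix}}
    return {
        "size": 0,  # only the facet is of interest, not the hits
        "query": query,
        "facets": {
            "name_untouched": {
                "terms": {"field": "full_name.untouched", "size": size}
            }
        },
    }


if __name__ == "__main__":
    # The printed JSON can be pasted into a curl -d'...' call like the ones here.
    print(json.dumps(build_search_body("J"), indent=2))
```

The facet only counts documents matched by the query, so either variant returns whole names; they differ in which documents are matched.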

So next I add two documents:

curl -XPUT "http://localhost:9200/test_index/people/1" -d'
{
   "full_name": "Jean-Paul Gautier"
}'

curl -XPUT "http://localhost:9200/test_index/people/2" -d'
{
   "full_name": "Jean De La Fontaine"
}'

And then I can facet on each field to see what is returned:

curl -XPOST "http://localhost:9200/test_index/_search" -d'
{
   "size": 0,
   "facets": {
      "name_terms": {
         "terms": {
            "field": "full_name"
         }
      },
      "name_untouched": {
         "terms": {
            "field": "full_name.untouched",
            "size": 100
         }
      }
   }
}'

and I get back the following:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "facets": {
      "name_terms": {
         "_type": "terms",
         "missing": 0,
         "total": 7,
         "other": 0,
         "terms": [
            {
               "term": "jean",
               "count": 2
            },
            {
               "term": "paul",
               "count": 1
            },
            {
               "term": "la",
               "count": 1
            },
            {
               "term": "gautier",
               "count": 1
            },
            {
               "term": "fontaine",
               "count": 1
            },
            {
               "term": "de",
               "count": 1
            }
         ]
      },
      "name_untouched": {
         "_type": "terms",
         "missing": 0,
         "total": 2,
         "other": 0,
         "terms": [
            {
               "term": "Jean-Paul Gautier",
               "count": 1
            },
            {
               "term": "Jean De La Fontaine",
               "count": 1
            }
         ]
      }
   }
}

As you can see, the analyzed field returns single-word, lower-cased tokens (when you don't specify an analyzer, the standard analyzer is used), and the un-analyzed sub-field returns the unmodified original text.
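If it helps to see why the two facets differ without a running cluster, here is a small Python approximation. It is only a sketch: rough_standard_tokens is a hypothetical stand-in that splits on non-word characters and lower-cases, which is much cruder than the real standard analyzer, and the two facet functions simply count terms per document.

```python
import re
from collections import Counter

# The two sample names indexed above.
FULL_NAMES = ["Jean-Paul Gautier", "Jean De La Fontaine"]


def rough_standard_tokens(text):
    """Crude stand-in for the standard analyzer (assumption, not the real one)."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def facet_analyzed(values):
    """Count, per term, how many values contain it after tokenizing."""
    counts = Counter()
    for value in values:
        # A term is counted once per document, even if it repeats inside it.
        counts.update(set(rough_standard_tokens(value)))
    return counts


def facet_not_analyzed(values):
    """Count whole, unmodified values, as a not_analyzed field would."""
    return Counter(values)


if __name__ == "__main__":
    print(sorted(facet_analyzed(FULL_NAMES).items()))
    print(sorted(facet_not_analyzed(FULL_NAMES).items()))
```

For these two names the analyzed counts line up with the "name_terms" facet above ("jean" twice, every other word once), while the not-analyzed counts keep each full name as a single term.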

Here is a runnable example you can play with: http://sense.qbox.io/gist/7abc063e2611846011dd874648fd1b77450b19a5

Sloan Ahrens answered Oct 21 '22 01:10


Try altering the mapping for "full_name":

"properties": {
  "full_name": {
     "type": "string",
     "index": "not_analyzed"
  }
  ...
}

not_analyzed means the value is indexed as is (capitals, spaces, dashes, etc.), so "Jean De La Fontaine" stays a single term and is not tokenized into "Jean", "De", "La", "Fontaine".

You can experiment with different analyzers using the _analyze API.

Notice what the standard one does to a multi-part name:

GET /_analyze?analyzer=standard
{'Jean Claude Van Dame'}


{
   "tokens": [
      {
         "token": "jean",
         "start_offset": 2,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "claude",
         "start_offset": 7,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "van",
         "start_offset": 14,
         "end_offset": 17,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "dame",
         "start_offset": 18,
         "end_offset": 22,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
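
To get a feel for how those tokens and offsets come about, here is a rough Python imitation. It is an approximation under stated assumptions: rough_analyze is a hypothetical helper that treats every run of word characters as a token and lower-cases it, which is far simpler than the real standard analyzer. Feeding it the same body text, including the leading {' characters, would explain why the first offset above is 2 rather than 0.

```python
import re


def rough_analyze(text):
    """Hypothetical imitation of the analyzer output: tokens with offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    # Same text as the request body above, braces and quotes included.
    for token in rough_analyze("{'Jean Claude Van Dame'}"):
        print(token)
```

A not_analyzed field skips this step entirely, so the whole string is stored as one term.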
mconlin answered Oct 21 '22 00:10