
ElasticSearch - Statistical facet on length of string field

I would like to retrieve statistics about a string field, such as the min, max, and average length (counting the number of characters in the string). My issue is that aggregations can only be used on numeric fields. I also tried a simple statistical facet,

    {
        "query": {
            "match_all": {}
        },
        "facets": {
            "stat1": {
                "statistical": {
                    "field": "title"
                }
            }
        }
    }

but I get shard failures and a SearchPhaseExecutionException. When I try a script field instead, the error returned is an OutOfMemoryError:

    {
        "query": {
            "match_all": {}
        },
        "script_fields": {
            "test1": {
                "script": "doc[\"title\"].value"
            }
        }
    }

Is it possible to retrieve such data about a simple "title" string field using curl? Thank you!

Asked by Crista23 on Mar 20 '23 at 21:03



1 Answer

I haven't actually tried the following, but I believe it should work.

First, some useful doc references:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-statistical-facet.html

In order to implement the statistical facet, the relevant field values are loaded into memory from the index. This means that per shard, there should be enough memory to contain them. Since by default, dynamic introduced types are long and double, one option to reduce the memory footprint is to explicitly set the types for the relevant fields to either short, integer, or float when possible.

I'm not sure how to set the type of the script field to 'short', which is probably what you want in order to reduce memory use. It should be possible, though.

Also: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-script-fields.html

It’s important to understand the difference between doc['my_field'].value and _source.my_field. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (can’t return a json object from it) and make sense only on non-analyzed or single term based fields.
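
To make the caveat in that quote concrete, here is a small Python sketch (plain Python, not Elasticsearch code) of why it matters for length statistics. It assumes that on an analyzed field doc['title'].value hands back a single indexed term rather than the whole string. The lowercase-and-split tokenizer, the choice of the first term in sorted order, and the sample titles are all stand-ins I made up for illustration, so treat the numbers as a demonstration of the effect and not as what your index would return.

```python
def analyzed_terms(title):
    # Assumption: a rough stand-in for an analyzer (lowercase, split on whitespace).
    return sorted(set(title.lower().split()))


def length_stats(lengths):
    # Mirrors the kind of summary the question asks for: count, min, max, mean.
    lengths = list(lengths)
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


# Hypothetical sample data, not taken from the question.
titles = ["Quick brown fox", "Elasticsearch"]

# Lengths of the whole strings, which is what the question is after.
whole = length_stats(len(t) for t in titles)

# Lengths of one term per document, which is what a term-based lookup
# would measure under the assumption stated above.
per_term = length_stats(len(analyzed_terms(t)[0]) for t in titles)

print(whole)
print(per_term)
```

If the two results differ on your data, the field is probably analyzed, and you would want a non-analyzed version of the field or the _source approach.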

Alternatively, use _source instead of doc, which reads the value from the stored document and does not cache the field's terms in memory.

This gives:

    {
        "query" : {
            "match_all" : {}
        },
        "facets" : {
            "stat1" : {
                "statistical" : {
                    "script" : "doc['title'].value.length()"
                }
            }
        }
    }

As an alternative that isn't cached, replace the script with "_source.title.length()".
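
If you would rather build and send the request from a script than type it into curl, a minimal Python sketch might look like the following. Like the query itself, I have not run it against a real cluster. The function names, the use_source flag, and the URL in the usage note are my own placeholders; only the request body shape comes from the facet above.

```python
import json
import urllib.request


def build_length_stats_request(field="title", use_source=False):
    # Build the statistical-facet body, with either the doc[...] script
    # or the uncached _source variant.
    if use_source:
        script = "_source.%s.length()" % field
    else:
        script = "doc['%s'].value.length()" % field
    return {
        "query": {"match_all": {}},
        "facets": {"stat1": {"statistical": {"script": script}}},
    }


def run_search(url, body):
    # Send the body as JSON (supplying data makes urllib issue a POST)
    # and decode the JSON response.
    data = json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Print the body so it can also be pasted into a curl -d '...' call.
body = build_length_stats_request()
print(json.dumps(body, indent=4))
```

To actually run it you would call something like run_search("http://localhost:9200/myindex/_search", body), where the host and the index name "myindex" are hypothetical and need to match your setup.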
Answered by Geert-Jan on Apr 06 '23 at 16:04
