Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ElasticSearch circuit_breaking_exception (Data too large) with significant_terms aggregation

The query:

{
  "aggregations": {
    "sigTerms": {
      "significant_terms": {
        "field": "translatedTitle"
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  },
  "size": 0,
  "from": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "track_scores": false
}

Error:

{
  "bytes_limit": 6844055552,
  "bytes_wanted": 6844240272,
  "reason": "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [6844055552/6.3gb]",
  "type": "circuit_breaking_exception"
}

Index size is 5G. How much memory does the cluster need to execute this query?

like image 911
esp Avatar asked May 13 '16 17:05

esp


3 Answers

You can try to increase the request circuit breaker limit to 41% (default is 40%) in your elasticsearch.yml config file and restart your cluster:

indices.breaker.request.limit: 41%

Or if you prefer to not restart your cluster you can change the setting dynamically using:

curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent" : {
    "indices.breaker.request.limit" : "41%" 
  }
}'

Judging by the numbers showing up (i.e. "bytes_limit": 6844055552, "bytes_wanted": 6844240272), you're just missing ~190 KB of heap, so increasing by 1% to 41% you should get 17 MB of additional heap (your total heap = ~17GB) for your request breaker which should be sufficient.

Just make sure to not increase this value too high, as you run the risk of going OOM since the request circuit breaker also shares the heap with the fielddata circuit breaker and other components.

like image 148
Val Avatar answered Nov 11 '22 03:11

Val


Circuit breakers are designed to deal with situations when request processing needs more memory than available. You can set limit by using following query

PUT /_cluster/settings
{
  "persistent" : {
    "indices.breaker.request.limit" : "45%" 
  }
}

You can get more information on

https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-fielddata.html

like image 41
chetan varma Avatar answered Nov 11 '22 02:11

chetan varma


I am not sure what you are trying to do, but I'm curious to find out. Since you get that exception, I can assume the cardinality of that field is not small. You are basically trying to see, I guess, the relationships between all the terms in that field, based on significance.

The first significant_terms aggregation will consider all the terms from that field and establish how "significant" they are (calculating frequencies of that term in the whole index and then comparing those with the frequencies from the range query set of documents).

After it's doing that (for all the terms), you want a second significant_aggregation that should do the first step, but now considering each term and doing for it another significant_aggregation. That's gonna be painful. Basically, you are computing number_of_term * number_of_terms significant_terms calculations.

The big question is what are you trying to do?

If you want to see a relationship between all the terms in that field, that's gonna be expensive for the reasons explained above. My suggestion is to run a first significant_terms aggregation, take the first 10 terms or so and then run a second query with another significant_terms aggregation but limiting the terms by probably doing a parent terms aggregation and include only those 10 from the first query.

You can, also, take a look at sampler aggregation and use that as a parent for your only one significant terms aggregation.

Also, I don't think increasing the circuit breaker limit is the real solution. Those limits were chosen with a reason. You can increase that and maybe it will work, but it has to make you ask yourself if that's the right query for your use case (as it doesn't sound like it is). That limit value that it's in the exception might not be the final one... reused_arrays refers to an array class in Elasticsearch that is resizeable, so if more elements are needed, the array size is increased and you may hit the circuit breaker again, for another value.

like image 28
Andrei Stefan Avatar answered Nov 11 '22 03:11

Andrei Stefan