
Elasticsearch getting too many results, need help filtering query

I'm having a lot of trouble understanding the underlying ES query system.

I've got the following query for example:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "referer": "www.xx.yy.com"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now",
              "lt": "now-1h"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "interval": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "0.5h"
      },
      "aggs": {
        "what": {
          "cardinality": {
            "field": "host"
          }
        }
      }
    }
  }
}

That request gets too many results and fails with this error:

"status" : 500, "reason" : "ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data for field [@timestamp] would be larger than limit of [3200306380/2.9gb]]; nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data for field [@timestamp] would be larger than limit of [3200306380/2.9gb]]; nested: CircuitBreakingException[Data too large, data for field [@timestamp] would be larger than limit of [3200306380/2.9gb]]; "

I've also tried this request:

{
  "size": 0,
  "filter": {
    "and": [
      {
        "term": {
          "referer": "www.geoportail.gouv.fr"
        }
      },
      {
        "range": {
          "@timestamp": {
            "from": "2014-10-04",
            "to": "2014-10-05"
          }
        }
      }
    ]
  },
  "aggs": {
    "interval": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "0.5h"
      },
      "aggs": {
        "what": {
          "cardinality": {
            "field": "host"
          }
        }
      }
    }
  }
}

I would like to filter the data so that I get a correct result. Any help would be much appreciated!

Alexandre Mélard asked Dec 04 '14 14:12



3 Answers

I found a solution, though it's kind of weird. I followed dimzak's advice and cleared the cache:

curl --noproxy localhost -XPOST "http://localhost:9200/_cache/clear"

Then I used filtering instead of querying as Olly suggested:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "term": {
          "referer": "www.xx.yy.fr"
        }
      },
      "filter": {
        "range": {
          "@timestamp": {
            "from": "2014-10-04T00:00",
            "to": "2014-10-05T00:00"
          }
        }
      }
    }
  },
  "aggs": {
    "interval": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "0.5h"
      },
      "aggs": {
        "what": {
          "cardinality": {
            "field": "host"
          }
        }
      }
    }
  }
}

I cannot accept both answers. I think dimzak deserves it most, but thumbs up to both of you :)

Alexandre Mélard answered Sep 21 '22 13:09



You can try clearing the cache first and then running the above query again.

Another solution may be to remove the interval or reduce the time range in your query.

My best bet would be to either clear the cache first or allocate more memory to Elasticsearch.
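
As a rough sketch of the first two suggestions, the Python below only builds the request pieces and never contacts a cluster. It assumes the clear-cache endpoint accepts a fielddata=true parameter to restrict clearing to field data (check this against the documentation for your version), and the helper names clear_fielddata_url and narrowed_range are made up for illustration.

```python
import json
from datetime import datetime, timedelta
from urllib.parse import urlencode


def clear_fielddata_url(base="http://localhost:9200", index=None):
    """Build the URL for a POST that clears only the field data cache.

    Assumption: the cluster honours the `fielddata=true` query parameter.
    """
    path = f"/{index}/_cache/clear" if index else "/_cache/clear"
    return f"{base}{path}?{urlencode({'fielddata': 'true'})}"


def narrowed_range(start, hours):
    """Return a range clause on @timestamp covering `hours` hours from `start`."""
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M"  # same date format as used elsewhere in this thread
    return {
        "range": {
            "@timestamp": {"gte": start.strftime(fmt), "lt": end.strftime(fmt)}
        }
    }


if __name__ == "__main__":
    print(clear_fielddata_url())
    print(json.dumps(narrowed_range(datetime(2014, 10, 4), 6), indent=2))
```

The URL would be sent with POST, as in the curl call from the accepted answer, and the range clause would take the place of the existing one in the query.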

dimzak answered Sep 21 '22 13:09



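Using a filter would improve performance. The range bounds are also swapped here relative to the question, since a window that starts at now and ends at now-1h is empty:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "term": {
          "referer": "www.xx.yy.com"
        }
      },
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "now-1h",
            "lt": "now"
          }
        }
      }
    }
  },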
  "aggs": {
    "interval": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "0.5h"
      },
      "aggs": {
        "what": {
          "cardinality": {
            "field": "host"
          }
        }
      }
    }
  }
}

You may also find that a date range aggregation works better than a date histogram, although you then need to define the buckets yourself.
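
Defining the buckets yourself could look roughly like the sketch below, which generates explicit half-hour from/to pairs for a date_range aggregation and keeps the same cardinality sub-aggregation. The function names are hypothetical, and the date format simply mirrors the one used in the accepted answer. Whether this actually lowers memory use on a given cluster is not something the code can show.

```python
import json
from datetime import datetime, timedelta


def half_hour_ranges(start, end, step=timedelta(minutes=30)):
    """List explicit {"from", "to"} buckets covering [start, end)."""
    fmt = "%Y-%m-%dT%H:%M"
    ranges = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # clip the last bucket at `end`
        ranges.append({"from": cursor.strftime(fmt), "to": upper.strftime(fmt)})
        cursor = upper
    return ranges


def date_range_aggs(start, end):
    """Aggregation body: distinct `host` count per explicit bucket."""
    return {
        "interval": {
            "date_range": {
                "field": "@timestamp",
                "ranges": half_hour_ranges(start, end),
            },
            "aggs": {"what": {"cardinality": {"field": "host"}}},
        }
    }


if __name__ == "__main__":
    body = {"size": 0,
            "aggs": date_range_aggs(datetime(2014, 10, 4), datetime(2014, 10, 5))}
    print(json.dumps(body, indent=2))
```

The resulting "aggs" object would replace the date_histogram version in the query above.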

Is the referer field being analysed, or do you want an exact match on it? If so, set it to not_analyzed.

Is there much cardinality in your hostname field? Have you tried pre-hashing the values?
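
To make the last two questions concrete, here is a hedged sketch of a mapping that stores referer unanalysed and adds a pre-computed hash sub-field to host. It assumes a 1.x-era cluster where the string type and the murmur3 field type are available (later versions changed both). The document type name "logs", the not_analyzed setting on host, and the helper names are illustrative assumptions, not taken from the question.

```python
import json


def log_mapping(doc_type="logs"):
    """Mapping body: exact-match referer, host with a murmur3 hash sub-field.

    Assumption: `doc_type` and the `hash` sub-field name are placeholders.
    """
    return {
        doc_type: {
            "properties": {
                "referer": {"type": "string", "index": "not_analyzed"},
                "host": {
                    "type": "string",
                    "index": "not_analyzed",
                    "fields": {"hash": {"type": "murmur3"}},
                },
            }
        }
    }


def hashed_cardinality():
    """Cardinality aggregation pointed at the pre-hashed sub-field."""
    return {"what": {"cardinality": {"field": "host.hash"}}}


if __name__ == "__main__":
    print(json.dumps(log_mapping(), indent=2))
    print(json.dumps(hashed_cardinality(), indent=2))
```

A mapping change like this only affects newly indexed documents, so existing data would need to be reindexed before the hashed field could be used.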

Olly Cruickshank answered Sep 21 '22 13:09
