Elasticsearch Histogram of visits

I'm quite new to Elasticsearch and I am failing to build a histogram based on ranges of visits. I am not even sure that it is possible to create this kind of chart with a single Elasticsearch query, but I have the feeling that it could be done with a pipeline aggregation or maybe a scripted aggregation.

Here is a test dataset with which I'm working:

PUT /test_histo
{ "settings": { "number_of_shards": 1 }}

PUT /test_histo/_mapping/visit
{
   "properties": {
      "user": {"type": "string" },
      "datevisit": {"type": "date"},
      "page": {"type": "string"}
   }
}

POST test_histo/visit/_bulk
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Jean","page":"productXX.hmtl","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Robert","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"media_center.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"media_center.html","datevisit":"2015-11-26"}

If we count the visits per user and consider the ranges [1,2[, [2,3[ and [3, inf.[,

the expected result (number of users per range) would be:

  • [1,2[ = 2
  • [2,3[ = 1
  • [3, inf.[ = 1
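To make the expected numbers explicit, here is a minimal Python sketch that derives them by hand from the seven test documents above. It has nothing to do with Elasticsearch itself; the names `visits`, `range_label` and `histogram` are only illustrative.

```python
from collections import Counter

# One entry per document in the bulk request above (the "user" field only).
visits = ["John", "Jean", "Robert", "Mary", "Mary", "John", "John"]

# Step 1: number of visits per user.
per_user = Counter(visits)


def range_label(count):
    """Map a visit count to one of the ranges [1,2[, [2,3[, [3,inf[."""
    if count >= 3:
        return "[3,inf["
    return "[%d,%d[" % (count, count + 1)


# Step 2: number of users falling into each range.
histogram = Counter(range_label(n) for n in per_user.values())

for label in ("[1,2[", "[2,3[", "[3,inf["):
    print(label, "=", histogram[label])
```

Jean and Robert visited once, Mary twice and John three times, which gives the counts listed above.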

All my attempts to build a histogram showing customer visit frequency have been unsuccessful so far. I would be grateful for any tips, tricks or ideas that could help solve my problem.

Warren Shore asked Nov 28 '15 10:11


1 Answer

There are two ways you can do it.

The first is to do it inside Elasticsearch, which requires a Scripted Metric Aggregation. You can read more about it in the Elasticsearch documentation for that aggregation.

Your query would look like this:

{
  "size": 0,
  "aggs": {
    "visitors_over_time": {
      "date_histogram": {
        "field": "datevisit",
        "interval": "week"
      },
      "aggs": {
        "no_of_visits": {
          "scripted_metric": {
            "init_script": "_agg['values'] = new java.util.HashMap();",
            "map_script": "if (_agg.values[doc['user'].value]==null) {_agg.values[doc['user'].value]=1} else {_agg.values[doc['user'].value]+=1;}",
            "combine_script": "someHashMap = new java.util.HashMap();for(x in _agg.values.keySet()) {value=_agg.values[x];if(value<3){key='[' + value +',' + (value + 1) + '[';}else{key='[3,inf[';}; if(someHashMap[key]==null){someHashMap[key] = 1}else{someHashMap[key] += 1}}; return someHashMap;"
          }
        }
      }
    }
  }
}

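You can change the period in the interval field of the date_histogram object to values such as day, week or month. Note that combine_script runs once per shard and no reduce_script is given, so the counts are only reliable when all of a user's visits are on the same shard, as in the single-shard test index from the question; the value array in the response holds one entry per shard.

Your response would look like this: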

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "visitors_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-11-23T00:00:00.000Z",
          "key": 1448236800000,
          "doc_count": 7,
          "no_of_visits": {
            "value": [
              {
                "[2,3[": 1,
                "[3,inf[": 1,
                "[1,2[": 2
              }
            ]
          }
        }
      ]
    }
  }
} 

The second method is to do the work of the scripted_metric on the client side, using the result of a Terms Aggregation. You can read more about it in the Elasticsearch documentation for that aggregation.

Your query would look like this:

GET test_histo/visit/_search
{
  "size": 0,
  "aggs": {
    "visitors_over_time": {
      "date_histogram": {
        "field": "datevisit",
        "interval": "week"
      },
      "aggs": {
        "no_of_visits": {
          "terms": {
            "field": "user",
            "size": 10
          }
        }
      }
    }
  }
}

and the response will be

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "visitors_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-11-23T00:00:00.000Z",
          "key": 1448236800000,
          "doc_count": 7,
          "no_of_visits": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "john",
                "doc_count": 3
              },
              {
                "key": "mary",
                "doc_count": 2
              },
              {
                "key": "jean",
                "doc_count": 1
              },
              {
                "key": "robert",
                "doc_count": 1
              }
            ]
          }
        }
      ]
    }
  }
}

From this response you can count, for each period, how many users fall into each range by bucketing each user's doc_count on the client side.
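As a rough illustration of that client-side step, here is a minimal Python sketch. It assumes the response has already been parsed into a dict (for example from the JSON body returned by any HTTP client); the function names `visit_range` and `frequency_histogram` are hypothetical, and the sample `response` is a trimmed copy of the one shown above.

```python
from collections import Counter


def visit_range(count):
    """Map a per-user visit count to one of the ranges from the question."""
    if count >= 3:
        return "[3,inf["
    return "[%d,%d[" % (count, count + 1)


def frequency_histogram(response):
    """Return {period: {range_label: number_of_users}} for each date bucket."""
    result = {}
    for period in response["aggregations"]["visitors_over_time"]["buckets"]:
        users = period["no_of_visits"]["buckets"]
        # Each terms bucket is one user; its doc_count is that user's visits.
        counts = Counter(visit_range(user["doc_count"]) for user in users)
        result[period["key_as_string"]] = dict(counts)
    return result


# Trimmed copy of the response above (only the fields the sketch reads).
response = {
    "aggregations": {
        "visitors_over_time": {
            "buckets": [
                {
                    "key_as_string": "2015-11-23T00:00:00.000Z",
                    "doc_count": 7,
                    "no_of_visits": {
                        "buckets": [
                            {"key": "john", "doc_count": 3},
                            {"key": "mary", "doc_count": 2},
                            {"key": "jean", "doc_count": 1},
                            {"key": "robert", "doc_count": 1},
                        ]
                    },
                }
            ]
        }
    }
}

print(frequency_histogram(response))
```

Keep in mind that the terms aggregation only returns the top `size` users per period (10 in the query above), so `size` has to be at least the number of distinct users for the histogram to be complete.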

Sumit answered Sep 29 '22 02:09