Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Elasticsearch Aggregation by Day of Week and Hour of Day

I have documents of type:

[{"msg":"hello", date: "some-date"},{"msg":"hi!", date: "some-date"}, ...

I want to have the count of documents by day of week. For example x messages were sent on Monday and y were sent on Tuesday and so on.

I have used date_histogram with aggregation but it returns me the documents day wise. It does return me the day, but say "Wed, 22" and "Wed, 29" are returned as separate aggregation documents.

This is somewhat related to Elasticsearch - group by day of week and hour but there is no answer to that question so I am reposting it. According to the suggestion there it asks me to do term aggregation on key_as_string, but I need to add doc_count for every object instead of just count the terms. I also don't know how to use key_as_string in the nested aggregation.

This is what I have tried:

"aggs" : {
                "posts_over_days" : {
                    "date_histogram" : { 
                        "field" : "created_time", 
                        "interval": "day",
                        "format": "E" 
                    }
                }
like image 820
Sambhav Sharma Avatar asked Mar 11 '15 14:03

Sambhav Sharma


3 Answers

Re-post from my answer here: https://stackoverflow.com/a/31851896/6247

Does this help:

"aggregations": {
    "timeslice": {
        "histogram": {
            "script": "doc['timestamp'].value.getHourOfDay()",
            "interval": 1,
            "min_doc_count": 0,
            "extended_bounds": {
                "min": 0,
                "max": 23
            },
            "order": {
                "_key": "desc"
            }
        }
    }

This is nice, as it'll also include any hours with zero results, and, it'll extend the results to cover the entire 24 hour period (due to the extended_bounds).

You can use 'getDayOfWeek', 'getHourOfDay', ... (see 'Joda time' for more).

This is great for hours, but for days/months it'll give you a number rather than the month name. To work around, you can get the timeslot as a string - but, this won't work with the extended bounds approach, so you may have empty results (i.e. [Mon, Tues, Fri, Sun]).

In-case you want that, it is here:

"aggregations": {
    "dayOfWeek": {
        "terms": {
            "script": "doc['timestamp'].value.getDayOfWeek().getAsText()",
            "order": {
                "_term": "asc"
            }
        }
    }

Even if this doesn't help you, hopefully someone else will find it and benefit from it.

like image 153
RichS Avatar answered Nov 14 '22 20:11

RichS


The same kind of problem has been solved in this thread.

Adapting the solution to your problem, we need to make a script to convert the date into the hour of day and day of week:

Date date = new Date(doc['created_time'].value) ; 
java.text.SimpleDateFormat format = new java.text.SimpleDateFormat('EEE, HH');
format.format(date)

And use it in a query:

{
    "aggs": {
        "perWeekDay": {
            "terms": {
                "script": "Date date = new Date(doc['created_time'].value) ;java.text.SimpleDateFormat format = new java.text.SimpleDateFormat('EEE, HH');format.format(date)"
            }
        }
    }
}
like image 4
Heschoon Avatar answered Nov 14 '22 20:11

Heschoon


The simplest way would be to define a dedicated day-of-week field that holds only the day of the week for each document, then do a terms aggregation on that field.

If for whatever reason you don't want to do that (or can't), here is a hack that might help you get what you want. The basic idea is to define a "date.raw" sub-field that is a string, analyzed with the standard analyzer so that terms are created for each day of the week. Then you can aggregate on those terms to get your counts, using include to only include the terms you want.

Here is the mapping I used for testing:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1
   },
   "mappings": {
      "doc": {
         "properties": {
            "msg": {
               "type": "string"
            },
            "date": {
               "type": "date",
               "format": "E, dd MMM yyyy",
               "fields": {
                  "raw": {
                     "type": "string"
                  }
               }
            }
         }
      }
   }
}

and a few sample docs:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"msg": "hello","date": "Wed, 11 Mar 2015"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"msg": "hello","date": "Tue, 10 Mar 2015"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"msg": "hello","date": "Mon, 09 Mar 2015"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"msg": "hello","date": "Wed, 04 Mar 2015"}

and the aggregation and results:

POST /test_index/_search?search_type=count
{
    "aggs":{
        "docs_by_day":{
            "terms":{
                "field": "date.raw",
                "include": "mon|tue|wed|thu|fri|sat|sun"
            }
        }
    }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "docs_by_day": {
         "buckets": [
            {
               "key": "wed",
               "doc_count": 2
            },
            {
               "key": "mon",
               "doc_count": 1
            },
            {
               "key": "tue",
               "doc_count": 1
            }
         ]
      }
   }
}

Here is the code all together:

http://sense.qbox.io/gist/0292ddf8a97b2d96bd234b787c7863a4bffb14c5

like image 2
Sloan Ahrens Avatar answered Nov 14 '22 21:11

Sloan Ahrens