Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cardinality over Date Histogram

What would be the best way to query Elasticsearch in order to implement a date histogram representing the total number of unique visitors metric?

Considering the following data:

PUT /events
{
"mappings" : {
        "_doc" : {
            "properties" : {
                "userId" : { "type" : "keyword" },
                "eventDate" : { "type" : "date" }
            }
        }
    }
}

POST /events/_bulk
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "1" } }
{"userId": "1","eventDate": "2019-03-04T13:40:18.514Z"}
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "2" } }
{"userId": "2","eventDate": "2019-03-04T13:46:18.514Z"}
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "3" } }
{"userId": "3","eventDate": "2019-03-04T13:50:18.514Z"}
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "4" } }
{"userId": "1","eventDate": "2019-03-05T13:46:18.514Z"}
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "5" } }
{"userId": "4","eventDate": "2019-03-05T13:46:18.514Z"}

Now, if I query the cardinality of the userId field I get the 4 distinct visitors.

POST /events/_search
{
    "size": 0,
    "aggs": {
        "visitors": {
            "cardinality": {
                "field": "userId"
            }
        }
    }
}

However, distributing the documents over a date histogram, I get a total sum of 5 because there's a repeated userId in both buckets.

POST /events/_search
{
    "size": 0,
    "aggs": {
        "visits_over_time": {
            "date_histogram": {
                "field": "eventDate",
                "interval": "1d"
            },
            "aggs": {
                "visitors": {
                    "cardinality": {
                        "field": "userId"
                    }
                }
            }
        }
    }
}

Is there a way to filter out those repeated values? What would be the best way to accomplish this?

like image 594
AndreLDM Avatar asked Mar 11 '19 12:03

AndreLDM


2 Answers

We faced the same issue in our code and our solution was to use a Terms Aggregation on the UserId Field with a nested Min Aggregation on the datetime field. This provides you with a bucket for each userId containing the Bucket with the first visit. We do this aggregation outside of the date histogram and map it manually afterwards.

"aggs": {
    "UniqueUsers": {
      "terms": {
        "field": "userId",
        "size": 1000,
      }, "aggs": {
        "FirstSeen": {
          "min": {
            "field": "date"
          }
        }
      }
    }
  }

This works for us, but i am sure there should be a better implementation.

like image 60
aHochstein Avatar answered Nov 03 '22 05:11

aHochstein


The user ids are repeated but they occur on different days so distributing them by days will make it occur more than once unless you are looking at a specific day. Even then, if the same id occurs on the same day more than once, you may still have duplicate ids depending on how precise of a time frame you are looking at. Since you are looking at one day intervals, it is correct that it returns 5 records and should say that on the 4th, there were 3 ids one of which is the duplicate and on the next day shows two records with two different ids one of which is the duplicate. If you increase the interval to a week or a month, those duplicates will be counted as one.

I'm sure you have come across this but give it another look as it is explaining your exact use case. Link

Basically, it is returning all the unique visitors on a given day. If you don't care about individual users but just want to know how many then you need a different approach. Perhaps a group by query

like image 32
RisingSun Avatar answered Nov 03 '22 05:11

RisingSun