What would be the best way to query Elasticsearch in order to implement a date histogram representing the total number of unique visitors metric?
Considering the following data:
PUT /events
{
  "mappings": {
    "_doc": {
      "properties": {
        "userId": { "type": "keyword" },
        "eventDate": { "type": "date" }
      }
    }
  }
}
POST /events/_bulk
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "1" } }
{"userId": "1","eventDate": "2019-03-04T13:40:18.514Z"}
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "2" } }
{"userId": "2","eventDate": "2019-03-04T13:46:18.514Z"}
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "3" } }
{"userId": "3","eventDate": "2019-03-04T13:50:18.514Z"}
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "4" } }
{"userId": "1","eventDate": "2019-03-05T13:46:18.514Z"}
{ "index" : { "_index" : "events", "_type" : "_doc", "_id" : "5" } }
{"userId": "4","eventDate": "2019-03-05T13:46:18.514Z"}
Now, if I query the cardinality of the userId field I get the 4 distinct visitors.
POST /events/_search
{
  "size": 0,
  "aggs": {
    "visitors": {
      "cardinality": {
        "field": "userId"
      }
    }
  }
}
However, distributing the documents over a date histogram, I get a total sum of 5 because there's a repeated userId in both buckets.
POST /events/_search
{
  "size": 0,
  "aggs": {
    "visits_over_time": {
      "date_histogram": {
        "field": "eventDate",
        "interval": "1d"
      },
      "aggs": {
        "visitors": {
          "cardinality": {
            "field": "userId"
          }
        }
      }
    }
  }
}
Is there a way to filter out those repeated values? What would be the best way to accomplish this?
We faced the same issue in our code, and our solution was to use a terms aggregation on the userId field with a nested min aggregation on the date field. This gives you one bucket per userId, each containing the timestamp of that user's first visit. We run this aggregation outside of the date histogram and map the results into the histogram buckets manually afterwards.
"aggs": {
"UniqueUsers": {
"terms": {
"field": "userId",
"size": 1000,
}, "aggs": {
"FirstSeen": {
"min": {
"field": "date"
}
}
}
}
}
This works for us, but I am sure there should be a better implementation.
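One candidate, if your cluster runs Elasticsearch 7.4 or later (an assumption; the mappings in the question suggest an older 6.x cluster), is the cumulative_cardinality pipeline aggregation. It tracks the running number of distinct userId values across the histogram buckets, so the increase from one bucket to the next is the number of first-time visitors on that day. A minimal sketch:
POST /events/_search
{
  "size": 0,
  "aggs": {
    "visits_over_time": {
      "date_histogram": {
        "field": "eventDate",
        "calendar_interval": "1d"
      },
      "aggs": {
        "visitors": {
          "cardinality": { "field": "userId" }
        },
        "total_visitors": {
          "cumulative_cardinality": { "buckets_path": "visitors" }
        }
      }
    }
  }
}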
The user ids are repeated, but they occur on different days, so distributing them by day will make each id appear more than once unless you are looking at a specific day. Even then, if the same id occurs on the same day more than once, you may still have duplicate ids depending on how precise a time frame you are looking at. Since you are using one-day intervals, it is correct that the buckets add up to 5: on the 4th there were 3 ids, one of which is the duplicate, and the next day shows 2 different ids, one of which is the duplicate. If you increase the interval to a week or a month, those duplicates will be counted as one.
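For example, widening the interval so that both days fall into the same bucket (a monthly interval in this sketch, which simply reuses the query from the question) makes the cardinality sub-aggregation count userId 1 only once:
POST /events/_search
{
  "size": 0,
  "aggs": {
    "visits_over_time": {
      "date_histogram": {
        "field": "eventDate",
        "interval": "1M"
      },
      "aggs": {
        "visitors": {
          "cardinality": {
            "field": "userId"
          }
        }
      }
    }
  }
}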
I'm sure you have come across this, but give it another look, as it explains your exact use case: Link
Basically, it is returning all the unique visitors on a given day. If you don't care about individual users but just want to know how many there are, you need a different approach, perhaps a group-by style query (see the sketch below).
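In Elasticsearch, the closest equivalent to a SQL GROUP BY is a terms aggregation. A minimal sketch that groups the events by userId (the size of 100 is an arbitrary assumption; set it to however many distinct users you expect):
POST /events/_search
{
  "size": 0,
  "aggs": {
    "users": {
      "terms": {
        "field": "userId",
        "size": 100
      }
    }
  }
}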