I am playing around with Elasticsearch to see whether it can cover most of my scenarios. I am at the point where I am stuck figuring out how to reach a certain result that is pretty simple in SQL.
Here is the example.
In Elasticsearch I have an index with these documents:
{ "Id": 1, "Fruit": "Banana", "BoughtInStore"="Jungle", "BoughtDate"=20160101, "BestBeforeDate": 20160102, "BiteBy":"John"}
{ "Id": 2, "Fruit": "Banana", "BoughtInStore"="Jungle", "BoughtDate"=20160102, "BestBeforeDate": 20160104, "BiteBy":"Mat"}
{ "Id": 3, "Fruit": "Banana", "BoughtInStore"="Jungle", "BoughtDate"=20160103, "BestBeforeDate": 20160105, "BiteBy":"Mark"}
{ "Id": 4, "Fruit": "Banana", "BoughtInStore"="Jungle", "BoughtDate"=20160104, "BestBeforeDate": 20160201, "BiteBy":"Simon"}
{ "Id": 5, "Fruit": "Orange", "BoughtInStore"="Jungle", "BoughtDate"=20160112, "BestBeforeDate": 20160112, "BiteBy":"John"}
{ "Id": 6, "Fruit": "Orange", "BoughtInStore"="Jungle", "BoughtDate"=20160114, "BestBeforeDate": 20160116, "BiteBy":"Mark"}
{ "Id": 7, "Fruit": "Orange", "BoughtInStore"="Jungle", "BoughtDate"=20160120, "BestBeforeDate": 20160121, "BiteBy":"Simon"}
{ "Id": 8, "Fruit": "Kiwi", "BoughtInStore"="Shop", "BoughtDate"=20160121, "BestBeforeDate": 20160121, "BiteBy":"Mark"}
{ "Id": 8, "Fruit": "Kiwi", "BoughtInStore"="Jungle", "BoughtDate"=20160121, "BestBeforeDate": 20160121, "BiteBy":"Simon"}
If I wanted to know how many fruits, bought in different stores, people bit in a specific date range, in SQL I would write something like this:
SELECT
    COUNT(DISTINCT kpi.Fruit) AS Fruits,
    kpi.BoughtInStore,
    kpi.BiteBy
FROM
(
    SELECT f1.Fruit, f1.BoughtInStore, f1.BiteBy
    FROM FruitsTable f1
    WHERE f1.BoughtDate = (
        SELECT MAX(f2.BoughtDate)
        FROM FruitsTable f2
        WHERE f1.Fruit = f2.Fruit
          AND f2.BoughtDate BETWEEN 20160101 AND 20160131
          AND (f2.BestBeforeDate BETWEEN 20160101 AND 20160131)
    )
) kpi
GROUP BY kpi.BoughtInStore, kpi.BiteBy
The result would be something like this:
{ "Fruits": 1, "BoughtInStore": "Jungle", "BiteBy"="Mark"}
{ "Fruits": 1, "BoughtInStore": "Shop", "BiteBy"="Mark"}
{ "Fruits": 2, "BoughtInStore": "Jungle", "BiteBy"="Simon"}
Do you have any idea how I can reach the same result in Elasticsearch with aggregations?
In a few words, the problems I am facing in Elasticsearch are:
How to prepare a subset of data before aggregation (like in this example, the latest row in the range per each Fruit)
How to group results by multiple fields
Thank you
As I understand it, there's no way to refer to an aggregation result in the filter of the same query. So you can only solve part of the puzzle with a single query:
GET /purchases/fruits/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "BoughtDate": {
            "gte": "2015-01-01", //assuming you have the right mapping for dates
            "lte": "2016-03-01"
          }
        }
      }
    }
  },
  "sort": { "BoughtDate": { "order": "desc" } },
  "aggs": {
    "byBoughtDate": {
      "terms": {
        "field": "BoughtDate",
        "order": { "_term": "desc" }
      },
      "aggs": {
        "distinctCount": {
          "cardinality": {
            "field": "Fruit"
          }
        }
      }
    }
  }
}
So you will have all the documents within the date range, and you will have aggregated bucket counts, sorted by term, so the max date will be at the top. The client can parse this first bucket (both count and value), and then take the documents for this date value. For the distinct fruit count you just use a nested cardinality aggregation.
Yep, the query returns much more info than you need, but that's life :)
Naturally there's no direct route from SQL to the Elasticsearch DSL, but there are some pretty common correlations.
For starters, any GROUP BY / HAVING is going to come down to an aggregation. The normal query semantics can generally be covered (and more so) by the Query DSL.
How to prepare a subset of data before aggregation (like in this example, the latest row in the range per each Fruit)
So, you're kind of asking for two different things.
How to prepare a subset of data before aggregation
This is the query phase.
(like in this example, the latest row in the range per each Fruit)
You're technically asking it to aggregate to get the answer to this example: it's not a normal query. In your example, you're doing a MAX to get this, which is in effect using a GROUP BY to get it.
How to group results by multiple fields
It depends. Do you want them tiered (generally, yes) or do you want them combined?
If you want them tiered, then you just use sub-aggregations to get what you want. If you want them combined, then you generally just use a filters aggregation for the different groupings.
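For example, here is a rough sketch of a filters aggregation that combines store and person into named buckets. The bucket names (jungle_simon, shop_mark) and the distinct_fruits sub-aggregation are purely illustrative; you would enumerate whichever combinations you actually care about:
{
  "size": 0,
  "aggs": {
    "store_and_person": {
      "filters": {
        "filters": {
          "jungle_simon": {
            "bool": {
              "filter": [
                { "term": { "BoughtInStore": "Jungle" } },
                { "term": { "BiteBy": "Simon" } }
              ]
            }
          },
          "shop_mark": {
            "bool": {
              "filter": [
                { "term": { "BoughtInStore": "Shop" } },
                { "term": { "BiteBy": "Mark" } }
              ]
            }
          }
        }
      },
      "aggs": {
        "distinct_fruits": {
          "cardinality": { "field": "Fruit" }
        }
      }
    }
  }
}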
Putting it all back together: You want the most recent purchase, per fruit, given a certain filtered date range. The date ranges are just normal queries / filters:
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "BoughtDate": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        },
        {
          "range": {
            "BestBeforeDate": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        }
      ]
    }
  }
}
With that, no document will be included in the request that is not within those date ranges for both fields (effectively an AND). Because I used a filter, it's unscored and cacheable.
Now, you need to start aggregating to get the rest of the information. Let's start by assuming the documents have been filtered using the above filter in order to simplify what we're looking at. We'll combine it at the end.
{
  "size": 0,
  "aggs": {
    "group_by_date": {
      "date_histogram": {
        "field": "BoughtDate",
        "interval": "day",
        "min_doc_count": 1
      },
      "aggs": {
        "group_by_store": {
          "terms": {
            "field": "BoughtInStore"
          },
          "aggs": {
            "group_by_person": {
              "terms": {
                "field": "BiteBy"
              }
            }
          }
        }
      }
    }
  }
}
You want "size" : 0
at the top level because you don't actually care about hits. You only want aggregated results.
Your first aggregation was actually grouping by the most recent date. I changed it a little to make it a bit more realistic (each day), but it's effectively the same. To mimic the way that you use MAX, we could use a terms aggregation with "size": 1, but this is truer to how you'd want to do it when a date (and presumably a time!) is involved. I also asked it to ignore days in the matching documents that have no data (since it's going from the start to the end, we don't actually care about those days).
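For reference, the terms-based alternative would look roughly like this (a sketch only; the latest_bought_date name is illustrative, and the _term order key is the ES 1.x/2.x spelling):
{
  "size": 0,
  "aggs": {
    "latest_bought_date": {
      "terms": {
        "field": "BoughtDate",
        "size": 1,
        "order": { "_term": "desc" }
      },
      "aggs": {
        "group_by_store": {
          "terms": { "field": "BoughtInStore" },
          "aggs": {
            "group_by_person": {
              "terms": { "field": "BiteBy" }
            }
          }
        }
      }
    }
  }
}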
If you really only wanted the last day, then you could use a pipeline aggregation to drop everything except the max bucket, but a realistic usage of this type of request would want the full date range.
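For completeness, here is one way that could look, assuming a newer release (the bucket_sort pipeline aggregation only exists in ES 6.1+, so it postdates the versions discussed here; the latest and keep_latest_only names are illustrative). The idea is to sort the day buckets by a max metric on BoughtDate and keep only the first one:
{
  "size": 0,
  "aggs": {
    "group_by_date": {
      "date_histogram": {
        "field": "BoughtDate",
        "interval": "day",
        "min_doc_count": 1
      },
      "aggs": {
        "latest": {
          "max": { "field": "BoughtDate" }
        },
        "keep_latest_only": {
          "bucket_sort": {
            "sort": [ { "latest": { "order": "desc" } } ],
            "size": 1
          }
        },
        "group_by_store": {
          "terms": { "field": "BoughtInStore" },
          "aggs": {
            "group_by_person": {
              "terms": { "field": "BiteBy" }
            }
          }
        }
      }
    }
  }
}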
So, we then continue by grouping by store, which is what you want. Then, we sub-group by person (BiteBy). This will give you the count implicitly.
Putting it all back together:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "BoughtDate": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        },
        {
          "range": {
            "BestBeforeDate": {
              "gte": "2016-01-01",
              "lte": "2016-01-31"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "group_by_date": {
      "date_histogram": {
        "field": "BoughtDate",
        "interval": "day",
        "min_doc_count": 1
      },
      "aggs": {
        "group_by_store": {
          "terms": {
            "field": "BoughtInStore"
          },
          "aggs": {
            "group_by_person": {
              "terms": {
                "field": "BiteBy"
              }
            }
          }
        }
      }
    }
  }
}
Note: Here's how I indexed the data.
PUT /grocery/store/_bulk
{"index":{"_id":"1"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-01","BestBeforeDate":"2016-01-02","BiteBy":"John"}
{"index":{"_id":"2"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-02","BestBeforeDate":"2016-01-04","BiteBy":"Mat"}
{"index":{"_id":"3"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-03","BestBeforeDate":"2016-01-05","BiteBy":"Mark"}
{"index":{"_id":"4"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-04","BestBeforeDate":"2016-02-01","BiteBy":"Simon"}
{"index":{"_id":"5"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-12","BestBeforeDate":"2016-01-12","BiteBy":"John"}
{"index":{"_id":"6"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-14","BestBeforeDate":"2016-01-16","BiteBy":"Mark"}
{"index":{"_id":"7"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-20","BestBeforeDate":"2016-01-21","BiteBy":"Simon"}
{"index":{"_id":"8"}}
{"Fruit":"Kiwi","BoughtInStore":"Shop","BoughtDate":"2016-01-21","BestBeforeDate":"2016-01-21","BiteBy":"Mark"}
{"index":{"_id":"9"}}
{"Fruit":"Kiwi","BoughtInStore":"Jungle","BoughtDate":"2016-01-21","BestBeforeDate":"2016-01-21","BiteBy":"Simon"}
It's critical that the string values you want to aggregate on (store and person) are not_analyzed strings (keyword in ES 5.0)! Otherwise it will use what's called fielddata, and that's not a good thing.
The mappings would look like this in ES 1.x / ES 2.x:
PUT /grocery
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "store": {
      "properties": {
        "Fruit": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BoughtInStore": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BiteBy": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BestBeforeDate": {
          "type": "date"
        },
        "BoughtDate": {
          "type": "date"
        }
      }
    }
  }
}
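If you're on ES 5.x instead, the equivalent mapping would use the keyword type in place of not_analyzed strings, roughly like this (a sketch of the same mapping, just adapted to the 5.x syntax):
PUT /grocery
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "store": {
      "properties": {
        "Fruit": { "type": "keyword" },
        "BoughtInStore": { "type": "keyword" },
        "BiteBy": { "type": "keyword" },
        "BestBeforeDate": { "type": "date" },
        "BoughtDate": { "type": "date" }
      }
    }
  }
}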
All of this together, and you get the answer as:
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_date": {
      "buckets": [
        {
          "key_as_string": "2016-01-01T00:00:00.000Z",
          "key": 1451606400000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "John",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-02T00:00:00.000Z",
          "key": 1451692800000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Mat",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-03T00:00:00.000Z",
          "key": 1451779200000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Mark",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-12T00:00:00.000Z",
          "key": 1452556800000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "John",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-14T00:00:00.000Z",
          "key": 1452729600000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Mark",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-20T00:00:00.000Z",
          "key": 1453248000000,
          "doc_count": 1,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Simon",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key_as_string": "2016-01-21T00:00:00.000Z",
          "key": 1453334400000,
          "doc_count": 2,
          "group_by_store": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Jungle",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Simon",
                      "doc_count": 1
                    }
                  ]
                }
              },
              {
                "key": "Shop",
                "doc_count": 1,
                "group_by_person": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "key": "Mark",
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}