I have a use case where I want to use Elasticsearch for real-time analytics. As part of that, I want to calculate some simple affinity scores.
These are currently defined as the number of transactions performed by a user base filtered by some criteria, compared with the number performed by the complete user base.
From my understanding, I'd need to do the following:
To get the "distinct transactions" for the filtered user base, I currently use a terms filter query with faceting, which returns all terms (transaction types). As far as I understand, I'd need to use this result as the input of a second terms filter query to get the result I want.
I read that there's a pull request on GitHub which seems to implement this (https://github.com/elasticsearch/elasticsearch/pull/3278), but it's not really obvious to me whether this is already usable in a current release or not.
If not, are there any workarounds I could use to implement this?
As additional info, here is my sample mapping:
curl -XPUT 'http://localhost:9200/store/user/_mapping' -d '
{
"user": {
"properties": {
"user_id": { "type": "integer" },
"gender": { "type": "string", "index" : "not_analyzed" },
"age": { "type": "integer" },
"age_bracket": { "type": "string", "index" : "not_analyzed" },
"current_city": { "type": "string", "index" : "not_analyzed" },
"relationship_status": { "type": "string", "index" : "not_analyzed" },
"transactions" : {
"type": "nested",
"properties" : {
"t_id": { "type": "integer" },
"t_oid": { "type": "string", "index" : "not_analyzed" },
"t_name": { "type": "string", "index" : "not_analyzed" },
"tt_id": { "type": "integer" },
"tt_name": { "type": "string", "index" : "not_analyzed" }
}
}
}
}
}'
So, for my actual desired result for my example Use Case, I'd have the following:
Here's a link to a runnable example:
http://sense.qbox.io/gist/9da6a30fc12c36f90ae39111a08df283b56ec03c
It presumes documents that look like:
{ "transaction_type" : "some_transaction", "user_base" : "some_user_base_id" }
The query is set to return no hits ("size": 0), since the aggregations take care of computing the stats you're looking for:
{
"size" : 0,
"query" : {
"match_all" : {}
},
"aggs" : {
"distinct_transactions" : {
"terms" : {
"field" : "transaction_type",
"size" : 20
},
"aggs" : {
"by_user_base" : {
"terms" : {
"field" : "user_base",
"size" : 20
}
}
}
}
}
}
And here's what the result looks like:
"aggregations": {
"distinct_transactions": {
"buckets": [
{
"key": "subscribe",
"doc_count": 4,
"by_user_base": {
"buckets": [
{
"key": "2",
"doc_count": 3
},
{
"key": "1",
"doc_count": 1
}
]
}
},
{
"key": "purchase",
"doc_count": 3,
"by_user_base": {
"buckets": [
{
"key": "1",
"doc_count": 2
},
{
"key": "2",
"doc_count": 1
}
]
}
}
]
}
}
So, inside of "aggregations", you'll have a list of "distinct_transactions". The key will be the transaction type, and the doc_count will represent the total transactions by all users.
Inside each "distinct_transactions" bucket there's "by_user_base", which is another terms aggregation nested under the first. As with the transactions, the key is the user base name (or ID, or whatever you index) and the doc_count is that user base's number of transactions of that type.
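If it helps, here is a minimal Python sketch of how the response above could be turned into a simple ratio per transaction type: each user base's count divided by the count for all users. The function name user_base_shares is hypothetical (it is not part of Elasticsearch), and the ratio is only one possible reading of "affinity score". Note that the sub-bucket counts only add up to the parent doc_count as long as the "size" limit doesn't cut off any user bases.

```python
# Hypothetical helper, not an Elasticsearch API: it only post-processes
# the "aggregations" part of the response shown above.
def user_base_shares(aggregations):
    """Map each transaction type to {user_base: share of that type's transactions}."""
    shares = {}
    for bucket in aggregations["distinct_transactions"]["buckets"]:
        total = bucket["doc_count"]  # transactions of this type by all users
        shares[bucket["key"]] = {
            sub["key"]: sub["doc_count"] / total
            for sub in bucket["by_user_base"]["buckets"]
        }
    return shares


# The same numbers as in the sample response above.
sample = {
    "distinct_transactions": {
        "buckets": [
            {"key": "subscribe", "doc_count": 4, "by_user_base": {"buckets": [
                {"key": "2", "doc_count": 3}, {"key": "1", "doc_count": 1}]}},
            {"key": "purchase", "doc_count": 3, "by_user_base": {"buckets": [
                {"key": "1", "doc_count": 2}, {"key": "2", "doc_count": 1}]}},
        ]
    }
}

shares = user_base_shares(sample)
print(shares["subscribe"]["2"])  # 0.75 (3 of the 4 "subscribe" transactions)
```

So user base "2" accounts for three quarters of the "subscribe" transactions, while user base "1" accounts for two of the three "purchase" transactions.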
Is that what you were looking to do? Hope I helped.
With the current version of Elasticsearch, there's the new significant_terms aggregation type, which can be used to calculate the affinity scores for my use case in a simpler way.
All the metrics relevant to me can then be calculated in one step, which is very nice!
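For anyone who wants to try this, here is a rough Python sketch of what such a request body could look like. It is an assumption on my side, not the exact query I run: it reuses the flat document shape from the other answer (the fields "transaction_type" and "user_base"), and the names build_affinity_query, affinity_scores and the aggregation name "affinity" are made up for illustration. The query selects the filtered user base as the foreground set, and significant_terms compares its term frequencies against the rest of the index.

```python
import json


def build_affinity_query(user_base_id, size=20):
    """Build a search body whose foreground set is a single user base.

    Field names are assumed from the flat example documents, not from
    the nested mapping in the question.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation
        "query": {"term": {"user_base": user_base_id}},
        "aggs": {
            "affinity": {
                "significant_terms": {"field": "transaction_type", "size": size}
            }
        },
    }


def affinity_scores(response):
    """Pull {transaction_type: score} out of a search response dict."""
    buckets = response["aggregations"]["affinity"]["buckets"]
    return {bucket["key"]: bucket["score"] for bucket in buckets}


body = build_affinity_query("1")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request. With the nested mapping from the question, the aggregation would presumably have to be wrapped in a nested aggregation on "transactions", which this sketch does not cover.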