ElasticSearch Join Filter: Using subquery results as filter input possible?

Tags: join, filter, subquery, elasticsearch

I have a use case where I want to use ElasticSearch for realtime analytics. As part of that, I want to be able to calculate some simple affinity scores.

These scores are currently defined as the number of transactions performed by a user base filtered by some criteria, compared with the number performed by the complete user base.

From my understanding, I'd need to do the following:

1. Get the distinct transactions of my filtered user base
2. Query for these transaction (types) in the complete user base
3. Do the calculation (norming etc.)

To get the "distinct transactions" for the filtered user base, I currently use a Terms Filter Query with faceting, which returns all terms (transaction types). As far as I understand, I'd need to use this result as the input of a Terms Filter Query in the second step to get the result I want.

I read that there's a pull request on GitHub which seems to implement this (https://github.com/elasticsearch/elasticsearch/pull/3278), but it's not really obvious to me whether this is already usable in a current release or not.

If not, are there any workarounds I could use to implement this?

As additional info, here is my sample mapping:

curl -XPUT 'http://localhost:9200/store/user/_mapping' -d '
{
  "user": {
    "properties": {
      "user_id": { "type": "integer" },
      "gender": { "type": "string", "index" : "not_analyzed" },
      "age": { "type": "integer" },
      "age_bracket": { "type": "string", "index" : "not_analyzed" },
      "current_city": { "type": "string", "index" : "not_analyzed" },
      "relationship_status": { "type": "string", "index" : "not_analyzed" },
      "transactions" : {
        "type": "nested",
        "properties" : {
          "t_id": { "type": "integer" },
          "t_oid": { "type": "string", "index" : "not_analyzed" },
          "t_name": { "type": "string", "index" : "not_analyzed" },
          "tt_id": { "type": "integer" },
          "tt_name": { "type": "string", "index" : "not_analyzed" }
        }
      }
    }
  }
}'

So, for my example use case, the desired result would come from the following steps:

1. My filtered user base would have this example filter: "gender": "male" & "relationship_status": "single". For these users, I want to get the distinct transaction types (field "tt_name" of the nested document) and count the number of distinct user_ids.
2. Next, I want to query my complete user base (no filter other than the list of transaction types from 1.) and count the number of distinct user_ids.
3. Do the "affinity" calculations.
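
For illustration, the three steps above could be sketched on the client side roughly as follows. This is only a sketch under assumptions, not a confirmed solution: it assumes an Elasticsearch version that offers the nested, reverse_nested and filter aggregations and the bool query's filter clause, the aggregation names ("tx", "types", "users", "selected") and all function names are made up, and the affinity formula is just one possible normalization (share of users in the filtered base divided by share of users in the complete base). The code only builds request bodies and processes aggregation results; sending the requests is left out.

```python
from typing import Any, Dict, List

TYPE_FIELD = "transactions.tt_name"

# Per transaction type, count parent user documents via reverse_nested.
# Each user is one document, so that doc_count is a distinct-user count.
PER_TYPE: Dict[str, Any] = {
    "terms": {"field": TYPE_FIELD, "size": 100},
    "aggs": {"users": {"reverse_nested": {}}},
}


def step1_body() -> Dict[str, Any]:
    """Step 1: transaction types of the filtered user base."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"gender": "male"}},
            {"term": {"relationship_status": "single"}},
        ]}},
        "aggs": {"tx": {"nested": {"path": "transactions"},
                        "aggs": {"types": PER_TYPE}}},
    }


def step2_body(types: List[str]) -> Dict[str, Any]:
    """Step 2: complete user base, restricted to the types from step 1."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {"tx": {"nested": {"path": "transactions"}, "aggs": {
            "selected": {"filter": {"terms": {TYPE_FIELD: types}},
                         "aggs": {"types": PER_TYPE}}}}},
    }


def user_counts(types_agg: Dict[str, Any]) -> Dict[str, int]:
    """Map each transaction type to its distinct-user count."""
    return {b["key"]: b["users"]["doc_count"] for b in types_agg["buckets"]}


def affinity(filtered: Dict[str, int], overall: Dict[str, int],
             n_filtered: int, n_all: int) -> Dict[str, float]:
    """Step 3 (assumed formula): filtered share divided by overall share."""
    return {t: (filtered[t] / n_filtered) / (overall[t] / n_all)
            for t in filtered if overall.get(t)}
```

With this shape, the "types" aggregation of the first response is passed to user_counts, its keys become the input of step2_body, and the user totals for affinity would come from the hit counts of the two searches. A value above 1 would mean the filtered user base is over-represented for that transaction type.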


asked Feb 17 '14 15:02

Tobi

2 Answers

Here's a link to a runnable example:

http://sense.qbox.io/gist/9da6a30fc12c36f90ae39111a08df283b56ec03c

It presumes documents that look like:

{ "transaction_type" : "some_transaction", "user_base" : "some_user_base_id" }

The query sets "size" to 0 so that no hits are returned, since the aggregations take care of computing the stats you're looking for:

{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "distinct_transactions" : {
      "terms" : {
        "field" : "transaction_type",
        "size" : 20
      },
      "aggs" : {
        "by_user_base" : {
          "terms" : {
            "field" : "user_base",
            "size" : 20
          }
        }
      }
    }
  }
}

And here's what the result looks like:

  "aggregations": {
      "distinct_transactions": {
         "buckets": [
            {
               "key": "subscribe",
               "doc_count": 4,
               "by_user_base": {
                  "buckets": [
                     {
                        "key": "2",
                        "doc_count": 3
                     },
                     {
                        "key": "1",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "purchase",
               "doc_count": 3,
               "by_user_base": {
                  "buckets": [
                     {
                        "key": "1",
                        "doc_count": 2
                     },
                     {
                        "key": "2",
                        "doc_count": 1
                     }
                  ]
               }
            }
         ]
      }
   }

So, inside of "aggregations", you'll have a list of "distinct_transactions" buckets. Each key is a transaction type, and its doc_count is the total number of transactions of that type across all users.

Inside each "distinct_transactions" bucket, there's "by_user_base", which is a second terms aggregation nested inside the first (a sub-aggregation, not the "nested" mapping type). Just like the transactions, the key represents the user base name (or ID or whatever) and the doc_count represents that user base's number of transactions of that type.
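
As a rough sketch of how that result could be consumed, the snippet below walks the buckets and computes each user base's share of every transaction type. The "aggregations" data is copied from the response above; the function name shares_by_user_base is made up for this example, and the shares only add up to 1 if every document has exactly one user_base and the inner "size" covers all user bases.

```python
from typing import Any, Dict

# The "aggregations" section copied from the response shown above.
aggregations: Dict[str, Any] = {
    "distinct_transactions": {"buckets": [
        {"key": "subscribe", "doc_count": 4, "by_user_base": {"buckets": [
            {"key": "2", "doc_count": 3},
            {"key": "1", "doc_count": 1},
        ]}},
        {"key": "purchase", "doc_count": 3, "by_user_base": {"buckets": [
            {"key": "1", "doc_count": 2},
            {"key": "2", "doc_count": 1},
        ]}},
    ]}
}


def shares_by_user_base(aggs: Dict[str, Any]) -> Dict[str, Dict[str, float]]:
    """Per transaction type, each user base's fraction of the transactions."""
    shares: Dict[str, Dict[str, float]] = {}
    for tx in aggs["distinct_transactions"]["buckets"]:
        total = tx["doc_count"]  # all transactions of this type
        shares[tx["key"]] = {
            ub["key"]: ub["doc_count"] / total
            for ub in tx["by_user_base"]["buckets"]
        }
    return shares


for tx_type, per_base in shares_by_user_base(aggregations).items():
    print(tx_type, per_base)
```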

Is that what you were looking to do? Hope I helped.


answered Sep 18 '22 00:09

Ben at Qbox.io

With the current version of ElasticSearch, there's the new significant_terms aggregation type, which can be used to calculate the affinity scores for my use case in a simpler way.

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_significant_terms_demo.html#_recommending_based_on_statistics

All the metrics relevant to me can then be calculated in one step, which is very nice!
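
A minimal sketch of what such a request could look like for the mapping from the question, written as a Python dict so the body can be built and inspected without a cluster. It rests on assumptions: the bool query's filter clause needs a version that supports it (older releases would use a filtered query instead), the aggregation names "tx" and "significant_types" and both function names are made up, and because the aggregation runs inside a nested aggregation the counts should, as far as I understand, refer to nested transaction documents rather than users.

```python
from typing import Any, Dict, List, Tuple


def significant_types_body(gender: str, relationship_status: str) -> Dict[str, Any]:
    """Request body: transaction types that stand out in the filtered user base."""
    return {
        "size": 0,  # only the aggregation result is needed, no hits
        "query": {"bool": {"filter": [
            {"term": {"gender": gender}},
            {"term": {"relationship_status": relationship_status}},
        ]}},
        "aggs": {"tx": {
            "nested": {"path": "transactions"},
            "aggs": {"significant_types": {
                "significant_terms": {"field": "transactions.tt_name"}
            }},
        }},
    }


def summarize(response: Dict[str, Any]) -> List[Tuple[str, int, int, float]]:
    """Return (type, foreground count, background count, score) per bucket."""
    buckets = response["aggregations"]["tx"]["significant_types"]["buckets"]
    return [(b["key"], b["doc_count"], b["bg_count"], b["score"]) for b in buckets]
```

Each bucket pairs the count in the filtered set (doc_count) with the count in the background set (bg_count) and a score, which covers the "filtered versus complete user base" comparison in a single request.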

answered Sep 19 '22 00:09

Tobi
