Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ElasticSearch Join Filter: Using subquery results as filter input possible?

I have a Use Case where I want to use ElasticSearch for realtime analytics. Within that, I want to be able to calculate some simple affinity scores.

Those are currently defined using the number of transactions a filtered-by-criteria user base performs, compared with the complete user base.

From my understanding, I'd need to do the following:

  1. Get the distinct transactions of my filtered user base
  2. Query for these transaction (types) in the complete user base
  3. Do the calculation (norming etc.)

To get the "distinct transactions" for the filtered user base, I currently use a Terms Filter Query with faceting which returns all terms (transaction types). As far as I understand, I's need to use this result as input of a Terms Filter Query for the second step to be able to receive the result I want.

I read that there's a pull request on GitHub which seems to implement this (https://github.com/elasticsearch/elasticsearch/pull/3278), but it's not really obvious to me whether this is already usable in a current release or not.

If not, are there some workarounds how I could implement this?

As additional info, here is my sample mapping:

curl -XPUT 'http://localhost:9200/store/user/_mapping' -d '
{
  "user": {
    "properties": {
      "user_id": { "type": "integer" },
      "gender": { "type": "string", "index" : "not_analyzed" },
      "age": { "type": "integer" },
      "age_bracket": { "type": "string", "index" : "not_analyzed" },
      "current_city": { "type": "string", "index" : "not_analyzed" },
      "relationship_status": { "type": "string", "index" : "not_analyzed" },
      "transactions" : {
        "type": "nested",
        "properties" : {
          "t_id": { "type": "integer" },
          "t_oid": { "type": "string", "index" : "not_analyzed" },
          "t_name": { "type": "string", "index" : "not_analyzed" },
          "tt_id": { "type": "integer" },
          "tt_name": { "type": "string", "index" : "not_analyzed" },
        }
      }
    }
  }
}'

So, for my actual desired result for my example Use Case, I'd have the following:

  1. My filtered user base would have this example filter: "gender": "male" & "relationship_status": "single". For these, I want to get the distinct transaction types (field "tt_name" of the nested document) and count the number of distinct user_ids.
  2. Next, I want to query my complete user base (no filter other than the list of transaction types from 1.) and count the number of distinct user_ids
  3. Do the "affinity" calculations
like image 250
Tobi Avatar asked Feb 17 '14 15:02

Tobi


People also ask

Can we join two indexes in Elasticsearch?

One challenge is that Elasticsearch does not support joins between indexes. This means that we need to find another way to combine the data from the two indexes. Another challenge is that the data in each index may be structured differently. This can make it difficult to combine the data from the two indexes.

How do I join Elasticsearch?

Joining queriesedit Instead, Elasticsearch offers two forms of join which are designed to scale horizontally. Documents may contain fields of type nested . These fields are used to index arrays of objects, where each object can be queried (with the nested query) as an independent document.

Where is Elasticsearch?

You can find elasticsearch. yml in /usr/share/elasticsearch/config/elasticsearch. yml (Docker) or /etc/elasticsearch/elasticsearch.


2 Answers

Here's a link to a runnable example:

http://sense.qbox.io/gist/9da6a30fc12c36f90ae39111a08df283b56ec03c

It presumes documents that look like:

{ "transaction_type" : "some_transaction", "user_base" : "some_user_base_id" }

The query is set to return no results, since aggregations take care of computing the stats you're looking for:

{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "distinct_transactions" : {
      "terms" : {
        "field" : "transaction_type",
        "size" : 20
      },
      "aggs" : {
        "by_user_base" : {
          "terms" : {
            "field" : "user_base",
            "size" : 20
          }
        }
      }
    }
  }
}

And here's what the result looks like:

  "aggregations": {
      "distinct_transactions": {
         "buckets": [
            {
               "key": "subscribe",
               "doc_count": 4,
               "by_user_base": {
                  "buckets": [
                     {
                        "key": "2",
                        "doc_count": 3
                     },
                     {
                        "key": "1",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "purchase",
               "doc_count": 3,
               "by_user_base": {
                  "buckets": [
                     {
                        "key": "1",
                        "doc_count": 2
                     },
                     {
                        "key": "2",
                        "doc_count": 1
                     }
                  ]
               }
            }
         ]
      }
   }

So, inside of "aggregations", you'll have a list of "distinct_transactions". The key will be the transaction type, and the doc_count will represent the total transactions by all users.

Inside of each "distinct_transaction", there's "by_user_base", which is another terms agg (nested). Just like the transactions, the key will represent the user base name (or ID or whatever) and the doc_count will represent that unique user base's # of transactions.

Is that what you were looking to do? Hope I helped.

like image 199
Ben at Qbox.io Avatar answered Sep 18 '22 00:09

Ben at Qbox.io


With the current version of ElasticSerach, there's the new significant_terms aggregation type, which can be used to calculate the affinity scores for my use case in a more simple way.

  • http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_significant_terms_demo.html#_recommending_based_on_statistics

All the to me relevant metrics can then be calculated in one step, which is very nice!

like image 33
Tobi Avatar answered Sep 19 '22 00:09

Tobi