Aggregation on filtered, nested inner_hits query in ElasticSearch

Tags:

elasticsearch

I've only been using ElasticSearch for a few days. As a learning exercise, I implemented a rudimentary job scraper that aggregates jobs from a few job listing sites and populates an index with some data for me to play with.

My index contains a document for each website that lists jobs. A property of each of these documents is a 'jobs' array, which contains an object for each job that exists on that site. I am considering indexing each job as its own document (especially since the ElasticSearch documentation says that inner_hits is an experimental feature) but for now, I am trying to see if I can accomplish what I want to do using the inner_hits and nested features of ElasticSearch.

I am able to query, filter, and return only the matching jobs. However, I am not sure how to apply the same inner_hits constraints to an aggregation.

This is my mapping:

{
  "jobsitesIdx" : {
    "mappings" : {
      "sites" : {
        "properties" : {
          "createdAt" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "jobs" : {
            "type" : "nested",
            "properties" : {
              "company" : {
                "type" : "string"
              },
              "engagement" : {
                "type" : "string"
              },
              "link" : {
                "type" : "string",
                "index" : "not_analyzed"
              },
              "location" : {
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                  }
                }
              },
              "title" : {
                "type" : "string"
              }
            }
          },
          "jobscount" : {
            "type" : "long"
          },
          "sitename" : {
            "type" : "string"
          },
          "url" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

This is the query and aggregation that I am trying (from Node.js):

client.search({
  "index": 'jobsitesIdx',
  "type": 'sites',
  "body": {
    "aggs": {
      "jobs": {
        "nested": {
          "path": "jobs"
        },
        "aggs": {
          "location": { "terms": { "field": "jobs.location.raw", "size": 25 } },
          "company": { "terms": { "field": "jobs.company.raw", "size": 25 } }
        }
      }
    },
    "query": {
      "filtered": {
        "query": { "match_all": {} },
        "filter": {
          "nested": {
            "inner_hits": { "size": 1000 },
            "path": "jobs",
            "query": {
              "filtered": {
                "query": { "match_all": {} },
                "filter": {
                  "and": [
                    { "term": { "jobs.location": "york" } },
                    { "term": { "jobs.location": "new" } }
                  ]
                }
              }
            }
          }
        }
      }
    }
  }
}, function (error, response) {
  if (error) {
    console.error(error);
    return;
  }

  // Print every nested job that matched the filter.
  response.hits.hits.forEach(function (jobsite) {
    var jobs = jobsite.inner_hits.jobs.hits.hits;

    jobs.forEach(function (job) {
      console.log(job);
    });
  });

  // Print the location buckets from the nested aggregation.
  console.log(response.aggregations.jobs.location.buckets);
});

This gives me back all inner_hits of jobs in New York, but the aggregation is showing me counts for every location and company, not just the ones matching the inner_hits.

Any suggestions on how to get the aggregate on only the data contained in the matching inner_hits?

Edit: I am updating this to include an export of the mapping and index data, as requested. I exported this using Taskrabbit's elasticdump tool, found here: https://github.com/taskrabbit/elasticsearch-dump

The index: http://pastebin.com/WaZwBwn4

The mapping: http://pastebin.com/ZkGnYN94

The above linked data differs from the sample code in my original question in that the index is named jobsites6 in the data instead of jobsitesIdx as referred to in the question. Also, the type in the data is 'job' whereas in the code above it is 'sites'.

I've filled in the callback in the code above to display the response data. As expected, the forEach loop over the inner_hits shows only jobs in New York. However, I am seeing this aggregation for location:

[ { key: 'New York, NY', doc_count: 243 },
  { key: 'San Francisco, CA', doc_count: 92 },
  { key: 'Chicago, IL', doc_count: 43 },
  { key: 'Boston, MA', doc_count: 39 },
  { key: 'Berlin, Germany', doc_count: 22 },
  { key: 'Seattle, WA', doc_count: 22 },
  { key: 'Los Angeles, CA', doc_count: 20 },
  { key: 'Austin, TX', doc_count: 18 },
  { key: 'Anywhere', doc_count: 16 },
  { key: 'Cupertino, CA', doc_count: 15 },
  { key: 'Washington D.C.', doc_count: 14 },
  { key: 'United States', doc_count: 11 },
  { key: 'Atlanta, GA', doc_count: 10 },
  { key: 'London, UK', doc_count: 10 },
  { key: 'Ulm, Deutschland', doc_count: 10 },
  { key: 'Riverton, UT', doc_count: 9 },
  { key: 'San Diego, CA', doc_count: 9 },
  { key: 'Charlotte, NC', doc_count: 8 },
  { key: 'Irvine, CA', doc_count: 8 },
  { key: 'London', doc_count: 8 },
  { key: 'San Mateo, CA', doc_count: 8 },
  { key: 'Boulder, CO', doc_count: 7 },
  { key: 'Houston, TX', doc_count: 7 },
  { key: 'Palo Alto, CA', doc_count: 7 },
  { key: 'Sydney, Australia', doc_count: 7 } ]

Since my inner_hits are limited to those in New York, I can see that the aggregation is not on my inner_hits because it is giving me counts for all locations.

asked Sep 05 '15 15:09

mmccaff

1 Answer

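You can achieve this by adding the same filter to your aggregation so that it only includes New York jobs: wrap the two terms aggregations in a filter aggregation (named only_loc in the query below). Note that a terms filter with "new" and "york" matches jobs whose location contains either token, whereas the and filter in your question requires both.

Also note that your second aggregation uses jobs.company.raw, but in your mapping the jobs.company field has no not_analyzed sub-field named raw, so you will probably need to add one if you want to aggregate on the not-analyzed company name.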
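For the missing raw sub-field, a minimal sketch of the mapping fragment could look like the following. It simply mirrors the jobs.location definition from your mapping; companyRawMapping is a hypothetical helper name, and the type name sites is taken from your question.

```javascript
// Sketch only: companyRawMapping is a hypothetical helper, not part of any library.
// It builds a mapping body that gives jobs.company a not_analyzed "raw" sub-field,
// mirroring the existing jobs.location mapping from the question.
function companyRawMapping() {
  return {
    sites: {
      properties: {
        jobs: {
          type: "nested",
          properties: {
            company: {
              type: "string",
              fields: {
                raw: { type: "string", index: "not_analyzed" }
              }
            }
          }
        }
      }
    }
  };
}
```

Assuming you are on the legacy elasticsearch JavaScript client, this object could be passed as the body of client.indices.putMapping. Documents that are already indexed would most likely need to be reindexed before jobs.company.raw holds any values. The query below aggregates on the analyzed jobs.company field, so it works without this change.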

{
  "_source": [
    "sitename"
  ],
  "query": {
    "filtered": {
      "filter": {
        "nested": {
          "inner_hits": {
            "size": 1000
          },
          "path": "jobs",
          "query": {
            "filtered": {
              "filter": {
                "terms": {
                  "jobs.location": [
                    "new",
                    "york"
                  ]
                }
              }
            }
          }
        }
      }
    }
  },
  "aggs": {
    "jobs": {
      "nested": {
        "path": "jobs"
      },
      "aggs": {
        "only_loc": {
          "filter": {
            "terms": {
              "jobs.location": [
                "new",
                "york"
              ]
            }
          },
          "aggs": {
            "location": {
              "terms": {
                "field": "jobs.location.raw",
                "size": 25
              }
            },
            "company": {
              "terms": {
                "field": "jobs.company",
                "size": 25
              }
            }
          }
        }
      }
    }
  }
}
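From Node.js, one way to keep the query filter and the aggregation filter from drifting apart is to build the filter once and reuse it in both places. This is only a sketch: buildSearchBody and readBuckets are hypothetical helper names, and the body is the same request as above. Also note that the buckets now sit one level deeper, under only_loc, so the callback from the question would have to read them from there.

```javascript
// Sketch only: buildSearchBody and readBuckets are hypothetical helpers.
// The same filter object is used by the nested query and by the filter
// aggregation, so both always select the same nested jobs.
function buildSearchBody(locationTokens) {
  const locationFilter = { terms: { "jobs.location": locationTokens } };

  return {
    _source: ["sitename"],
    query: {
      filtered: {
        filter: {
          nested: {
            path: "jobs",
            inner_hits: { size: 1000 },
            query: { filtered: { filter: locationFilter } }
          }
        }
      }
    },
    aggs: {
      jobs: {
        nested: { path: "jobs" },
        aggs: {
          only_loc: {
            filter: locationFilter,
            aggs: {
              location: { terms: { field: "jobs.location.raw", size: 25 } },
              company: { terms: { field: "jobs.company", size: 25 } }
            }
          }
        }
      }
    }
  };
}

// Read the buckets of one of the sub-aggregations ("location" or "company").
// They are nested under jobs.only_loc, not directly under jobs.
function readBuckets(response, name) {
  return response.aggregations.jobs.only_loc[name].buckets;
}
```

With the client from the question, this would be used as client.search({ index: ..., type: ..., body: buildSearchBody(["new", "york"]) }, callback), and inside the callback readBuckets(response, "location") replaces response.aggregations.jobs.location.buckets.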

answered Nov 05 '22 02:11

Val
