I have a temporary index with documents that I need to moderate. I want to group these documents by the words they contain.
For example, I have these documents:
1 - "aaa bbb ccc ddd eee fff"
2 - "bbb mmm aaa fff xxx"
3 - "hhh aaa fff"
So, I want to get the most popular words, ideally with counts: "aaa" - 3, "fff" - 3, "bbb" - 2, etc.
Is this possible with Elasticsearch?
A simple terms aggregation will meet your needs (where mydata is the name of your field):
curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "mydata_agg": {
            "terms": { "field": "mydata" }
        }
    }
}'
will return:
{
    "took" : 3,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits" : {
        "total" : 3,
        "max_score" : 0.0,
        "hits" : [ ]
    },
    "aggregations" : {
        "mydata_agg" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
                { "key" : "aaa", "doc_count" : 3 },
                { "key" : "fff", "doc_count" : 3 },
                { "key" : "bbb", "doc_count" : 2 },
                { "key" : "ccc", "doc_count" : 1 },
                { "key" : "ddd", "doc_count" : 1 },
                { "key" : "eee", "doc_count" : 1 },
                { "key" : "hhh", "doc_count" : 1 },
                { "key" : "mmm", "doc_count" : 1 },
                { "key" : "xxx", "doc_count" : 1 }
            ]
        }
    }
}
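A few notes if you are on a more recent Elasticsearch version: search_type=count has been removed and the same effect is achieved by setting "size" to 0, requests with a body need a Content-Type header, and document types no longer appear in the URL. A roughly equivalent request (a sketch, assuming the same index and field names as above) would be:
curl -XGET 'http://localhost:9200/test/_search?pretty' -H 'Content-Type: application/json' -d '{
    "size": 0,
    "aggs": {
        "mydata_agg": {
            "terms": { "field": "mydata" }
        }
    }
}'
Also note that on newer versions a terms aggregation on an analyzed text field only works if fielddata is enabled in the mapping (aggregating on a keyword sub-field would count whole values rather than individual words). A minimal mapping update for that, again assuming the field is called mydata:
curl -XPUT 'http://localhost:9200/test/_mapping?pretty' -H 'Content-Type: application/json' -d '{
    "properties": {
        "mydata": { "type": "text", "fielddata": true }
    }
}'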
It might be because this question and the accepted answer are some years old, but there is now a better way.
The accepted answer does not take into account that the most common words are usually uninteresting, e.g. stopwords such as "the", "a", "in", "for" and so on. This is usually the case for fields that contain data of type text rather than keyword.
This is why Elasticsearch actually has an aggregation specifically for this purpose: the significant_text aggregation.
From the docs: it is designed to be used on text fields. It can, however, take longer than other kinds of queries, so it is suggested to use it after filtering the data with a match query, or inside a previous aggregation of type sampler.
So, in your case you would send a query like this (leaving out the filtering/sampling):
{
    "aggs": {
        "keywords": {
            "significant_text": {
                "field": "myfield"
            }
        }
    }
}
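If you do want to include the filtering and sampling step the docs recommend, a minimal sketch could look like the following (the match query and the shard_size value are just illustrative choices, and myfield again stands in for your own field name):
{
    "query": {
        "match": { "myfield": "aaa" }
    },
    "aggs": {
        "my_sample": {
            "sampler": { "shard_size": 100 },
            "aggs": {
                "keywords": {
                    "significant_text": { "field": "myfield" }
                }
            }
        }
    }
}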